NATURAL LANGUAGE PROCESSING: STYLOMETRY


WHO WROTE IT?

“On the far side of the river valley the road passed through a stark black burn. Charred and limbless trunks of trees stretching away on every side. Ash moving over the road and the sagging hands of blind wire strung from the blackened lightpoles whining thinly in the wind.”


STYLOMETRY

• Studying properties of the writers of documents based only on the linguistic style they exhibit, in particular using computational tools

• The best-known type of stylometric task: “Who wrote this document?”

• “Linguistic style” features: sentence length, word choices, syntactic structure, etc.

• Handwriting, content-based features, and contextual features are not considered


Applications of Stylometry

• Digital Humanities:
– Author attribution: identification of unknown authors
– Genre classification
– Historical study of language change (diachronic linguistics)
– Literary analysis

• But many other applications as well: forensics, anonymity, plagiarism …

“In some criminal, civil, and security matters, language can be evidence… When you are faced with a suspicious document, whether you need to know who wrote it, or if it is a real threat or real suicide note, or if it is too close for comfort to some other document, you need reliable, validated methods.”


Who wrote this?

“On the far side of the river valley the road passed through a stark black burn. Charred and limbless trunks of trees stretching away on every side. Ash moving over the road and the sagging hands of blind wire strung from the blackened lightpoles whining thinly in the wind.”

Cormac McCarthy


Authorship attribution

• Has been a topic of research since at least the mid-19th century (predates computers)

• Interest in
– resolving issues of disputed authorship
– identifying authorship of anonymous texts
– may be useful in detecting plagiarism, and authorship of computer viruses
– used in forensic settings, e.g. to detect genuine confessions


Classical examples

• Did Homer write both the Iliad and the Odyssey?
– both generally attributed to a single individual named “Homer”, but both are derived from a long oral tradition

• Did Paul write all the NT Letters of St Paul?
– Especially, the authorship of Hebrews has long been debated on theological grounds

• Plato developed his philosophy in the form of dialogues, putting his own doctrines into the mouth of Socrates, his teacher.
– Ascertaining the correct chronological order of these dialogues would help to understand how Plato developed his philosophy

• Did Shakespeare write all of his plays?
– Various authors including Bacon and Marlowe are said to have written parts or all of several plays
– “Shakespeare” may even be a nom-de-plume for a group of writers
– two more plays, Edward III and Two Noble Kinsmen, may have been written partly by Shakespeare


The Federalist Papers

• 85 articles published in 1787-88 with the aim of promoting the ratification of the new US constitution.

• written by three authors, Jay, Hamilton and Madison, under the pseudonym “Publius”

• Some are of known (and in some cases joint) authorship but 12 are disputed

• Pioneering stylometric methods were famously used by Mosteller and Wallace in the early 1960s to attempt to settle who wrote the disputed papers

• The question is now considered settled (Madison was the author of the disputed papers)

• The Federalist Papers present a difficult but solvable test case, and are seen as a benchmark to test new ideas


Some modern examples

• Similarities with private letters helped to identify the style of the Unabomber’s manifesto
– Unabomber Theodore Kaczynski perpetrated a number of bomb attacks on universities and airlines between 1978 and 1995
– Promised to stop if his 35,000-word anti-industrialist “manifesto” was published in major newspapers
– Distinctive writing style and turns of phrase enabled him to be identified

• Authorship of Primary Colors, an anonymously published work of fiction about preparations for the Democratic primaries which showed the Bill Clinton character in a bad light


Some modern examples

• Derek Bentley and his disputed murder ‘confession’ (1953)
– Bentley (an illiterate man of low IQ) and another man were involved in an armed robbery in which a policeman was shot
– Bentley was found guilty and hanged in January 1953
– In 1971 the author Yallop looked closely at the case
– As well as conflicting ballistic evidence and some procedural errors in the trial, Bentley’s statement was found to have been doctored by police:
– The contested statement used “then” every 58 words on average and repeatedly used “I then”.
– The Bank of English (BoE) corpus uses “then” every 500 words, and “then I” ten times more often than “I then”. Importantly, witness statement frequencies overall are similar to BoE.
– The police statement ‘genre’ of the time used “then” every 78 words, and typically used the “I then” form.
– Bentley’s conviction was quashed posthumously in 1998; the appeal was assisted by a linguistics professor
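The kind of counting behind the Bentley comparison can be illustrated with a minimal Python sketch. The function names and the sample text are invented for illustration (the sample is not Bentley's statement, and this is not the tooling used in the appeal); the sketch only measures how many words occur per “then” and how often “I then” and “then I” appear.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer (an assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def words_per_occurrence(text, target):
    """Average number of words per occurrence of `target` (None if it never occurs)."""
    tokens = tokenize(text)
    hits = tokens.count(target)
    return len(tokens) / hits if hits else None

def pair_count(text, first, second):
    """Number of times `first` is immediately followed by `second`."""
    tokens = tokenize(text)
    return sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (first, second))

# Invented sample text, for illustration only.
sample = "I then ran to the gate. I then saw the officer and then I stopped."

print(words_per_occurrence(sample, "then"))  # words per "then"
print(pair_count(sample, "i", "then"))       # occurrences of "I then"
print(pair_count(sample, "then", "i"))       # occurrences of "then I"
```

On a real statement, the first figure would be compared with the reference rates quoted above (every 58, 78 or 500 words), and the last two with each other.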


Five approaches to authorship attribution

• Physical evidence
– e.g. carbon dating and handwriting analysis, as in the case of the Hitler Diaries. Not relevant to linguistics/stylistics

• Historical evidence
– e.g. did Marlowe or Shakespeare write Edward III? It was published in 1596, 3 years after Marlowe’s death, but contains references to the defeat of the Armada (1588)
– “knowledge intensive”, not feasible for computers


Authorship attribution

• Cipher-based decryption
– idea that authors deliberately encode their names in the text
– especially widespread in Bible studies, but also in the Shakespeare-Bacon debate
– Penn (1987) used computer analysis to argue that Bacon had written a lot of Shakespeare’s plays
– easily debunked: see http://shakespeareauthorship.com/#5b: Ross showed that using the same techniques “proved” that Bacon also wrote Spenser’s Faerie Queene, the Bible, Caesar’s Gallic Wars, Hiawatha, Moby Dick and The Federalist Papers (see later)


Authorship attribution

• Manual analysis
– Much used in forensic linguistics
– Detailed analysis of unlimited linguistic traits
– Not suitable for computational analysis, but we’ll look at some examples later

• Computational stylometry


Computational stylometry

• Computational stylometry
– Involves counting things
– So can only look at what is easily countable

• Modern computational stylometry is based on machine learning: SVMs, genetic algorithms, neural networks, Bayesian classifiers… are used extensively.


Stylometry

• Assumes that the essence of the individual style of an author can be captured with reference to a number of quantitative criteria, called discriminators

• Obviously, some (many) aspects of style are conscious and deliberate
– as such they can be easily imitated and indeed often are
– many famous pastiches, either humorous or as a sort of homage

• Computational stylometry is focused on subconscious elements of style, which are less easy to imitate or falsify


Stylometry is not foolproof

• We should be aware of shortcomings
– Discriminators are mostly lexical, though some recent work has looked also at syntactic discriminators
– Authors’ styles change, either over time, or deliberately, e.g. when writing in different literary genres
– Many techniques rely on large quantities of data

• Most of the following techniques are better at dealing with closed questions
– Who wrote this, A or B?
– If A wrote these, did they also write this?
– How likely is it that A wrote this?
– but not Who wrote this?


Basic methodologies

• Word or sentence length: too obvious and easy to manipulate

• Frequencies of letter pairs: strangely successful, though limited

• Distribution of words of a given length (in syllables), especially relative frequencies, i.e. length of gaps between words of the same syllable length.
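As an illustration of the letter-pair idea, here is a minimal Python sketch that turns a text into relative frequencies of adjacent letter pairs. The function name is hypothetical, and counting pairs only inside words (not across spaces) is an assumption of this sketch rather than something the slides specify.

```python
from collections import Counter

def letter_pair_frequencies(text):
    """Relative frequency of each adjacent letter pair, counted within words."""
    counts = Counter()
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]  # drop punctuation and digits
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}

freqs = letter_pair_frequencies("the then there")
for pair, share in sorted(freqs.items(), key=lambda item: -item[1]):
    print(pair, round(share, 3))
```

Two texts can then be compared by looking at how far their frequency profiles differ on the most common pairs.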


How does it work? Linguistic Features

• Basic measurements: average syllable/word/sentence count, letter distribution, punctuation.

• Lexical density: Unique_Words / Total_Words

• Gunning-Fog Readability Index: 0.4 * ( Average_Sentence_Length + 100 * Complex_Word_Ratio )
– Result: years of formal education required to read the text.
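The two formulas above can be sketched in a few lines of Python. Two things here are assumptions not stated on the slide: a word is treated as “complex” if it has three or more syllables, and syllables are estimated by counting groups of consecutive vowels, which is only a rough heuristic. Function names are illustrative.

```python
import re

def count_syllables(word):
    """Rough estimate: number of vowel groups (heuristic, an assumption)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def lexical_density(words):
    """Unique_Words / Total_Words, as defined on the slide."""
    return len(set(words)) / len(words)

def gunning_fog(text):
    """0.4 * (Average_Sentence_Length + 100 * Complex_Word_Ratio)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # Assumption: "complex" means three or more (estimated) syllables.
    complex_ratio = sum(count_syllables(w) >= 3 for w in words) / len(words)
    average_sentence_length = len(words) / len(sentences)
    return 0.4 * (average_sentence_length + 100 * complex_ratio)

text = "The cat sat. The dog ran away."
words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
print(lexical_density(words))
print(gunning_fog(text))
```

Both functions expect non-empty text; a fuller version would guard against empty input and use a proper syllable dictionary.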


Vocabulary richness

• Based on the idea that an author’s vocabulary is more or less constant

• Various measures
– Type-token ratio
– Simpson’s index (the chance that two words arbitrarily chosen from the text will be the same)
– Yule’s K (assumes that the occurrence of a given word is a chance event that can be modelled as a Poisson distribution)
– Entropy (measure of uniformity)

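Simpson’s index:
\[ D = \sum_{r} V_r \, \frac{r}{N} \, \frac{r-1}{N-1} \]

Yule’s characteristic:
\[ K = 10^{4} \, \frac{\sum_{r} r^{2} V_r - N}{N^{2}} \]

Entropy:
\[ H = -100 \, \frac{\sum_{i} p_i \log p_i}{\log N} \]

where $V_r$ is the number of types that occur $r$ times, $N$ is the number of tokens, and $p_i$ is the probability of each type.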
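A minimal Python sketch of the four vocabulary-richness measures follows. It assumes the standard forms of the measures: Simpson's D as the sum over r of V_r (r/N)((r-1)/(N-1)), Yule's K as 10^4 (sum of r^2 V_r minus N) / N^2, and entropy as -100 times the sum of p_i log p_i, divided by log N, with p_i estimated as a type's count over N. The function name and the toy token list are illustrative only.

```python
import math
from collections import Counter

def vocabulary_richness(tokens):
    """Type-token ratio, Simpson's D, Yule's K and normalised entropy (needs N > 1)."""
    n = len(tokens)                            # N: number of tokens
    type_counts = Counter(tokens)              # how often each type occurs
    spectrum = Counter(type_counts.values())   # V_r: number of types occurring r times
    ttr = len(type_counts) / n
    simpson = sum(v * (r / n) * ((r - 1) / (n - 1)) for r, v in spectrum.items())
    yule_k = 1e4 * (sum(r * r * v for r, v in spectrum.items()) - n) / (n * n)
    # p_i is estimated as count / N for each type.
    entropy = -100 * sum((c / n) * math.log(c / n) for c in type_counts.values()) / math.log(n)
    return {"ttr": ttr, "simpson_d": simpson, "yule_k": yule_k, "entropy": entropy}

measures = vocabulary_richness("a a b c".split())
for name, value in measures.items():
    print(name, round(value, 4))
```

For the toy input, Simpson's D can be checked by hand: of the six possible pairs of tokens, only one pair (the two occurrences of "a") matches.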


The Federalist Papers

• 85 papers arguing for the adoption of the US constitution

• written by three authors (Jay, Hamilton, Madison)
– 5 authored by Jay
– 51 authored by Hamilton
– 14 authored by Madison
– 3 jointly by Hamilton and Madison
– authorship of 12 of them disputed (Hamilton or Madison?)

• Mosteller and Wallace (1964) employed function words such as prepositions, conjunctions, and articles as discriminators.
– e.g., the word “upon” averaged 3.24 appearances per 1,000 words in the known writings of Hamilton but only 0.23 in the writings of Madison
– 30 “marker words” identified as discriminative of the two contested authors, including: upon, whilst, there, on, while, vigor, by, consequently, would, voice
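Rates such as “3.24 appearances per 1,000 words” are simple to compute. The sketch below is a minimal Python illustration using the ten marker words listed on the slide; the function name, the tokenizer and the sample sentence are assumptions made for the example, not part of Mosteller and Wallace's procedure.

```python
import re
from collections import Counter

# The ten marker words quoted on the slide.
MARKERS = ["upon", "whilst", "there", "on", "while",
           "vigor", "by", "consequently", "would", "voice"]

def rates_per_thousand(text, markers=MARKERS):
    """Occurrences of each marker word per 1,000 words of `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {m: 1000 * counts[m] / len(tokens) for m in markers}

# Invented sentence, for illustration only.
sample = "Upon reflection, the plan would depend upon the vote."
rates = rates_per_thousand(sample)
print({word: round(rate, 1) for word, rate in rates.items() if rate > 0})
```

On real data the rates would be computed over each author's known writings and then compared with the rates in a disputed paper.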


Bayesian probability

• Bayes’ theorem reconciles prior hypotheses (in this case based on historical observation) with conditional probabilities based on measurements

• If the prior hypothesis (e.g. that there is a 1:3 chance that Madison wrote the paper) is confirmed by the measurements (e.g. of features associated with Madison’s style), the result will be neutral

• If the prior hypothesis is contradicted by the measurements, the result will be much more striking
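To make the update concrete, here is a minimal Python sketch for a single marker word. It uses the rates quoted for “upon” (3.24 per 1,000 words for Hamilton, 0.23 for Madison) and assumes, purely for illustration, that the count of the word in a paper follows a Poisson distribution. The prior of 0.25 is one hypothetical reading of the “1:3” example, and the function names are invented. This is not presented as Mosteller and Wallace's actual model.

```python
import math

def poisson_pmf(k, mean):
    """Probability of observing k events when `mean` are expected."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def posterior_madison(k, n_words, prior_madison=0.5, rate_h=3.24, rate_m=0.23):
    """P(Madison | k occurrences of the marker in n_words), by Bayes' theorem."""
    like_m = poisson_pmf(k, rate_m * n_words / 1000)   # P(data | Madison)
    like_h = poisson_pmf(k, rate_h * n_words / 1000)   # P(data | Hamilton)
    evidence = prior_madison * like_m + (1 - prior_madison) * like_h
    return prior_madison * like_m / evidence

# A hypothetical 2,000-word paper with no "upon" at all, then with seven.
p_none = posterior_madison(0, 2000, prior_madison=0.25)
p_seven = posterior_madison(7, 2000, prior_madison=0.25)
print(round(p_none, 4), p_seven)
```

Even with a prior that favours Hamilton, the absence of “upon” pushes the posterior strongly towards Madison, while seven occurrences push it the other way.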


Cumulative sum charts

• Method
– Assume authorial “fingerprints” such as percentage of short words, or words beginning with a vowel
– Put two texts together and plot the number of items per sentence against the cumulative average
– If the graph has a sharp divergence at the point where the texts are joined, this shows the authors differ

• Highly controversial
– Interpretation of graphs very subjective
– But much used in courts!

• Weighted cusum
– Slightly sounder footing statistically: eliminates need for subjective judgment
– Still not very accurate compared to other measures
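The series behind a cusum chart can be sketched in Python as follows. This uses the usual cumulative-sum construction (a running sum of deviations from the mean), applied once to sentence lengths and once to a per-sentence “habit” count; treating words of three letters or fewer as the habit is an assumption chosen for the example, and the function names are hypothetical. Plotting is left out.

```python
import re

def cusum(values):
    """Running sum of deviations from the mean of `values`."""
    mean = sum(values) / len(values)
    series, running = [], 0.0
    for v in values:
        running += v - mean
        series.append(running)
    return series

def sentence_profiles(text, max_short_len=3):
    """Per sentence: number of words, and number of short words (the assumed habit)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths, habits = [], []
    for sentence in sentences:
        words = re.findall(r"[A-Za-z']+", sentence)
        lengths.append(len(words))
        habits.append(sum(len(w) <= max_short_len for w in words))
    return lengths, habits

lengths, habits = sentence_profiles("I am here. You ran away quickly.")
print(cusum(lengths))
print(cusum(habits))
```

In a chart, the two series are drawn on the same axes (suitably scaled) and inspected for points where they pull apart.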


Multivariate analysis

• Thanks to computers it is now possible to collect large numbers of different measurements, of a variety of features

• Variants of multivariate analysis
– Cluster analysis
– Correspondence analysis
– Principal components analysis


Cluster analysis

• Group objects according to their similarity with respect to a given feature

• Produces a tree diagram or “dendrogram”
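A minimal Python sketch of one clustering scheme is given below: agglomerative clustering with single linkage, which repeatedly merges the two closest groups. The order of merges is exactly what a dendrogram draws. The choice of single linkage and Euclidean distance, the function name, and the feature vectors (imagined rates of two marker words in four texts) are all assumptions made for illustration.

```python
import math
from itertools import combinations

def single_linkage(points):
    """Merge the two closest clusters until one remains.

    `points` maps a text name to its feature vector. Returns the merges in
    order as (cluster_a, cluster_b, distance); clusters are tuples of names.
    """
    def distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[x], points[y]) for x in a for y in b)

    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: distance(*pair))
        merges.append((a, b, distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical feature vectors for four texts by two authors.
texts = {"A1": (3.0, 0.1), "A2": (3.2, 0.0), "B1": (0.2, 0.5), "B2": (0.3, 0.4)}
merges = single_linkage(texts)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 3))
```

Texts by the same (imagined) author are merged first, and the two author groups are joined only at the final, much larger distance.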


Correspondence analysis

• Example of superlatives in Dickens’ and Smollett’s works (Tabata 2007: http://www.digitalhumanities.org/dh2007/abstracts/xhtml.xq?id=259)

• Count frequency of 242 superlatives in 30 texts

• CA allows classification of associations between variables in a 2d matrix, rows x columns

• D1 distinguishes Dickens from Smollett

• D2


Principal components analysis

• Like cluster analysis but can work with much larger range of variables

• PCA is a statistical method for arranging large arrays of data into interpretable patterns

• “principal components” are computed by calculating the correlations between all the variables, then grouping them into sets that show the most correspondence

• each “set” is a “component”, or “dimension”
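As a rough illustration, the first principal component can be computed in plain Python by building the correlation matrix and applying power iteration to it. Power iteration is a technique chosen here for brevity, not one named on the slides; the sketch assumes no variable is constant, and it can fail in the contrived case where the starting vector is orthogonal to the component. The function name and data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for the first component of the correlation matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    # Standardise each variable so that covariances become correlations.
    z = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    corr = [[sum(row[i] * row[j] for row in z) / (n - 1) for j in range(d)]
            for i in range(d)]
    # Power iteration: multiply by the matrix and renormalise, repeatedly.
    vec = [1.0] * d
    for _ in range(iterations):
        vec = [sum(corr[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    # Project each standardised row onto the component.
    scores = [sum(row[j] * vec[j] for j in range(d)) for row in z]
    return vec, scores

# Hypothetical rates of two features in three texts (perfectly correlated).
rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
loadings, scores = first_principal_component(rows)
print([round(x, 3) for x in loadings], [round(s, 3) for s in scores])
```

The loadings show how strongly each variable contributes to the component, and the scores place each text along that dimension.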


Final word

• Many of these techniques are also used to identify different genres rather than different authors
– especially PCA, where the dimensions can be characterised

• (In fact, cluster analysis and PCA illustrations were taken from such a study!)

• An interesting question: how well do they work on pastiches?
– If interested, see H Somers & F Tweedie “Authorship attribution and pastiche”, Computers and the Humanities 37 (2003), 407-429.


REFERENCES

D. Holmes “Authorship attribution” Computers and the Humanities 28 (1994), 87-106.

D. Holmes “The Evolution of Stylometry in Humanities Scholarship” Literary and Linguistic Computing 13 (1998), 111-117. http://llc.oxfordjournals.org/cgi/reprint/13/3/111.pdf

T. McEnery & M. Oates “Authorship identification and computational stylometry” in Dale et al (eds) Handbook of Natural Language Processing, New York (2000): Dekker, chapter 23.