Historical spelling normalization

Historical spelling normalization. A comparison of two statistical methods: TICCL and VARD2. Martin Reynaert, Iris Hendrickx and Rita Marquilhas. Tilburg University, The Netherlands, and Centro de Linguística, Universidade de Lisboa, Portugal. November 29, 2012. ACRH-2 2012.

2012 (Nov.) Martin Reynaert, Iris Hendrickx, and Rita Marquilhas. «Historical spelling normalization. A comparison of two statistical methods: TICCL and VARD2», Second Workshop on Annotation of Corpora for Research in the Humanities, Universidade de Lisboa, Lisboa.

Transcript of Historical spelling normalization

Page 1: Historical spelling normalization

Background / Data and Methods / Results / Conclusions

Historical spelling normalization. A comparison of two statistical methods: TICCL and VARD2

Martin Reynaert, Iris Hendrickx and Rita Marquilhas

Tilburg University, The Netherlands and Centro de Linguística, Universidade de Lisboa, Portugal

November 29, 2012

ACRH-2 2012 Historical spelling normalisation

Page 2: Historical spelling normalization

Background & Motivation

Aim: Automatic spelling variation reduction in a historical corpus

The goal was to reduce spelling variation in the Portuguese CARDS-FLY corpus of personal letters written from the 16th to the 20th century.

This corpus aims to provide a digital version of the letters while keeping and recording as much as possible from the original handwritten letters, including all spelling variation. For certain types of research, or for querying the corpus, this variation can prevent good results.

Page 3: Historical spelling normalization

Overview

Origin of the personal letters

Introduction

How does the corpus look?

What data set did we use for the experiments?

Description of the two tools: VARD2 and TICCL

Results

Discussion

Page 4: Historical spelling normalization

The corpus / VARD2 / TICCL

The corpus

The CARDS-FLY corpus¹ is ongoing work [Marquilhas, 2012] and aims to collect a total of 4000 personal letters. Currently 3455 letters have been transcribed. The letters are manually transcribed into an electronic XML-TEI file format, including rich and detailed historical and sociological metadata.

Origin of the personal letters

1500-1800: from religious legal proceedings, as evidence used by the Inquisition,

19th C: legal evidence, in criminal cases heard by the Portuguese Royal Appeal Court,

20th C: soldiers who fought in World War I or in the Portuguese Colonial War, political prisoners and emigrants.

¹ CARDS-FLY corpus: http://alfclul.clul.ul.pt/cards-fly/

Page 5: Historical spelling normalization

Letter from Margarida to her lover, Jose, 1778. Ciphered words are in the Masonic (or Pigpen) code and refer to religious and Inquisition concepts.

Page 6: Historical spelling normalization

Manual transcription of the letter in XML (TEI v.5)

Figure: Full description at: http://alfclul.clul.ul.pt/cards-fly

Page 7: Historical spelling normalization

Aim: Spelling normalization of the transcription

Figure: English translation: I have more than once asked Your Honour and begged Your Honour to leave me alone. But Your Honour has insisted on defying me, dishonouring me, lessening me, engaging in gossip about me at every corner, both by words spoken and by letters written to whoever you choose. I remind you, speaking as a friend...

Page 8: Historical spelling normalization

Data set

For these experiments

Random subset of 200 letters from the CARDS-FLY corpus.

Tokenised, and names are converted to string ‘NAME’.

Normalisation and POS manually verified by a linguist.

This data set was split into 100 letters for training the tools and 100 letters for the evaluation set.

Evaluation scores are reported as recall, precision and F-score.
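The scoring step above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' evaluation script: the function name `score_normalisation`, the three token-aligned lists, and the exact definitions of true positive, false positive and false negative are all assumptions made for illustration.

```python
def score_normalisation(original, gold, system):
    """Token-level scores for spelling normalisation.

    Assumed definitions (not taken from the slides):
      TP: the system changed a token and the result matches the gold form
      FP: the system changed a token and the result does not match gold
      FN: the gold form differs from the original and the system missed it
    """
    tp = fp = fn = correct = 0
    for o, g, s in zip(original, gold, system):
        if s == g:
            correct += 1
        if s != o:                 # the system proposed a change
            if s == g:
                tp += 1
            else:
                fp += 1
        if g != o and s != g:      # a needed normalisation was not delivered
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = correct / len(gold) if gold else 0.0
    return {"acc": accuracy, "prec": precision,
            "recall": recall, "f-score": f_score}


# Toy example with invented tokens; names are already mapped to 'NAME'.
original = ["muyto", "NAME", "he", "hum", "dia"]
gold     = ["muito", "NAME", "é",  "um",  "dia"]
system   = ["muito", "NAME", "he", "um",  "dias"]
print(score_normalisation(original, gold, system))
```

In the toy example the system makes two correct changes, one wrong change and misses one needed change, so precision and recall are both 2/3 and accuracy is 3/5.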

Page 9: Historical spelling normalization

Statistics

Table: Statistics for the evaluation set of 100 letters, divided into the four time periods. ‘#Tok/file’ is the average number of tokens per letter, ‘#Norm/file’ the average number of manual spelling corrections per letter, and ‘%Norm/tok’ the percentage of all tokens that were normalised.

Period      Files  Tok    #Tok/file  #Norm/file  %Norm/tok
1500-1700   10     2262   226.2      56.9        25.2
1701-1800   28     13913  496.9      120.8       24.3
1801-1930   43     14343  333.6      60.7        18.1
1931-1974   19     6817   358.8      16.1        4.2

Page 10: Historical spelling normalization

VARD2 normalization tool

VARD2 was developed for historical English and works as follows:

VARD2 is first trained on data to set the parameters of the tool.

Each word is checked against a modern lexicon.

Unknown words are potential spelling variants.

For each variant, generate candidate modern counterparts using the HDBP² variant list, character rewrite rules and a Soundex algorithm to find phonetically similar counterparts.

Each candidate gets a confidence weight.

If above threshold, candidate replaces variant.

We replaced the English resources with Portuguese ones, re-using several existing resources.
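The steps above can be illustrated with a toy pipeline. This is a rough sketch, not VARD2's implementation or API: the lexicon, variant list, rewrite rules, fixed confidence weights and threshold are invented for illustration, and `difflib.SequenceMatcher` stands in for the phonetic (Soundex) matching step.

```python
from difflib import SequenceMatcher

# All resources below are hypothetical toy data.
LEXICON = {"muito", "ele", "um", "mais"}      # modern lexicon
KNOWN_VARIANTS = {"hum": "um"}                # variant list (stand-in for HDBP)
REWRITE_RULES = [("y", "i"), ("ll", "l")]     # character rewrite rules


def normalise(word, threshold=0.7):
    """Return a modern form for `word`, or `word` itself if unsure."""
    if word in LEXICON:                       # known modern word: keep it
        return word

    scored = {}                               # candidate -> confidence weight

    known = KNOWN_VARIANTS.get(word)          # 1. variant list lookup
    if known is not None:
        scored[known] = 1.0

    for old, new in REWRITE_RULES:            # 2. character rewrite rules
        candidate = word.replace(old, new)
        if candidate != word and candidate in LEXICON:
            scored[candidate] = max(scored.get(candidate, 0.0), 0.9)

    for entry in LEXICON:                     # 3. similarity (Soundex stand-in)
        similarity = SequenceMatcher(None, word, entry).ratio()
        scored[entry] = max(scored.get(entry, 0.0), 0.8 * similarity)

    best, confidence = max(scored.items(), key=lambda item: item[1])
    return best if confidence >= threshold else word


for token in ["muyto", "elle", "hum", "mais", "xyz"]:
    print(token, "->", normalise(token))
```

In the real tool the weights are set by training, as the first step above says; here they are fixed constants so that the thresholding step is easy to follow.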

² Historical Dictionary of Brazilian Portuguese

Page 11: Historical spelling normalization

Text-Induced Corpus Clean-up: Introduction

TICCL for TYPOS and OCR-errors

Tool to perform large-scale, unsupervised spelling correction of corpora.

Spelling correction = reduction of lexical variation caused by typos, OCR-errors, historical orthographical changes...

Prototype developed during a pilot project by invitation of the National Library, The Hague.

Production version developed according to KB specifications, second half 2008.

Development continues, Open Source release soon.

Is to be made multilingual, first paper on Portuguese presented here.

Page 12: Historical spelling normalization

TEXT-INDUCED CORPUS CLEAN-UP: BASIC RETRIEVAL MECHANISM

Represent identical bags of characters (i.e. word strings sharing the same bag of characters) by an identifying numerical value,

Use this value as the index key to the word strings in a database,

Perform simple calculations to retrieve variants from the database.

Page 13: Historical spelling normalization

ANAGRAM HASHING

$$\mathrm{Key}(w) = \sum_{i=1}^{|w|} f(c_i)^n$$

A bad hashing function: produces collisions.

Lines up ANAGRAMS: strings consisting of the same bag of characters.

In practice, we use the code value of each character in the string raised to the power 5.

Values obtained for the string are summed.
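The key function can be written down directly. A minimal sketch, assuming that the "code value" of a character is its Unicode code point as returned by Python's `ord`; the function name `anagram_key` is chosen here for illustration.

```python
def anagram_key(text: str, power: int = 5) -> int:
    """Sum of each character's code value raised to `power`."""
    return sum(ord(ch) ** power for ch in text)


print(anagram_key("ACT"))                        # 6692535156
print(anagram_key("CAT") == anagram_key("ACT"))  # True: same bag of characters
```

Because addition is commutative, any reordering of the same characters yields the same key, which is exactly the collision behaviour the method relies on.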

Page 14: Historical spelling normalization

ANAGRAM HASHING II

CAT = anagram of ACT and TAC

A + C + T = 65^5 + 67^5 + 84^5 = 6,692,535,156

C + A + T = 67^5 + 65^5 + 84^5 = 6,692,535,156

ALLOWS FOR ADDITION AND SUBTRACTION

SAME APPLIES FOR WORD COMBINATIONS, PHRASES, SENTENCES...

Great for discovering anagrams: citric critic, cosmic comics, pentatonische pistachenoten

BASIS FOR TISC: Text-Induced Spelling Correction

Page 15: Historical spelling normalization

ANAGRAM HASHING III

Given ANAGRAM VALUE (AV): 6,692,535,156

AV(ACT) + 84^5 (plus T) = AV(TACT)

AV(ACT) - 67^5 (minus C) = AV(AT) = AV(TA)

AV(ACT) - 84^5 + 82^5 (minus T, plus R) = AV(CAR)

AV(ACT) - 84^5 + 78^5 + 83^5 (minus T, plus N, plus S) = AV(CANS) = AV(SCAN)

Focus word approach: take a word and systematically search for its variants, then take the next word..., etc.

OR:

Character Confusion approach: systematically search for all word pairs in the corpus that display a particular difference in characters, for all possible confusions given a particular edit distance.
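Both search strategies reduce to arithmetic on the index keys. The sketch below is illustrative only and is not TICCL's code: the function names, the toy word list and the restriction to single-character insertions, deletions and substitutions are assumptions. Matches are bag-of-character matches, so a real system would still need to filter and rank the retrieved candidates.

```python
from collections import defaultdict
from string import ascii_uppercase

POWER = 5


def anagram_key(text):
    """Sum of character code values, each raised to POWER."""
    return sum(ord(ch) ** POWER for ch in text)


def build_index(words):
    """Map each anagram key to the set of words sharing that key."""
    index = defaultdict(set)
    for word in words:
        index[anagram_key(word)].add(word)
    return dict(index)


def focus_word_variants(word, index, alphabet=ascii_uppercase):
    """Focus word approach: look up keys reachable from one word."""
    base = anagram_key(word)
    deltas = {0}                                  # 0 retrieves pure anagrams
    for c in alphabet:
        deltas.add(anagram_key(c))                # insert c
    for c in set(word):
        deltas.add(-anagram_key(c))               # delete c
        for d in alphabet:
            if d != c:
                deltas.add(anagram_key(d) - anagram_key(c))  # replace c by d
    found = set()
    for delta in deltas:
        found |= index.get(base + delta, set())
    found.discard(word)
    return found


def character_confusion_pairs(index, delta):
    """Character confusion approach: all word pairs whose keys differ by delta."""
    pairs = []
    for key, words in index.items():
        for other in index.get(key + delta, ()):
            pairs.extend((word, other) for word in words)
    return sorted(pairs)


index = build_index(["ACT", "CAT", "TACT", "AT", "TA", "CAR", "CANS", "SCAN"])
print(sorted(focus_word_variants("ACT", index)))
t_to_r = anagram_key("R") - anagram_key("T")      # confusion: T replaced by R
print(character_confusion_pairs(index, t_to_r))
```

CANS and SCAN are not retrieved for ACT here because they need two character operations; covering them would mean adding two-character deltas to the search.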

Page 16: Historical spelling normalization

TICCL for Portuguese

TICCL has been converted to Portuguese by providing it with a Portuguese lexicon, which was the same one as used for VARD2.

Derived from the lexicon is a word confusion matrix which in fact provides the list of all possible confusables (also known as real-word errors in spelling correction).

TICCL has been equipped with absolute correction (cf. Pollock & Zamora 1984).

TICCL has been equipped with bigram correction capabilities: only applied to short words in this study.

Page 17: Historical spelling normalization

Comparison of VARD2 and TICCL

Table: Best-first ranked performance of TICCL and VARD2 on the tokens of the test set. TICCL was trained only on the training set variant list. VARD2 and TICCL2 were trained on both the training set variant list and the HDBP variant list.

Tool     acc    prec   recall  f-score
VARD2    94.65  96.99  73.63   83.71
TICCL    93.25  94.27  67.96   78.98
TICCL2   93.50  94.38  69.33   79.94
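As a worked check on the table, the f-score column is consistent with the balanced F-measure (the harmonic mean of precision and recall). That this is the definition used is an assumption, since the slides do not state it. For the VARD2 row:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 96.99 \times 73.63}{96.99 + 73.63}
  = \frac{14282.75}{170.62}
  \approx 83.71
```

The same formula gives 78.98 for TICCL and 79.94 for TICCL2, matching the reported values.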

Page 18: Historical spelling normalization

Comparison of VARD2 and TICCL - II

Table: Results on the tokens of the test set of 100 letters, measuring TICCL’s 3, 5, 10 and 20 first-best ranking with bigram correction and with absolute correction. Also shown is the effect of TICCL not performing bigram correction. Finally, the effects of VARD2 not having been trained, and of TICCL not using absolute correction, with the variation list(s).

Tool                       acc    precision  recall  f-score
TICCL-bi-rank3             94.11  94.62      72.57   82.14
TICCL-bi-rank5             94.35  94.71      73.89   83.01
TICCL-bi-rank10            94.55  94.78      74.92   83.69
TICCL-bi-rank20            94.66  94.82      75.52   84.08
TICCL-uni-rank20           94.42  95.03      73.99   83.20
VARD2-notraining           90.58  93.79      53.05   67.77
TICCL-bi-rank20-noabsolut  89.18  92.03      46.02   61.35

Page 19: Historical spelling normalization

Discussion of the evaluations

Observations

Both VARD2 and TICCL benefit greatly from absolute correction or specific training data.

TICCL goes some way towards context-sensitive spelling correction, but lacked contemporary bigrams from a background corpus.

TICCL has a ranking problem due to the greater morphological variation in Portuguese. It might outperform VARD2 if this were solved.

TICCL could also be extended to productively handle larger Levenshtein distances on the basis of gold standard training data, e.g. the numerical anagram value for the difference between ‘exmo’ and ‘excelentíssimo’ also holds for the plural and feminine forms.
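The last observation can be checked numerically. A minimal sketch, assuming the key is the sum of fifth powers of Unicode code points as on the anagram hashing slides; the list of abbreviation and full-form pairs is supplied here for illustration.

```python
def anagram_key(text, power=5):
    """Sum of each character's code value raised to `power`."""
    return sum(ord(ch) ** power for ch in text)


# Abbreviation and full form: masculine, feminine and their plurals.
pairs = [
    ("exmo", "excelentíssimo"),
    ("exma", "excelentíssima"),
    ("exmos", "excelentíssimos"),
    ("exmas", "excelentíssimas"),
]

differences = {anagram_key(full) - anagram_key(abbr) for abbr, full in pairs}
print(len(differences))  # 1: a single value covers all four pairs
```

Each pair differs by the same bag of characters (c, e, l, e, n, t, í, s, s, i), so one difference value learned from a single gold standard pair retrieves the other inflected forms as well.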

Page 20: Historical spelling normalization

Future steps

What to do next?

Study the strengths of VARD2 and TICCL and see whether we can combine them in one system.

(Do the right thing and) Give TICCL contemporary bigrams from a background corpus.

Full context-sensitive spelling correction is needed for this type of spelling variation to raise recall above the ~75% ceiling reached now.

Page 21: Historical spelling normalization

Thanks!!

Thanks for your attention!

Papers about TICCL are available at: http://ilk.uvt.nl/
