
Slide 1

The Ferret Copy Detector

Finding short passages of similar texts in large document collections

Relevance to natural computing:
• The system is based on processing short sequences.
• It exploits sequencing characteristics of natural language.
• It has a natural analogue in human sequencing processors.

Caroline Lyon, University of Hertfordshire, [email protected]

Slide 2

Human sequencing functions

Primitive sequencing processors in the sub-cortical basal ganglia part of the brain control motor functions, e.g. walking.

These sub-cortical sequencing processors also contribute to cognitive processing, e.g. language, complementing cortical functions.

Reference

Human Language and our Reptilian Brain, P Lieberman, 2000

Slide 3

Sequencing in human speech and language

Sequential processing is necessary at many levels:
• Phonetic
• Syllabic
• Lexical
• Syntactic

Phonetics: speakers must control a sequence of independent motor acts to produce speech sounds.

Slide 4

Sequencing in speech and language (2)

• Phonetic segments can only be combined in certain ways to produce phonemes and then syllables.

• Different languages have different phonemic systems, but all have sequential constraints.

• Syllables combine to make words, which combine to make phrases, which combine to make sentences.

All have constraints.

Slide 5

The need for sequential processing

• Many of our most frequently used words are homophones:
  <to, too, two>  <for, four>  <here, hear>
• This is true of other languages too.
• This does not seem to impede communication.
• Our primary method of disambiguation is through sequential processing of short strings of words:
  e.g. <I want to/too/two eggs> has only one interpretation.

Slide 6

Alternative method of avoiding word ambiguity

A recent mathematical model of human language asserts that there are unique mappings from sounds to meanings, and that absence of word ambiguity is a mark of evolutionary fitness.
[Computational and Evolutionary Aspects of Language, M. Nowak et al., Nature, June 2002, vol. 417, pp. 611-617; and other references]

This is a logical suggestion, but it is not how human language works.

Slide 7

Language models

• Language can be modelled by a regular grammar – a linear sequence of symbols.

• Chomsky showed that this model is inadequate.

• However, it has produced effective practical applications: speech recognition systems are typically based on Markov models.

• The Ferret is based on a model of simple linear sequences.

Slide 8

Concepts underlying the Ferret (1)

A text can be converted into a set of short sequences of adjacent words – bigrams, trigrams etc.

Example with trigrams:

"A storm was forecast for today"

becomes

(a storm was) (storm was forecast) (was forecast for) (forecast for today)
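The conversion above can be sketched in a few lines of Python. This is a minimal illustration only; the Ferret's actual tokenisation rules (punctuation, case, etc.) may differ.

```python
# Convert a text into its set of word trigrams.
# Minimal sketch: lowercases and splits on whitespace only;
# the actual Ferret tokeniser may handle punctuation differently.

def trigrams(text):
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

print(sorted(trigrams("A storm was forecast for today")))
# [('a', 'storm', 'was'), ('forecast', 'for', 'today'),
#  ('storm', 'was', 'forecast'), ('was', 'forecast', 'for')]
```

A set (rather than a list) is used because, as the next slides explain, the comparison only asks which trigrams two documents share, not how often each occurs.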

Slide 9

Concepts underlying the Ferret (2)

To find similar passages in two documents, both texts are converted to sets of trigrams. Then the sets are compared for matches.

Independently written texts have a sprinkling of matches. But copied passages (not necessarily identical) have a significant number of matches, above a threshold.

Slide 10

Zipfian distribution of words (or why this method works)

A small number of words occur frequently, but most words occur rarely.

This phenomenon is more pronounced for bigrams and trigrams.

The characteristic form of a text based on trigrams will have a few frequent trigrams, but most will be rare.

References:
Prediction and Entropy of Printed English, C. Shannon, 1950.
Many publications in the speech recognition literature.

Slide 11

Statistics from Wall Street Journal corpus (1)

Number of words | Distinct trigrams | Singleton trigrams | % singletons
-----------------------------------------------------------------------
     972,868    |       648,482     |        556,185     |     86%
   4,513,716    |     2,420,168     |      1,990,507     |     82%
  38,532,517    |    14,096,109     |     10,907,373     |     77%

From Handbook of Standards and Resources for Spoken Language Systems, Gibbon et al., 1997

Slide 12

Statistics from Wall Street Journal (WSJ) corpus (2)

• WSJ is a narrow domain.
• Topics are revisited.
• On close dates, subjects may be very similar.

Yet after over 38 million words have been analyzed, a new article will on average have 77% new trigrams.

Slide 13

The Ferret and speech recognition systems

"Sparse data" is a key problem in speech recognition: new input to a system typically contains a number of previously unseen trigrams.

The Ferret exploits this problem: sparse data means a text has characteristic features that do not appear in other texts, unless passages are copied.

Slide 14

Comparison metrics in the Ferret

Set-theoretic measures are used to compare documents.

Two texts of comparable length have a Resemblance, R.
If N_A and N_B are the sets of trigrams in texts A and B, then:

R = |N_A ∩ N_B| / |N_A ∪ N_B|

There is a threshold for R, found empirically, above which texts are suspiciously similar.
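This Resemblance measure is the Jaccard coefficient computed over trigram sets. A minimal sketch in Python (the example texts are invented; the empirical threshold itself is not shown here):

```python
# Resemblance R = |N_A ∩ N_B| / |N_A ∪ N_B| over word-trigram sets,
# i.e. the Jaccard coefficient. Minimal sketch; the Ferret's own
# tokenisation may differ.

def trigrams(text):
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def resemblance(text_a, text_b):
    na, nb = trigrams(text_a), trigrams(text_b)
    return len(na & nb) / len(na | nb) if na | nb else 0.0

r = resemblance("a storm was forecast for today",
                "a storm was forecast for tomorrow")
print(r)  # 3 shared trigrams out of 5 distinct -> 0.6
```

Because the measure is symmetric and normalised by the union, it is insensitive to which text is longer, which is why the slide notes it applies to texts of comparable length.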

Slide 15

Benchmarking Resemblance threshold

Experiments were conducted on The Federalist Papers.

This set of essays, the basis of the American Constitution, is very well known.
• 81 of the papers were used.
• 2 authors.
• All are on related topics.

The maximum measure of resemblance between two of these essays suggests an upper limit on similarity between independently written texts.

Slide 16

The Ferret process

To find similar passages in large document collections

1. Documents are converted to .txt from Word (or, shortly, from .pdf).

2. Each text is converted to a set of trigrams; in this form, each is compared with every other.

3. A table showing the Resemblance between each pair of texts is displayed in ranked order. The user can select any pair, display the two texts side by side, see matching sections highlighted, and save the result if wanted.
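The all-pairs comparison and ranked table can be sketched as follows. The file names and texts are invented for illustration, and Resemblance is the Jaccard coefficient on trigram sets as defined on slide 14; this is not the Ferret's actual implementation.

```python
# Rank every pair of documents by Resemblance, highest first,
# as in the Ferret's results table. Texts are invented examples.
from itertools import combinations

def trigrams(text):
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def resemblance(na, nb):
    return len(na & nb) / len(na | nb) if na | nb else 0.0

docs = {
    "essay1.txt": "a storm was forecast for today and rain fell",
    "essay2.txt": "a storm was forecast for today but stayed dry",
    "essay3.txt": "the committee met on tuesday to review the budget",
}
sets = {name: trigrams(text) for name, text in docs.items()}
table = sorted(
    ((resemblance(sets[a], sets[b]), a, b)
     for a, b in combinations(sorted(docs), 2)),
    reverse=True,
)
for r, a, b in table:
    print(f"{r:.2f}  {a}  {b}")  # the pair sharing a passage ranks first
```

Note the quadratic cost: for n documents there are n(n-1)/2 comparisons, though each comparison is cheap once the trigram sets are built.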

Slide 17

The Ferret as plagiarism detector for students’ work

• Detects plagiarism or collusion in work from large cohorts of students.
• Short sections of similar text can be identified, even with some insertions and deletions.
• Documents from the web can be included in a semi-automatic process: the top 50 hits from a search are converted to .txt and added to the other texts.

Reference: Experiments in Electronic Plagiarism Detection, C. Lyon et al., TR 388, Computer Science Dept., University of Hertfordshire, 2003.

Slide 18

Ferret demonstration

Aim: to find whether there are similar passages in any two documents.

Data:
• 320 texts of 10,000 words, taken from the Gutenberg site. Copying was simulated by pasting passages of 100 to 400 words from one text into another.
• 100 texts of student work, 2,000 - 5,000 words each.
• 34 documents from Dutch students.
• Please bring other data to try.