Page 1: A Stylometry System for Authenticating Students Taking ...csis.pace.edu/~ctappert/srd2011/p-stylometry.doc  · Web viewLexical features are word- or character-based statistical measures

A Stylometry System for Authenticating Students Taking Online Tests

Alex Castro, Ola Sotoye, Linda Torres, Greg Truley, Vinnie Monaco, and John Stewart
Seidenberg School of CSIS, Pace University, White Plains, NY 10606, USA

[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

Keystroke biometrics is the measurement of typing characteristics that are believed to be unique to an individual and are very difficult to duplicate. Stylometry biometrics is the measurement of the frequency of text features, such as number of alphabetic characters/total number of characters, number of uppercase characters/number of alphabetic characters, number of two-letter words/total number of words, number of commas (,)/total number of the eight punctuation symbols, and other features. Our team will use the data collected by the Keystroke Project team and a modified version of the feature extraction component of the Keystroke application. We will feed the train and test files to the same classifier used in the Keystroke system and measure accuracy by obtaining the FAR (False Acceptance Rate), the FRR (False Rejection Rate), and overall performance. Preliminary results will show that the system is functional and give an indication of the usefulness of Stylometry for verifying the identity of online test takers.

1. Introduction

Stylometry is the study of the unique linguistic styles and writing behaviours of individuals in order to determine authorship. Stylometry is used to attribute authorship to anonymous or disputed documents. It has legal as well as academic and literary applications. Stylometry has been used to determine (or narrow the possibilities of) the authorship of historic documents, of ransom notes, and of other documents in forensics, etc. Authorship analysis is a process of examining the characteristics of a piece of writing to draw conclusions on its authorship. [7]

The goal of this project is to use a combination of the Keystroke and Stylometry tools already in place to ensure that online test takers are who they say they are. Initial samples will be taken from each student's keystrokes and from the characteristics of the text the student produces. The text samples will be used to assign authorship, and both kinds of sample will be used to authenticate test takers. Keystroke and Stylometry tools differ in the data they gather: Keystroke places more emphasis on timing features, while Stylometry places more emphasis on the frequency of text features.

Keystroke behaviour is used to recognize or verify the identity of a person. Behavioural biometrics, such as the way we sign our name or type our password, are unique as well. The manner in which we type on a computer keyboard varies from individual to individual and is considered to be a unique behavioural biometric. Keystroke Dynamics, or Keystroke Recognition, is probably one of the easiest forms of biometrics to implement and manage, because at present it is entirely a software-based solution. There is no need to install any new hardware; all that is needed is the computer and keyboard already in place and in use.

Stylometry (also referred to as authorship analysis) is defined as the “statistical analysis of writing style.” Four important characteristics of Stylometric analysis are the tasks, stylistic features, classification techniques, and parameters (i.e., factors influencing authorship analysis performance, such as number of classes, amount of text noise). [1] Stylistic features are the attributes or writing style markers that are most effective for authenticating authorship. The vast array of stylistic features includes lexical, syntactic, structural, content-specific, and idiosyncratic style markers. Lexical features are word- or character-based statistical measures of lexical variation. These include style markers such as sentence/line length, vocabulary richness, and word length distributions. Syntactic features include function words, punctuation, and part-of-speech. Structural features, which are useful for online text, include attributes relating to text organization and layout. Content-specific features are important key words and phrases pertaining to certain topics. For example, content-specific features in a discussion of computers may include “laptop” and “notebook.” Idiosyncratic features include misspellings, grammatical mistakes, and other usage anomalies. Such features are extracted using spelling and grammar checking tools. Over 1,000 different features have been used in previous authorship analysis research. Certain feature categories may be more effective at capturing style variations in different contexts.

2. Stylometry Features - Methodology

For our project, and in line with our customer requirements, we used 55 features to obtain results: the 49 character-based features and word-based features 1-6. In future work, syntactic features can be added.

The table below illustrates the sample feature list used.

Character-based features:
1. number of alphabetic characters/total number of characters
2. number of uppercase alphabetic characters/total number of alphabetic characters
3. number of digit characters/total number of characters
4. number of space characters/total number of characters
5. number of vowel (a,e,i,o,u) characters/number of alphabetic characters
6. number of alphabetic "a" (upper or lowercase) characters/number of vowel characters
7. number of alphabetic "e" (upper or lowercase) characters/number of vowel characters
8. number of alphabetic "i" (upper or lowercase) characters/number of vowel characters
9. number of alphabetic "o" (upper or lowercase) characters/number of vowel characters
10. number of alphabetic "u" (upper or lowercase) characters/number of vowel characters
11. number of most frequent consonant characters (t,n,s,r,h)/number of alphabetic characters
12. number of alphabetic "t" (upper or lowercase) characters/number of most frequent consonant characters
13. number of alphabetic "n" (upper or lowercase) characters/number of most frequent consonant characters
14. number of alphabetic "s" (upper or lowercase) characters/number of most frequent consonant characters
15. number of alphabetic "r" (upper or lowercase) characters/number of most frequent consonant characters
16. number of alphabetic "h" (upper or lowercase) characters/number of most frequent consonant characters
17. number of second most frequent consonant characters (l,d,c,p,f)/number of alphabetic characters
18. number of alphabetic "l" (upper or lowercase) characters/number of second most frequent consonant characters
19. number of alphabetic "d" (upper or lowercase) characters/number of second most frequent consonant characters
20. number of alphabetic "c" (upper or lowercase) characters/number of second most frequent consonant characters
21. number of alphabetic "p" (upper or lowercase) characters/number of second most frequent consonant characters
22. number of alphabetic "f" (upper or lowercase) characters/number of second most frequent consonant characters
23. number of third most frequent consonant characters (m,w,y,b,g)/number of alphabetic characters
24. number of alphabetic "m" (upper or lowercase) characters/number of third most frequent consonant characters
25. number of alphabetic "w" (upper or lowercase) characters/number of third most frequent consonant characters
26. number of alphabetic "y" (upper or lowercase) characters/number of third most frequent consonant characters
27. number of alphabetic "b" (upper or lowercase) characters/number of third most frequent consonant characters
28. number of alphabetic "g" (upper or lowercase) characters/number of third most frequent consonant characters
29. number of least frequent consonant characters (j,k,q,v,x,z)/number of alphabetic characters
30. number of consonant-consonant digrams/total number of alphabet letter digrams
31. number of "th" digrams/total number of consonant-consonant digrams
32. number of "st" digrams/total number of consonant-consonant digrams
33. number of "nd" digrams/total number of consonant-consonant digrams
34. number of vowel-consonant digrams/total number of alphabet letter digrams
35. number of "an" digrams/total number of vowel-consonant digrams
36. number of "in" digrams/total number of vowel-consonant digrams
37. number of "er" digrams/total number of vowel-consonant digrams
38. number of "es" digrams/total number of vowel-consonant digrams
39. number of "on" digrams/total number of vowel-consonant digrams
40. number of "at" digrams/total number of vowel-consonant digrams
41. number of "en" digrams/total number of vowel-consonant digrams
42. number of "or" digrams/total number of vowel-consonant digrams
43. number of consonant-vowel digrams/total number of alphabet letter digrams
44. number of "he" digrams/total number of consonant-vowel digrams
45. number of "re" digrams/total number of consonant-vowel digrams
46. number of "ti" digrams/total number of consonant-vowel digrams
47. number of vowel-vowel digrams/total number of alphabet letter digrams
48. number of "ea" digrams/total number of vowel-vowel digrams
49. number of double-letter digrams/total number of alphabet letter digrams

Word-based features:
1. number of one-letter words/total number of words (letter = alphabetic character, upper or lowercase)
2. number of two-letter words/total number of words
3. number of three-letter words/total number of words
4. number of four-letter words/total number of words
5. number of five-letter words/total number of words
6. number of six-letter words/total number of words
7. number of seven-letter words/total number of words
8. number of words having eight or more letters/total number of words
9. number of short words (less than four characters)/total number of words
10. number of letters in all words/total number of words (i.e., average word length)
11. number of different words/total number of words

Syntactic features:
1. total number of the eight punctuation symbols (period, comma, question mark, exclamation point, semicolon, colon, single quote, double quote)/total number of characters
2. number of periods (.)/total number of the eight punctuation symbols
3. number of commas (,)/total number of the eight punctuation symbols
4. number of question marks (?) or exclamation points (!)/total number of the eight punctuation symbols
5. number of semicolons (;) or colons (:)/total number of the eight punctuation symbols
6. number of single quotes (') and double quotes (")/total number of the eight punctuation symbols
7. total number of non-alphabetic, non-punctuation, and non-space characters (0,1,2,3,4,5,6,7,8,9,@,#,$,%,etc.)/total number of characters
8. total number of digit characters/total number of non-alphabetic, non-punctuation, and non-space characters
9. total number of articles (a, an, the)/total number of words
10. total number of "the" articles/total number of articles
11. total number of "a" or "an" articles/total number of articles
12. total number of common conjunctions (after, although, and, as, because, before, both, but, either, even, for, how, however, if, neither, nor, now, once, only, or, provided, rather, since, so, than, that, though, till, unless, until, when, whenever, where, whereas, wherever, whether, while, yet)/total number of words
13. total number of common interrogatives (where, which, what, who, whom, whose, when, how, why, whether)/total number of words
14. total number of common prepositions (aboard, about, above, across, after, against, along, amid, among, anti, around, as, at, before, behind, below, beneath, beside, besides, between, beyond, but, by, concerning, considering, despite, down, during, except, excepting, excluding, following, for, from, in, inside, into, like, minus, near, of, off, on, onto, opposite, outside, over, past, per, plus, regarding, round, save, since, than, through, to, toward, towards, under, underneath, unlike, until, up, upon, versus, via, with, within, without)/total number of words
15. number of first-person personal pronouns (I, me, mine, my, myself, our, ours, ourselves, us, we)/total number of personal pronouns (first, second, and third person)
16. number of second-person personal pronouns (you, your, yours, yourself, yourselves)/total number of personal pronouns
17. number of third-person personal pronouns (he, her, hers, herself, him, himself, his, it, its, itself, she, their, theirs, them, themself, themselves, they)/total number of personal pronouns
18. total number of personal pronouns/total number of words

Figure 1. Stylometry Features List
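Most of the features in Figure 1 are simple ratios of two counts taken over the same text. The following Python sketch shows how a handful of them (character features 1, 2, 5, and 7, and word feature 2) could be computed. It is not the project's extractor: the function names, the dictionary keys, the word-splitting rule, and the choice to return 0.0 when a denominator is zero are all assumptions made for illustration.

```python
import re

VOWELS = set("aeiou")


def ratio(numerator, denominator):
    """Return numerator/denominator, or 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def extract_features(text):
    """Compute a few of the ratio features listed in Figure 1 (hypothetical key names)."""
    # Only ASCII letters count as alphabetic characters in this sketch.
    letters = [c for c in text if c.isascii() and c.isalpha()]
    # A word is taken to be a maximal run of ASCII letters (an assumption).
    words = re.findall(r"[A-Za-z]+", text)
    vowels = [c for c in letters if c.lower() in VOWELS]
    return {
        # Character feature 1: alphabetic characters / total characters
        "alpha_per_char": ratio(len(letters), len(text)),
        # Character feature 2: uppercase letters / alphabetic characters
        "upper_per_alpha": ratio(sum(c.isupper() for c in letters), len(letters)),
        # Character feature 5: vowels / alphabetic characters
        "vowel_per_alpha": ratio(len(vowels), len(letters)),
        # Character feature 7: "e" (upper or lowercase) / vowels
        "e_per_vowel": ratio(sum(c.lower() == "e" for c in vowels), len(vowels)),
        # Word feature 2: two-letter words / total words
        "two_letter_words": ratio(sum(len(w) == 2 for w in words), len(words)),
    }
```

Because every feature is a ratio of a count to a larger enclosing count, each value already falls in the range 0-1, which matches the normalization expected by the feature file format described later.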

To use Stylometry as a tool for authenticating the authorship of literary works, legal documents, or simple written communication such as e-mails or personal notes, the targeted features must be quantifiable and distinctive. The underlying premise is that writing styles are as unique to the individual as fingerprints, because a writing style is a largely unconscious part of written communication.

Baily [2] identifies general properties for text features: “These features should be salient, structural, frequent and easily quantifiable and relatively immune from conscious control.” Through a process of quantitative analysis, in which these features are used to classify writing style, it is believed that the writing characteristics will identify the author.

Approaches to classifying writing characteristics fall into two groups: qualitative and quantitative. The qualitative approach assesses the errors and personal behaviour of the authors, also known as idiosyncrasies. According to Chaski [3], this approach could be quantified through profiling and data storage; however, historically, databases have been woefully inadequate. Database profiles provide support for identifying feature styles, and without them conclusions can be subject to experimental bias. To avoid such bias, Koppel and Schler [5] proposed the use of 99 error features to feed different classifiers. The best result reported was a recognition rate of about 72%.

The second approach is quantitative and computational. This approach is considered the basis of Stylometry, which is centred on features that can be numerically counted and computed (e.g., word length, phrase length, sentence length, vocabulary frequency, distribution of words of different lengths). Thus, theoretical linguistics provides a paradigm by which conclusive analysis can be drawn as a result of mathematical data manipulation. Examples can be found in Chaski [3]. Experimental results show that this approach usually provides better results than qualitative analysis.

In the scientific literature, linguistic features are used for author verification. Chaski [3] discusses author verification in terms of scientific, replicable research methods: through empirical data testing, hypotheses are researched and conclusions are grounded on repeatable evidence. In Chaski’s [4] work, nine empirical hypotheses were used to identify authors: Vocabulary Richness (number of distinct words), Hapax Legomena (number of words occurring once), Readability Measures, Content Analysis, Spelling Errors, Grammatical Errors, Syntactically Classified Punctuation, and Abstract Syntactic Structures.

Vocabulary Richness is defined as the ratio of the number of distinct words (types) to the total number of words (tokens). Hapax Legomena (HL) is the ratio of the number of words occurring once to the total number of words. Readability Measures focus on how easily a document is understood and are derived from calculations on sentence length and word length. Content Analysis is a method by which each word is classified into a semantic category and the distances between word pairs are analyzed statistically. Spelling Errors counts misspelled words. Grammatical Errors tests for errors such as sentence fragments, run-on sentences, subject-verb mismatches, tense shifts, wrong verb forms, and missing verbs [4].
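The first two measures are straightforward ratios over word counts. A minimal Python sketch follows, assuming that words are lowercased runs of the letters a-z (the tokenization rule and the function name are assumptions, not taken from Chaski's work):

```python
import re
from collections import Counter


def vocabulary_measures(text):
    """Return (vocabulary_richness, hapax_legomena_ratio) for a text.

    vocabulary_richness  = distinct words (types) / total words (tokens)
    hapax_legomena_ratio = words occurring exactly once / total words
    """
    # Assumed tokenization: lowercase the text and keep runs of a-z.
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    if total == 0:
        return 0.0, 0.0
    counts = Counter(words)
    richness = len(counts) / total
    hapax = sum(1 for n in counts.values() if n == 1) / total
    return richness, hapax
```

For the text "the cat and the dog" there are 5 tokens and 4 types, of which 3 occur once, so the two measures are 0.8 and 0.6.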

Syntactically Classified Punctuation identifies the end-of-sentence period, the comma separating main and dependent clauses, the comma in a list, etc. Abstract Syntactic Structures uses computational analysis to identify syntactic patterns, with verb phrase structure as the differentiating feature.

Online stylometric analysis is concerned with the categorization of authorship style for online texts. Message-level analysis attempts to categorize individual texts (e.g., emails), whereas identity-level analysis is concerned with classifying identities belonging to a particular entity. Message-level analysis is not highly scalable to larger numbers of authors in cyberspace due to difficulties in consistently identifying texts shorter than 250 words [Forsyth and Holmes 1996]. Identity-level analysis attempts to categorize identities based on all texts written by that identity [7].

2.2. Features Data Collection, Extraction, and Classification

To collect data for our samples, students from Professor Stewart’s class were given a final exam using the Keystroke Entry System (KES) online test-taking system. The KES is a web-based application that provides an interface for instructors as well as an interface for students to answer test questions. For this test, a group of 15 students from Lake Erie College completed an actual test as their final exam.

KES has been customized for use by students to take tests. The new version of the system allows users to log in using only a username and password. The questions displayed are read from a text file, which allows for simple modification by the instructor.

To use the KES system, the student first enters some demographic information, such as name and type of computer. This demographic data is stored in the database and retrieved later when the student logs in to take the test. Once the demographic questions are answered, the student is prompted to answer a series of questions that the professor has supplied (e.g., questions for a mid-term or final exam). Note: a completed sample is at least 300 keystrokes. The professor also has an interface to set up and customize the questions that will be used for the tests. The test taker selects questions from a list of choices and answers them in free text. The data from the KES system is presented in text format, with data for all the test samples. For our test, there were 15 students, each answering 8 questions. The resulting output is a file containing raw keystroke and text data for each test taker. Because only the text is needed, the output files are modified by removing the keystroke data, leaving only the text portion of each file.

For the data to be run successfully through the Stylometry Extractor program, it has to be in the correct format, so each of the student files needs to be modified. The keystroke portion of the raw data was not needed, as we are only testing for Stylometry features. The text portion of the file (located at the bottom of each file) was used for testing. These files are then fed into the Extractor program. Since the system was designed for extracting keystroke data, a significant part of the effort was to develop a modified extractor that extracts Stylometry features instead of Keystroke data. Note: during the initial review of our project deliverables, a new feature extraction program was considered, as modifying the Keystroke Java file was determined to be complicated (the previous code comments were not available during the discovery phase). A newly modified feature extraction program that reads the raw data text files and extracts 55 of the Stylometry features was finally produced. This code, once successfully downloaded, can be run by executing the command:

java -cp clojure.jar:clojure-contrib-1.2.0.jar clojure.main stylometry-main.

When run in the specified directory path with the appropriate version of the Java JRE (1.6) loaded, the command prompts for the directory where the modified KES raw data files are stored.

Figure 2. Stylometry Extractor

A resulting stylometry.features file will be the output.

Figure 3. Stylometry.Features File

The result is a feature vector file. This feature data format allows the same classifier from the keystroke authentication system to be used to measure system accuracy.

The data takes the form of a text-readable file or corresponding spreadsheet. The form of the file is as follows with fields in a record comma delimited and items in a field slash delimited. The first record contains the name and description of the file, including the approximate date the samples were recorded. The second record contains the number of samples (pattern instances). The remaining number-of-pattern-instances records contain the following fields:

- ID data (e.g., name/gender/date-of-birth or age)
- Person’s application-related information (e.g., handedness for mouse or keystroke biometric, nationality for speech accents, etc.)
- Equipment-related information (e.g., mouse type, keyboard type, etc.)
- Task performed (e.g., copy task, free-text email task, etc.)
- Number of feature attributes/measurements
- Sequence of feature values/measurements, normalized into the range 0-1

The resulting feature vector file is then manually split into training and test files (for our testing, 4 samples for training and 4 for testing for each student).
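The split was done by hand in this project. For illustration only, the same 4/4 split per student could be sketched in Python as below. The in-memory representation (a list of subject ID and feature vector pairs, in recorded order) and the function name are assumptions; the actual feature file would first have to be parsed into this form.

```python
from collections import defaultdict


def split_train_test(samples, n_train=4):
    """Split samples per subject: the first n_train go to training, the rest to testing.

    samples: list of (subject_id, feature_vector) pairs in recorded order (assumed layout).
    """
    by_subject = defaultdict(list)
    for subject, vector in samples:
        by_subject[subject].append(vector)

    train, test = [], []
    for subject, vectors in by_subject.items():
        train.extend((subject, v) for v in vectors[:n_train])
        test.extend((subject, v) for v in vectors[n_train:])
    return train, test
```

Swapping the two returned lists gives the reversed arrangement used for the second experiment.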

Figure 4. Sample Modified Test File

The files are then run through the classifier program (BAS) to identify the Stylometry patterns for each of the 15 subjects and thereby authenticate them. The BAS system uses the k-Nearest Neighbour (KNN) classifier. The nearest neighbour rule is one of the simplest and most important methods in pattern recognition. Distance-weighted KNN assigns different weights to the nearest neighbours according to their distance to the unclassified sample. Difference-weighted KNN weighs the nearest neighbours using the correlation of the differences between the unclassified sample and its nearest neighbours.
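The internals of the BAS classifier are not described here, so the following Python sketch shows only the basic, unweighted nearest neighbour rule: find the k training vectors closest to the query and take a majority vote of their labels. The use of Euclidean distance, the default k, and the function name are assumptions for illustration.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors.

    train: list of (label, feature_vector) pairs.
    Distance is Euclidean (an assumption; BAS may use a different metric).
    """
    # Sort the training samples by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[1], query))[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]
```

A weighted variant would replace the plain vote count with a sum of per-neighbour weights, for example weights that decrease with distance.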

As part of the processing, the data is dichotomized into two classes. The BAS classifier trains on one set and tests on the other to obtain results.
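The text does not spell out how the two classes are formed. A common construction for a dichotomy model, assumed here, is to take every pair of samples, compute the element-wise absolute difference of their feature vectors, and label the difference "within" when both samples come from the same person and "between" otherwise. The sketch below follows that assumption; the labels and function name are illustrative.

```python
from itertools import combinations


def dichotomize(samples):
    """Turn per-subject feature vectors into labelled difference vectors.

    samples: list of (subject_id, feature_vector) pairs.
    Returns a list of (label, difference_vector) pairs, where label is
    "within" (same subject) or "between" (different subjects).
    """
    pairs = []
    for (subj_a, vec_a), (subj_b, vec_b) in combinations(samples, 2):
        diff = [abs(a - b) for a, b in zip(vec_a, vec_b)]
        label = "within" if subj_a == subj_b else "between"
        pairs.append((label, diff))
    return pairs
```

Under this assumed construction the classification problem becomes two-class regardless of the number of subjects, and between-person pairs greatly outnumber within-person pairs as the number of subjects grows.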

Figure 5. BAS

After the dichotomy model is applied, the data is saved and is ready to be processed by the BAS Accuracy Calculator. The calculator takes the output file from the Biometric Authentication System and applies nearest neighbour calculations to determine the overall performance of the test, the false acceptance rate, and the false rejection rate for each nearest neighbour setting. The system asks for the number of N choices used to optimize the testing performance and for the maximum number of nearest neighbours to test.

Figure 6. BAS Accuracy Calculator

2.3 Test Results Analysis

The resulting output is a Biometric Authentication System Results file. It displays the type of test that was run (in this case Biometric), the intra-class and inter-class sample sizes, the test size, the FRR, the FAR, the performance, the test subject sample averages, and the KNN (k-Nearest Neighbour) setting.

In biometrics, the FAR is the measure of the likelihood that the biometric security system will incorrectly accept an access attempt by an unauthorized user. This is considered the most serious of biometric security errors, since it gives unauthorized users access to a system that they should not access. The FAR is typically stated as the ratio of the number of false acceptances to the number of identification attempts. The FRR is the measure of the likelihood that the biometric system will incorrectly reject an access attempt by an authorized user. A false rejection occurs when a security system fails to verify an authorized user. This type of error does not indicate a flaw in the system. The FRR is typically stated as the ratio of the number of false rejections to the number of identification attempts.
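These rates can be computed directly from the true and predicted classes of each comparison. The Python sketch below makes one assumption about the denominators that the text leaves open: the FAR is taken over attempts by unauthorized users ("between" comparisons) and the FRR over attempts by authorized users ("within" comparisons), which is the usual convention. The label strings and function name are illustrative, and this is not the BAS Accuracy Calculator itself.

```python
def error_rates(actual, predicted):
    """Compute FAR, FRR, and overall performance from paired class labels.

    actual, predicted: equal-length lists of "within" (authorized/genuine)
    or "between" (unauthorized/impostor) labels.
    """
    pairs = list(zip(actual, predicted))
    genuine = sum(a == "within" for a, _ in pairs)
    impostor = sum(a == "between" for a, _ in pairs)
    # A false rejection: a genuine comparison classified as impostor.
    false_rejects = sum(a == "within" and p == "between" for a, p in pairs)
    # A false acceptance: an impostor comparison classified as genuine.
    false_accepts = sum(a == "between" and p == "within" for a, p in pairs)
    correct = sum(a == p for a, p in pairs)
    return {
        "FRR": false_rejects / genuine if genuine else 0.0,
        "FAR": false_accepts / impostor if impostor else 0.0,
        "performance": correct / len(pairs) if pairs else 0.0,
    }
```

With 2 genuine and 8 impostor comparisons, one false rejection and one false acceptance give an FRR of 0.5, a FAR of 0.125, and a performance of 0.8. When impostor comparisons dominate the test set, overall performance can remain high even if the FRR is high, which may be worth keeping in mind when reading the results below.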

The final results for the train and test data from Professor Stewart’s students were obtained. For Test 2, the train and test data were reversed to obtain performance.

Figure 7. System Results – Test 1

Figure 8. System Results – Test 2

In the first experiment, the FRR ranged from 90% to 100%. Again, this is the rate at which students would be incorrectly rejected as authorized users of the system. This is a high rate of rejection, which could affect students accessing the system. The FAR ranged from 0% to 5.29%. These low rates indicate that the system would rarely accept access attempts by unauthorized users, which matters because false acceptance is the most serious of security errors. Performance ranged from 89% to 94%.

For the second experiment, we reversed the train and test data. The FRR was close to that of the first experiment, ranging from 88% to 100%, and the FAR ranged from 0.45% to 10% (slightly higher than in the first experiment). Performance ranged from 85% to 94%. Performance for the two experiments was very close in range.

3. Stylometry Java Extractor Program - Additional Work Conducted

During our project time-frame, a Stylometry Extractor program written in the Java programming language was also produced. Our initial project deliverable was to modify the existing Keystroke Java code so that it could perform the Stylometry data extraction and calculations from a sample text file. After consulting with Java-trained software engineers, a review of the current Keystroke code was undertaken. However, due to the complexity of the Keystroke code and the lack of appropriate code comments, it was decided to develop new, standalone Stylometry data extraction code. The Stylometry System code would parse a raw data file in standard character-based text format, where the file features would be character-based, word-based, or syntactic. The code would then create a lookup directory, pass data between development module folders, count character features, and perform simple arithmetic calculations. The output would be placed in a delimited, formatted, text-readable file, in a format amenable to creating line, bar, or pie graphs and to data sort analysis.

1st Record contains the name, description, and time the file was created
Stylometry / Stylometry feature / 2:30PM

2nd Record contains the number of parse samples
/ 35 /

3rd Record contains subject ID data
/ Joe Smith / M / age /

4th Record contains Stylometry data
/ number of “numeric”, “alpha character”, “punctuation”, etc. /

5th Record contains calculations
/ number of numerics divided by total number of words, number of alpha divided by total number of words, number of punctuation divided by total number of words, etc. /

Figure 9. Data Record Format

A small program prototype was constructed and checked on a file containing standard English text, for example "The quick brown fox jumped over the lazy dog", with the output in either HTML or CSV format. An abridged form of the program installation instructions is:

Step 1. Create Zip file
Step 2. Extract Zip file to C:/ (create stylometry engine directory)
Step 3. Specify the properties
- Location of the raw data file
- Output directory
- Demographics file location
Step 4. Run batch file
Step 5. Confirmation of batch file run
Step 6. Output directory - review CSV output

Figure 10. Abridged Install Instructions

In completing the parsing and extraction program, a small feature extraction output was created. As a sample, a small text file was created.

Text = "A B C D E F ABDUL GREG SUNIL ?\':;!.\" 123   Watch the dog run."

Figure 11. Text File

The program contained 21 Stylometry-type features for extraction. The output data was presented in two formats: feature counts and calculations. The feature count output, as seen in Figure 12, represents a simple numeric count of the Stylometry features, which were used as input to the File Data Management module of the program.

Word count 13
Vowel count 11
Alphanumeric count 37
NUMERIC count 3
ALPHABETIC count 34
UPPERALPHABETIC count 21
LOWERALPHABETIC count 13
SPACE count 16
ONELETTER count 6
TWOLETTER count 0
THREELETTER count 3
FOURLETTER count 1
FIVELETTER count 3
CHARACTERS count 62
EIGHTPUNCTUATION count 9
PERIOD count 2
COMMA count 1
QUESTIONMARK count 1
EXCLAMATIONPOINT count 1
SEMICOLON count 1
COLON count 1

Figure 12. Features Count
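A feature count of this kind can be sketched in a few lines of Python. This is not the project's extractor: the function name `count_features`, the definition of a word as a run of letters, and the choice of the eight punctuation symbols (. , ? ! ; : ' ") are assumptions inferred from the counts in the figure.

```python
import re

VOWELS = set("aeiouAEIOU")
# Assumption: the "eight punctuation symbols" are taken to be . , ? ! ; : ' "
EIGHT_PUNCTUATION = set(".,?!;:'\"")
WORD_LENGTH_NAMES = ["ONELETTER", "TWOLETTER", "THREELETTER",
                     "FOURLETTER", "FIVELETTER"]


def count_features(text):
    """Return a dict of simple character- and word-based counts."""
    # Assumption: a word is a maximal run of ASCII letters, so "123" is not a word.
    words = re.findall(r"[A-Za-z]+", text)
    counts = {
        "WORD": len(words),
        "VOWEL": sum(ch in VOWELS for ch in text),
        "ALPHANUMERIC": sum(ch.isalnum() for ch in text),
        "NUMERIC": sum(ch.isdigit() for ch in text),
        "ALPHABETIC": sum(ch.isalpha() for ch in text),
        "UPPERALPHABETIC": sum(ch.isupper() for ch in text),
        "LOWERALPHABETIC": sum(ch.islower() for ch in text),
        "SPACE": text.count(" "),
        "CHARACTERS": len(text),
        "EIGHTPUNCTUATION": sum(ch in EIGHT_PUNCTUATION for ch in text),
        "PERIOD": text.count("."),
        "COMMA": text.count(","),
        "QUESTIONMARK": text.count("?"),
        "EXCLAMATIONPOINT": text.count("!"),
        "SEMICOLON": text.count(";"),
        "COLON": text.count(":"),
    }
    # Words of exactly one to five letters.
    for length, name in enumerate(WORD_LENGTH_NAMES, start=1):
        counts[name] = sum(len(w) == length for w in words)
    return counts


# The sample text as printed in Figure 11.
SAMPLE = "A B C D E F ABDUL GREG SUNIL ?':;!.\" 123   Watch the dog run."
counts = count_features(SAMPLE)
for name, value in counts.items():
    print(name, "count", value)
```

On the text as printed in Figure 11, this sketch gives the same word, vowel, letter, digit, space, and word-length counts as the figure. The printed text contains no comma, so the comma, eight-punctuation, and total character counts each come out one lower than the figure's values.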

As noted, this program can be used, if needed, to extract Stylometry features from simple text files, and it can also be augmented to include additional feature files for any future Stylometry work.

4. Conclusions

As noted in the two experiments conducted, the rates for FRR were very high, indicating that additional testing in conjunction with the Keystroke work needs to be conducted. Because of the lack of time, additional data from other users, such as team members, friends, and relatives, could not be collected, so no further experiments could be performed.

Stylometry biometrics is a very useful tool that can be effectively used to accurately authenticate students taking on-line tests. For this type of biometrics to become a standard tool for determining authorship, there needs to be a more focused review of the appropriate feature list to be used. In order to identify the author, extracting the most appropriate features to represent the style of the author is very important [6]. Some of the features are not appropriate for student authentication, and the features should not all be equally utilized. Students taking on-line exams may not necessarily use features such as certain types of punctuation or other syntactic-based features. Also, the feature extractor tool should be modified so that features can be easily added or removed as required. Additional work is needed to implement this requirement.
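One possible way to make features easy to add or remove is a small registry that maps feature names to functions. The following Python sketch is only an illustration of that idea; the names `FEATURES`, `feature`, and `extract` are hypothetical and are not part of the existing extractor. The two example ratios follow the feature definitions given in the abstract.

```python
import re

# Hypothetical registry: feature name -> function computing it from text.
FEATURES = {}


def feature(name):
    """Decorator that registers a feature function under the given name."""
    def register(func):
        FEATURES[name] = func
        return func
    return register


@feature("uppercase_ratio")
def uppercase_ratio(text):
    # Number of uppercase characters / number of alphabetic characters.
    alpha = sum(ch.isalpha() for ch in text)
    return sum(ch.isupper() for ch in text) / alpha if alpha else 0.0


@feature("two_letter_word_ratio")
def two_letter_word_ratio(text):
    # Number of two-letter words / total number of words.
    words = re.findall(r"[A-Za-z]+", text)  # assumption: words are runs of letters
    return sum(len(w) == 2 for w in words) / len(words) if words else 0.0


def extract(text, enabled=None):
    """Compute only the enabled features (all registered ones by default)."""
    names = enabled if enabled is not None else list(FEATURES)
    return {name: FEATURES[name](text) for name in names}


print(extract("Hi to you"))
print(extract("Hi to you", enabled=["uppercase_ratio"]))
```

With this arrangement, adding a feature means writing one decorated function, and dropping a feature for a given experiment means leaving its name out of the `enabled` list.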

4.1. Recommendations

Greater testing accuracy is needed, which can be accomplished by enlarging the sample size. Also, as per Professor Stewart’s recommendation, additional work is needed on the demographic collection page of the KES system, as it is not beneficial when taking an exam.

Also, the new KES online test-taking system works well in capturing student data, but the raw data it generates must be manually edited to remove the keystroke data before it can be used effectively by the Stylometry extractor. Eventually, the system should be able to parse the new data format, which includes both the keystroke and text data, so that no manual editing is required.

More work should be conducted in the field of Stylometry. As there is currently a Keystroke system already in place, greater effort should be made to ensure that Biometric Keystroke and Stylometry programs work cohesively and seamlessly together. The results of these initial tests do point to an eventual comprehensive system that can be effectively used by Pace University and other higher learning institutions.

5. References

[1] Ahmed Abbasi, Hsinchun Chen, and Jay F. Nunamaker Jr., “Stylometric Identification in Electronic Markets: Scalability and Robustness”, Journal of Management Information Systems, Summer 2008, Vol. 25, No. 1, pp. 49-78.

[2] R.W. Bailey, “Authorship Attribution in Forensic Setting”, in D.E. Ager, F.E. Knowles, and J. Smith (eds), Advances in Computer-Aided Literary and Linguistic Research, AMLC, Birmingham, 1979. http://catalogue.nla.gov.au/Record/1366817

[3] C.E. Chaski, “A Daubert-Inspired Assessment of Current Techniques for Language-Based Author Identification”, Technical Report 1098, ILE Technical Report, 1998. http://portal.acm.org/results.cfm?query=&querydisp=&source_query=&start=501&srt=meta%5Fpublished%5Fdate%20dsc&short=0&source_disp=&since_month=&since_year=&before_month=&before_year=&coll=DL&dl=GUIDE&termshow=matchall&range_query=&CFID=112598117&CFTOKEN=44470178

[4] C.E. Chaski, “Who is at the Keyboard. Author Attribution in Digital Evidence Investigations”, International Journal of Digital Evidence, 4(1), 2005. http://www.utica.edu/academic/institutes/ecii/publications/articles/B49F9C4A-0362-765C-6A235CB8ABDFACFF.pdf

[5] M. Koppel and J. Schler, “Exploiting Stylistic Idiosyncrasies for Authorship Attribution”, in Workshop on Computational Approaches to Style Analysis and Synthesis, 2003. http://u.cs.biu.ac.il/~koppel/papers/ijcai-idiosyncrasy-final.pdf

[6] Daniel Pavelec, Edson Justino, and Luiz S. Oliveira, “Author Identification using Stylometric Features”, Inteligencia Artificial, Revista Iberoamericana de Inteligencia Artificial, AEPIA, 2007, pp. 59-65.

[7] Rong Zheng, Jiexun Li, Hsinchun Chen and Zan Huang, “A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques”, Journal of the American Society for Information Science and Technology, February, 2006.
