The complex networks approach for authorship attribution of books



Physica A 391 (2012) 2429–2437


Physica A

journal homepage: www.elsevier.com/locate/physa

Ali Mehri ∗, Amir H. Darooneh, Ashrafalsadat Shariati

Department of Physics, Zanjan University, P.O. Box 45196-313, Zanjan, Iran

Article info

Article history: Received 1 July 2011; received in revised form 12 November 2011; available online 11 December 2011

Keywords: Complex systems; Computational linguistics; Nonextensive statistical mechanics; Authorship attribution

Abstract

Authorship analysis by means of textual features is an important task in linguistic studies. We employ complex networks theory to tackle this disputed problem. In this work, we focus on some measurable quantities of the word co-occurrence network of each book for authorship characterization. Based on the network features, an attribution probability is defined for authorship identification. Furthermore, two scaling exponents, the q-parameter and the α-exponent, are combined to classify personal writing style with acceptably high resolution power. The q-parameter, generally known as the nonextensivity measure, is calculated for the degree distribution, and the α-exponent comes from a power law relationship between the number of links and the number of nodes in the co-occurrence networks constructed for different books written by each author. The applicability of the presented method is evaluated in an experiment with thirty-six books by five Persian litterateurs. Our results show a high accuracy rate in authorship attribution.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Recently, complex networks theory has emerged as a suitable framework for studying social and natural systems [1,2]. In the language of networks, the system’s entities are regarded as vertices of a graph, and the links between vertices represent their interactions. It is not necessary to know the nature of the interactions or the microscopic details of the entities in order to construct a graph corresponding to the system. Hence complex networks theory facilitates the study of systems containing different types of entities with unknown or complicated interactions between them. Human language has the same kind of complexity [3,4].

A great deal of human knowledge is contained in the written part of language, namely texts. There are more than 129,864,880 books in the world and several tens of billions of documents on the Internet; hence it is impossible to find the desired information manually. Nowadays complex networks theory provides suitable tools for automatic information retrieval from a particular text or a repository of texts. Computing the features of a textual network, or performing random walks on it, helps us to retrieve much useful information about the text, in tasks such as word sense disambiguation, text summarization, keyword extraction, coreference resolution, question answering, document classification and assessing the quality of texts [5–9]. Some network metrics, such as the degree distribution, characteristic length and clustering coefficient, are strongly correlated with the coherence, cohesion and adherence of texts as scored by humans. Statistical mechanical approaches such as entropic methods are other candidates for information extraction from texts [10,11].

Authorship analysis, one of the important goals of text mining, has a long history going back to medieval studies. Mosteller and Wallace’s well-known pioneering work on the disputed Federalist papers, based on the frequency of function words, paved the way for later investigations in this context [12]. Owing to the increasing need for automatic authorship analysis, much research has been carried out on this topic. The various aspects of authorship analysis (authorship identification, authorship characterization and similarity detection) play a crucial role in forensic analysis, plagiarism detection, spam filtering, information management and many other linguistic applications [13–16]. So far, many textual features and various classification techniques have been used in this framework. Zipf analysis, entropic metrics and many other features have been introduced over the years [17–21]. In general, these textual features can be divided into lexical, syntactic, semantic and application-specific categories. Statistical analysis such as Naïve Bayes, and machine learning methods such as decision trees, neural networks and support vector machines, are the most promising classification techniques and are frequently applied in the literature [22–24].

∗ Corresponding author. Tel.: +98 2415152545. E-mail addresses: [email protected], [email protected] (A. Mehri).

0378-4371/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.physa.2011.12.011

More recently, some network features such as the clustering coefficient, degree correlation and average out-degree have been applied to authorship characterization [25]. Here, we intend to add other features to this network feature set and to assign a vector to each author that distinguishes him/her from others with a probabilistic method, while performing only minimal preprocessing on the texts. Each author’s vector contains the average values of the corresponding network features of his/her works, such as the mean degree, the characteristic length and the q-parameter of the degree distribution. A new scaling behavior will be introduced for the relation between the number of links and the number of nodes in the co-occurrence networks extracted from each author’s works.

In the next section, we introduce a probabilistic method for authorship attribution based on complex networks theory. The third section contains the technical details of an authorship attribution experiment with several literary works by five Persian litterateurs. We report the results of the experiment, with some explanations, in Section 4. The final section summarizes our work. We also briefly review some structural features of networks and Tsallis statistics in Appendices A and B, respectively.

2. Authorship attribution by complex networks theory

Natural languages, as complex systems, are neither ordered and predictable (like the configuration of atoms in a crystal lattice) nor random and chaotic (like the motion of gas molecules); their behavior is a mixture of these two features. Network (graph) theory provides a powerful tool to study such complex systems. A network consists of a collection of $N_n$ vertices or nodes and a collection of $N_l$ edges or links, used to model pairwise relations between the objects of a system. In the physical context, the number of vertices, $N_n$, identifies the number of the system’s constituents and defines the physical size of the network. A popular structure for network representation is the adjacency matrix, $A = [a_{ij}]_{N_n \times N_n}$,

$$a_{ij} = \begin{cases} \omega_{ij} & \text{if } i \text{ and } j \text{ are adjacent} \\ 0 & \text{otherwise} \end{cases}$$

where $\omega_{ij}$ is the weight of a link, associated with the strength of the interaction between vertices $i$ and $j$. In this study, we construct undirected ($\omega_{ij} = \omega_{ji}$), unweighted ($\omega_{ij} = 1$) networks for texts.

2.1. Network construction for text

Various criteria for the relation between the terms of a list can be used to create its network. Co-occurrence, syntactic and semantic dependencies are applied in network construction for a text. Word types are represented as vertices in the co-occurrence network of a text, and there is a link between nearest neighbors. In the extracted co-occurrence network, the number of nodes, $N_n$, is equal to the vocabulary size of the text, $N_v$ ($N_n = N_v$), whereas the number of links, $N_l$, is less than or equal to the text size, $N_t$ ($N_l \le N_t$). If the beginning of the text is connected to its end, the total strength of the corresponding weighted co-occurrence network is identical to the text size; that is, $\sum_{i,j} a_{ij} = N_t$ in the directed-weighted network and $\sum_{i<j} a_{ij} = N_t$ in the undirected-weighted one. The number of links in the unweighted co-occurrence network of a text is the number of its distinct word bigrams. In Fig. 1 we show an undirected-unweighted co-occurrence network and its adjacency matrix for a short example text. The network structure of each book, and hence its writing style, can be uncovered by means of different measurable quantities. Several such quantities have been defined to investigate different features of networks. For authorship attribution we apply the characteristic length, the average degree, the degree distribution, the average nearest-neighbors degree and the average clustering coefficient, which are briefly described in Appendix A.
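The construction described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `cooccurrence_network` and `adjacency_matrix` are hypothetical, and ignoring self-loops for immediately repeated words is an assumption the text does not address. The token list is the short quote used in Fig. 1.

```python
def cooccurrence_network(tokens, close_loop=True):
    """Build an undirected, unweighted co-occurrence network from a token list.

    Returns (nodes, edges): nodes is the set of word types, and edges is a set
    of frozensets {u, v}, one per distinct pair of adjacent word types.
    """
    nodes = set(tokens)
    pairs = list(zip(tokens, tokens[1:]))
    if close_loop and len(tokens) > 1:
        pairs.append((tokens[-1], tokens[0]))  # connect the tail to the head
    # Assumption: an immediately repeated word does not create a self-loop.
    edges = {frozenset(pair) for pair in pairs if pair[0] != pair[1]}
    return nodes, edges


def adjacency_matrix(nodes, edges):
    """Return (ordered node list, symmetric 0/1 matrix as nested lists)."""
    order = sorted(nodes)
    index = {word: i for i, word in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for edge in edges:
        u, v = tuple(edge)
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1
    return order, matrix


tokens = "where there is love there is life".split()
nodes, edges = cooccurrence_network(tokens)
order, matrix = adjacency_matrix(nodes, edges)
print(len(tokens), len(nodes), len(edges))  # N_t, N_n, N_l -> 7 5 6
```

Storing each link as a `frozenset` makes the pair unordered, so "there is" occurring twice, or a pair occurring in both orders, contributes a single link, in line with the undirected-unweighted choice made in this study.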

2.2. Another scaling law for author’s oeuvres

Heaps’ law is an important empirical law in linguistics, which states that the vocabulary keeps growing with corpus size [4]. In other words, this law determines how the size of the inverted index scales with the size of the corpus. Heaps’ law has a cumulative nature and indicates a power law relation between the length of various fractions of a corpus and their vocabulary size. Larger fractions necessarily contain all word tokens of smaller fractions. Hence, the vocabulary set of a short fraction of a corpus is a subset of the vocabulary set of its longer fractions (Fig. 2(a)). In the general case, Heaps’ law implies a power law relation between the number of distinct types and the collection size. In the context of network theory, the inverse Heaps’ law explains the relationship between the number of links and the number of nodes: $N_l \approx N_n^{\beta}$. Meanwhile, during this work we observed another interesting regularity in the co-occurrence networks of every author’s works. We find a power law relationship, with a certain scaling exponent (α-exponent) for each writer, between the number of links ($N_l$) and the number of nodes ($N_n$) in the co-occurrence networks extracted for each of his/her books: $N_l \approx N_n^{\alpha}$. This scaling exponent can be used as a new lexical feature in authorship identification. It is worth noting that this law differs from Heaps’ law, because the new scaling law does not have a cumulative nature: it describes a power law dependency of the number of word bigrams in the various works of a single author on their vocabulary size. In this case, a specific book does not necessarily contain all word tokens of another one, so the vocabulary set of one book is not necessarily a subset of that of another (Fig. 2(b)). In practice, calculating the α-exponent requires comparatively low computational effort, which is its principal merit.

Fig. 1. (Color online) Undirected-unweighted co-occurrence network and adjacency matrix for a short quote by Mahatma Gandhi, “where there is love, there is life”, with $N_n = N_v = 5$, $N_l = 6$ and $N_t = 7$. Punctuation marks and numbers should be excluded before the network construction process. The head and tail of the text are also connected.

Fig. 2. (Color online) Schematic illustration of the relation between the vocabulary sets used in (a) Heaps’ law and (b) the new scaling law. In Heaps’ law the vocabularies of smaller corpus fractions are subsets of larger ones. In the new scaling law, on the other hand, a small vocabulary is not necessarily a subset of a large vocabulary.

2.3. Authorship attribution probability

It is convenient to assign to each author a vector containing the average properties of the networks of his/her available oeuvre. Now, suppose we have a list of authors together with the network properties of their writings, and a book by an unknown author whose authorship we wish to identify. First, we calculate the network properties of the unknown book. In the rest of this section, we introduce a probabilistic approach to authorship attribution. The attribution probability for a single measurement, $p_a(m)$, can be defined as

$$p_a(m) = 1 - \frac{|M_x(m) - M_a(m)|}{\sum_{a'=1}^{N_a} |M_x(m) - M_{a'}(m)|} \tag{1}$$

where the index $a$ refers to an author and $m$ denotes a certain measurement. $|M_x(m) - M_a(m)|$ is the distance between the value of the considered measurement for the unknown book, $M_x(m)$, and its value for a specific author, $M_a(m)$. This distance is divided by the sum of all such distances over the authors in our list, and $N_a$ stands for the number of authors in the list. $p_a(m)$ is thus the authorship attribution probability provided by a single measurement of the book’s network. For the sake of higher accuracy, we combine the attribution probabilities of several measurements. Since the network features discussed here have distinct natures and refer to different concepts, it is acceptable to assume that they are statistically independent of each other. Therefore, the total authorship attribution probability, $p_a$, is defined as the product of these single-measurement attribution probabilities,

$$p_a = N \prod_{m=1}^{N_m} p_a(m) \tag{2}$$

where $N_m$ denotes the number of measurements applied in the authorship attribution process and $N$ is a normalization factor defined by $\sum_{a'=1}^{N_a} p_{a'} = 1$. This hybrid method combines different features of the text for author characterization. The combination of more textual features can lead to a better result. In the next section, we perform an experiment to test our authorship identification method.

Table 1. Thirty-six adopted books by five Persian litterateurs and their attribution probability for each author, obtained by our probabilistic method. pB, pI, pM, pN and pS stand respectively for the attribution probability for Bahayi, Iqbal, Molavi, Nezami and Salman. The maximum attribution probability of each book is highlighted in bold in each row. Qazal, Qateh, Masnavi, Robaayi and Qasideh are different styles of Persian poetry, and each writer may have some books with these names.

| Author | Book | pB | pI | pM | pN | pS |
|---|---|---|---|---|---|---|
| Bahayi (1547–1621), Persian scholar | Qazaliaat | **0.47** | 0.21 | 0.06 | 0.07 | 0.19 |
| | Qataat | **0.45** | 0.18 | 0.08 | 0.09 | 0.20 |
| | Masnavi | **0.50** | 0.18 | 0.07 | 0.07 | 0.18 |
| | Mokhammasaat | **0.48** | 0.19 | 0.07 | 0.07 | 0.19 |
| | Naan-o-Halvaa | **0.50** | 0.05 | 0.08 | 0.09 | 0.28 |
| | Naan-o-Paneer | **0.33** | 0.27 | 0.07 | 0.08 | 0.25 |
| | Robayiaat | **0.43** | 0.21 | 0.07 | 0.08 | 0.21 |
| | Sheer-o-Shekkar | **0.53** | 0.18 | 0.06 | 0.06 | 0.17 |
| Iqbal (1877–1938), Pakistani poet | Armaqan-e-Hijaz | 0.14 | **0.39** | 0.05 | 0.10 | 0.32 |
| | Asrar-e-Khodi | 0.12 | **0.40** | 0.05 | 0.12 | 0.31 |
| | Javeed-Naameh | 0.03 | 0.28 | 0.13 | 0.19 | **0.37** |
| | Pas Che Bayad Kard | 0.05 | **0.46** | 0.04 | 0.13 | 0.32 |
| | Payam-e-Mashreq | 0.03 | **0.46** | 0.04 | 0.13 | 0.34 |
| | Romouz-e-Bikhodi | 0.06 | **0.45** | 0.04 | 0.12 | 0.33 |
| | Zabour-e-Ajam | 0.04 | **0.50** | 0.04 | 0.11 | 0.31 |
| Molavi (1207–1273), Persian poet | Fihe-ma-Fih | 0.03 | 0.14 | **0.40** | 0.24 | 0.19 |
| | Masnavi-e-Manavi (vol. 1) | 0.01 | 0.11 | **0.45** | 0.27 | 0.16 |
| | Masnavi-e-Manavi (vol. 2) | 0.01 | 0.11 | **0.43** | 0.28 | 0.17 |
| | Masnavi-e-Manavi (vol. 3) | 0.01 | 0.10 | **0.49** | 0.25 | 0.15 |
| | Masnavi-e-Manavi (vol. 4) | 0.01 | 0.12 | **0.52** | 0.29 | 0.06 |
| | Masnavi-e-Manavi (vol. 5) | 0.01 | 0.10 | **0.48** | 0.26 | 0.15 |
| | Masnavi-e-Manavi (vol. 6) | 0.02 | 0.10 | **0.47** | 0.26 | 0.15 |
| Nezami (1141–1209), Persian poet | Haft Peikar | 0.02 | 0.15 | 0.23 | **0.43** | 0.17 |
| | Kherad-Naameh | 0.01 | 0.17 | 0.17 | **0.45** | 0.20 |
| | Khosrow-o-Shirin | 0.02 | 0.15 | 0.24 | **0.41** | 0.18 |
| | Leyli-o-Majnoun | 0.01 | 0.25 | 0.13 | **0.31** | 0.30 |
| | Makhzan-al-Asraar | 0.03 | 0.32 | 0.09 | 0.20 | **0.36** |
| | Sharaf-Naameh | 0.03 | 0.16 | 0.26 | **0.37** | 0.18 |
| Salman (1309–1376), Persian poet | Qataat | 0.05 | 0.36 | 0.05 | 0.10 | **0.44** |
| | Feraq-Naameh | 0.03 | 0.38 | 0.06 | 0.11 | **0.42** |
| | Qasaayed | 0.01 | 0.18 | 0.19 | **0.37** | 0.25 |
| | Robayiaat | 0.25 | **0.33** | 0.05 | 0.09 | 0.28 |
| | Qazaliaat | 0.03 | 0.16 | 0.25 | **0.37** | 0.19 |
| | Tarjiaat | **0.31** | 0.27 | 0.05 | 0.09 | 0.28 |
| | Tarkibaat | 0.19 | **0.35** | 0.05 | 0.09 | 0.32 |
| | Jamshid-o-Khorshid | 0.02 | 0.20 | 0.15 | **0.34** | 0.29 |
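As a rough illustration of Eqs. (1) and (2), the following Python sketch computes the normalized attribution probabilities from per-author feature averages. The function names are hypothetical. The author values are the q and α averages reported in Table 2, whereas the "unknown book" values are invented purely for illustration, and only two of the six features are used to keep the example short.

```python
def measurement_probability(unknown_value, author_values):
    """Eq. (1): one minus each author's share of the total distance."""
    distances = {a: abs(unknown_value - v) for a, v in author_values.items()}
    total = sum(distances.values())
    if total == 0:
        raise ValueError("all authors coincide with the unknown book")
    return {a: 1.0 - d / total for a, d in distances.items()}


def attribution_probabilities(unknown_features, author_features):
    """Eq. (2): normalized product of the single-measurement probabilities."""
    authors = list(author_features)
    products = {a: 1.0 for a in authors}
    for m, x in unknown_features.items():
        p_m = measurement_probability(
            x, {a: author_features[a][m] for a in authors}
        )
        for a in authors:
            products[a] *= p_m[a]
    norm = sum(products.values())  # plays the role of 1/N in Eq. (2)
    return {a: p / norm for a, p in products.items()}


# Author averages taken from Table 2 (q-parameter and alpha-exponent only).
author_features = {
    "Bahayi": {"q": 1.35, "alpha": 1.18},
    "Iqbal": {"q": 1.39, "alpha": 1.59},
    "Molavi": {"q": 1.45, "alpha": 0.84},
    "Nezami": {"q": 1.45, "alpha": 1.70},
    "Salman": {"q": 1.42, "alpha": 1.39},
}
# Hypothetical unknown book; these two values are made up for the example.
unknown_book = {"q": 1.36, "alpha": 1.20}

probabilities = attribution_probabilities(unknown_book, author_features)
best_author = max(probabilities, key=probabilities.get)
```

Because each single-measurement probability is one minus a share of the total distance, an author whose average coincides with the unknown book receives probability 1 for that measurement, and the final normalization makes the combined values sum to one across authors.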

3. Experiment

To evaluate the method, we adopt thirty-six books by five famous Persian writers: Bahayi, Iqbal, Molavi, Nezami and Salman. These litterateurs are selected from different periods in order to study the time evolution of the network structure of the Persian language. Table 1 lists the life period, field of activity and adopted books of each author. Most of these books are freely available on Ganjoor’s website [26].

In the first stage, we map all words to lowercase and exclude all punctuation marks and numbers from all texts. A simple whitespace-based tokenization is applied to delimit word tokens [27]. Then, we construct an undirected-unweighted co-occurrence network for each book without any lemmatization. At this point it is possible to extract structural features for each book. To classify the degree distributions of the extracted networks, we employ the q-exponential distribution as a generalization of the ordinary power law distribution. In Fig. 3 we show the cumulative degree distribution of the book Fihe-ma-Fih, written by Molavi, as a typical representative text, together with fits to the q-exponential function and to a power law function. As seen in the figure, the behavior of the degree distribution is better described by the q-exponential function. The q-parameter is obtained by fitting the cumulative degree distribution with the cumulative q-exponential function. The q-exponential distribution and its application to the analysis of degree distributions are briefly described in Appendix B.
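The preprocessing stage described above (lowercasing, removing punctuation marks and numbers, whitespace tokenization) might look like the following sketch. The function name `preprocess` is hypothetical, and identifying punctuation and numerals through Unicode general categories is an assumption made here so that the same code covers Persian as well as Latin script; the paper does not say how these characters were detected.

```python
import unicodedata


def preprocess(text):
    """Lowercase, drop punctuation and numerals, then split on whitespace."""
    kept = []
    for ch in text.lower():
        category = unicodedata.category(ch)
        # "P*" categories are punctuation; "N*" categories are numerals,
        # including Persian and Arabic-Indic digits.
        if category.startswith("P") or category.startswith("N"):
            continue
        kept.append(ch)
    return "".join(kept).split()


example_tokens = preprocess("Where there is love, there is life. 1869")
```

Dropping the characters outright, rather than replacing them with spaces, follows the wording "exclude"; a side effect is that a word pair joined only by punctuation (for example `love,there` with no space) would merge into one token, so replacing with a space is a reasonable alternative.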

Fig. 3. (Color online) Cumulative degree distribution (blue diamonds) of the book Fihe-ma-Fih written by Molavi. The red dashed curve is the best fit with the q-exponential function, with $q = 1.46$, $r^2 = 0.999$ and std. err. $= 0.002$. The green dotted line is the power law fit, with $r^2 = 0.988$ and std. err. $= 0.086$. The empirical data are better fitted by the q-exponential function, especially for small degrees.

As explained in the previous section, there is a power law dependency between the number of links and the number of nodes in the co-occurrence network of each book: $N_l \approx N_n^{\alpha}$. We apply the following numerical method to estimate the power law exponent, α, for each author. Assume that $N_b$ books by a particular author are available. We first extract the number of links and the number of nodes for each book, and then calculate $N_b$ α-exponents, using $N_b - 1$ data points in each fitting process. That is, we eliminate one of the $N_b$ books in turn and find the α-exponent for the remaining $N_b - 1$ books by fitting a power law model function. Then we calculate the average α-exponent and its standard deviation for the selected author. The value of the α-exponent differs from author to author because of their different writing styles. Therefore, we use this exponent for authorship characterization in the next section. Note again that the α-exponent differs fundamentally from Heaps’ β-exponent, since they arise from different statistical concepts. Fig. 4 displays the presented scaling law and Heaps’ law for the books of Bahayi as a representative author. In the curve of the new scaling law, every data point (blue diamond) belongs to a single book. To test Heaps’ law, we prepare a large corpus including all of Bahayi’s books and then extract the numbers of links and nodes in the co-occurrence networks constructed for various fractions of this corpus. The empirical data for the new scaling law and for Heaps’ law are well fitted by power law functions, with exponents $\alpha = 1.18$ (blue dashed line in the left graph) and $\beta = 1.33$ (red dotted line in the right graph), respectively.

Fig. 4. (Color online) Two scaling laws for Bahayi’s books. The vertical axis is the number of links and the horizontal axis is the number of vertices. The blue diamonds in the left graph and the red squares in the right graph are empirical data for the newly presented law and for Heaps’ law, respectively. The blue dashed line and the red dotted line are power law model functions with scaling exponents $\alpha = 1.18$ and $\beta = 1.33$.

Table 2. Average network properties and their standard errors for five Persian litterateurs: Bahayi, Iqbal, Molavi, Nezami and Salman. The network quantities are, respectively, the average nearest-neighbors degree, the degree, the clustering, the characteristic length, the q-parameter of the degree distribution and the α-exponent.

| Quantity | Bahayi | Iqbal | Molavi | Nezami | Salman |
|---|---|---|---|---|---|
| knn | 24.61 ± 13.80 | 129.40 ± 39.27 | 422.13 ± 105.02 | 299.77 ± 101.32 | 151.64 ± 113.45 |
| k | 3.83 ± 0.66 | 7.05 ± 0.96 | 9.08 ± 0.38 | 10.19 ± 1.59 | 7.62 ± 2.52 |
| c | 0.09 ± 0.03 | 0.19 ± 0.04 | 0.33 ± 0.02 | 0.27 ± 0.04 | 0.21 ± 0.08 |
| l | 3.85 ± 0.30 | 3.10 ± 0.09 | 2.98 ± 0.01 | 2.99 ± 0.09 | 3.17 ± 0.22 |
| q | 1.35 ± 0.05 | 1.39 ± 0.02 | 1.45 ± 0.01 | 1.45 ± 0.01 | 1.42 ± 0.02 |
| α | 1.18 ± 0.01 | 1.59 ± 0.08 | 0.84 ± 0.13 | 1.70 ± 0.07 | 1.39 ± 0.03 |
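The leave-one-out estimation of the α-exponent described in this section can be sketched as follows. Two points are assumptions rather than details taken from the paper: the power law is fitted by ordinary least squares on log-log axes, and the model is allowed a free prefactor ($N_l = C\,N_n^{\alpha}$). The function names are hypothetical and the (nodes, links) pairs are synthetic, generated from an exact power law so that the expected exponent is known.

```python
import math
from statistics import mean, stdev


def fit_exponent(points):
    """Least-squares slope of log(N_l) against log(N_n).

    Assumption: the model has a free prefactor, N_l = C * N_n**alpha.
    """
    xs = [math.log(n_nodes) for n_nodes, _ in points]
    ys = [math.log(n_links) for _, n_links in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def leave_one_out_alpha(points):
    """Fit alpha N_b times, each time omitting one of the N_b books."""
    estimates = [
        fit_exponent(points[:i] + points[i + 1:]) for i in range(len(points))
    ]
    return mean(estimates), stdev(estimates), estimates


# Synthetic (N_n, N_l) pairs for four hypothetical books of one author.
books = [(n, 2.0 * n ** 1.2) for n in (500, 1200, 3000, 8000)]
alpha_mean, alpha_spread, alpha_estimates = leave_one_out_alpha(books)
```

At least three books with distinct vocabulary sizes are needed, since each fit uses $N_b - 1$ points and a slope cannot be determined from a single point. With real data the spread across the $N_b$ estimates gives a simple measure of how stable the exponent is for a given author.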

We calculate the average of the network quantities and their standard deviation for every author’s literary works. The average nearest-neighbors degree, degree, clustering coefficient, characteristic length, q-parameter and α-exponent, together with their standard deviations, are reported in Table 2. We use these features for authorship identification with the presented probabilistic method. In the calculation of the attribution probability for each book, the features of that book are excluded when obtaining the average features of its author.


4. Results and discussion

We report the authorship attribution probabilities of all thirty-six adopted books by the five Persian authors in Table 1. The maximum attribution probability of each book is highlighted in bold. The attribution probabilities for Bahayi, Iqbal, Molavi, Nezami and Salman are denoted by pB, pI, pM, pN and pS, respectively. All of Bahayi’s and Molavi’s books are attributed to the right author. With one exception each, all of Iqbal’s and Nezami’s books are also correctly attributed. In the case of Salman, the attribution process correctly attributes just two of his books. In all eight unsuccessful cases, the attribution probability of the correct author is not the maximum value but is one of the highest values. Furthermore, if authors have similar writing styles, the accuracy of authorship attribution decreases. The standard error has a crucial effect on the ability of a particular feature to identify authorship: a quantity with a smaller error gives a better result in this procedure. On the other hand, using more books to extract the network features of each author provides more accurate attribution. The identification accuracy rate of an authorship attribution method shows its predictive power. The common form of the accuracy can be written as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives in the attribution procedure [28].

• True Positive: Number of books truly attributed to the right author.
• True Negative: Number of books truly not attributed to the wrong author.
• False Positive: Number of books falsely attributed to the wrong author.
• False Negative: Number of books falsely not attributed to the right author.

From Table 1, we immediately see that TP = 28, TN = 136, FP = 8 and FN = 8. Therefore, the accuracy level reached with network features is 91%. This confirms the ability of network features in authorship analysis. It is worth remarking that some factors, including the number of candidate authors and the number and size of the available known texts, have a crucial effect on the accuracy rate of attribution methods [22].
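The counts quoted above are consistent with treating every (book, candidate author) pair as one decision, which gives 36 × 5 = 180 decisions in total. This reading is an interpretation of the reported numbers rather than something the text states explicitly, and the function names below are hypothetical. Under that assumption, Eq. (3) can be checked with a short calculation:

```python
def attribution_counts(n_books, n_authors, n_correct):
    """Derive TP, TN, FP and FN, counting each (book, author) pair once."""
    tp = n_correct            # books assigned to their true author
    fn = n_books - n_correct  # true author not chosen for these books
    fp = n_books - n_correct  # one wrong author chosen for each missed book
    tn = n_books * n_authors - tp - fn - fp
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Eq. (3)."""
    return (tp + tn) / (tp + tn + fp + fn)


tp, tn, fp, fn = attribution_counts(n_books=36, n_authors=5, n_correct=28)
print(tp, tn, fp, fn)                      # 28 136 8 8
print(round(accuracy(tp, tn, fp, fn), 2))  # 0.91
```

Note that the fraction of books assigned to their true author is 28/36, about 78%; the 91% figure is higher because correctly rejected wrong authors count as true negatives.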

The resolution power of textual features determines their usefulness in authorship characterization. The main idea is to distinguish between different authors by means of relations between two or more features of the text. The resolution power of the mutual relations between all the mentioned features for the five selected authors is displayed in Fig. 5. The combination of the α-exponent and the q-parameter has acceptable resolution power in authorship characterization. Note that in other characterization problems a combination of other features may be required to achieve high resolution power and accurate results.

5. Conclusion

We have studied human-written texts by means of complex networks theory in order to present new features for authorship analysis. The networks produced for each author’s books have specific features related to his/her writing style, which can be used to reveal text authorship. As a new feature, we introduced a power law relation between the number of links (distinct word bigrams) and the number of nodes (distinct words) in the undirected-unweighted co-occurrence network of a text. In an experiment with thirty-six books by five Persian litterateurs, the presented structural features have been used in the authorship attribution process without any lemmatization of the texts. The accuracy of our results shows that the presented network features are applicable to authorship analysis. A combination of network properties with high resolution power provides an appropriate method for authorship characterization. Here the combination of the α-exponent and the q-parameter (nonextensivity measure) has an acceptable resolution for authorship characterization. In order to improve the accuracy of the analysis, these network features can be combined with other textual features in language analysis.

It is worth pointing out that the same strategy can be used to investigate nucleotide sequences in order to characterize their role or functionality. The same method may also be applied to the analysis of time series, which contain important information about complex systems. The translation of a time series into an alphabetical sequence [29] allows us to construct the corresponding network and to describe the peculiarities of such systems in terms of network metrics.

Acknowledgments

We would like to thank M. Moradi, Y. Azizi and M. Nikbakht for useful discussions.


Fig. 5. (Color online) Resolution power of network features (Table 2) in authorship characterization for five selected Persian authors: Bahayi (violet diamond), Iqbal (pink square), Molavi (blue triangle), Nezami (green mutual triangles) and Salman (orange circle). The combination of the α-exponent with other features, especially with the q-parameter, has applicable resolution in this characterization problem.

Appendix A. Some structural features of networks

In a connected network, every vertex is reachable from any other vertex by following the links of the network. The path length between two connected vertices $i$ and $j$ is the number of links traversed from vertex $i$ to reach vertex $j$. The shortest path length between vertices $i$ and $j$ is denoted by $l_{ij}$, and its average over all pairs of network vertices is the characteristic length of the network,

$$l = \frac{1}{N_n (N_n - 1)} \sum_{i,j=1}^{N_n} l_{ij}. \tag{A.1}$$

The degree of a vertex $i$ is the number of links incident on it: $k_i = \sum_{j=1}^{N_n} a_{ij}$. We are interested in asymptotic regularities in the network as a large statistical system. The degree distribution $p(k)$ of a network is a convenient function for displaying its global behavior; $p(k)$ is defined as the probability that a randomly chosen vertex has degree $k$. It is easy to see that the average degree of a network, $\langle k \rangle$, is

$$\langle k \rangle = \frac{1}{N_n} \sum_{i=1}^{N_n} k_i = \sum_{k=k_{\min}}^{k_{\max}} k\, p(k) \tag{A.2}$$

where $k_{\min}$ and $k_{\max}$ are the minimum and maximum values of the degree in the network. It has been shown that in many natural systems the degree distribution exhibits power law behavior. We use the q-exponential distribution, as a generalized power law distribution, to analyze the degree distribution. A brief description of the derivation of the q-exponential distribution is provided in Appendix B.

The degree correlation between adjacent vertices provides excellent evidence of structural ordering in the network. Inspired by the Markovian approximation, the simplest degree correlation can be defined by the two-point conditional probability $p(k'|k)$, the probability of the existence of a link between two vertices with degrees $k'$ and $k$. In practical cases, one first computes the average nearest-neighbor degree for every node in the network: $k_{nn,i} = \sum_{j \in V(i)} k_j / k_i$, where the set $V(i)$ contains the nearest neighbors of vertex $i$. The average of $k_{nn,i}$ over the network vertices will be

\[ k_{nn} = \frac{1}{N_n} \sum_{i=1}^{N_n} k_{nn,i}. \tag{A.3} \]
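A minimal sketch of Eq. (A.3) follows. The adjacency-set representation, the function names and the toy network are assumptions made for illustration; assigning zero to isolated vertices is also an assumption, since the text does not treat that case.

```python
def nearest_neighbour_degree(adj, i):
    """k_nn,i: mean degree of the nearest neighbours of vertex i."""
    k_i = len(adj[i])
    if k_i == 0:
        return 0.0  # assumed convention for isolated vertices
    return sum(len(adj[j]) for j in adj[i]) / k_i


def average_nearest_neighbour_degree(adj):
    """k_nn: average of k_nn,i over all vertices, as in Eq. (A.3)."""
    return sum(nearest_neighbour_degree(adj, i) for i in adj) / len(adj)


# Hypothetical toy network with links a-b, a-c, a-d and b-c.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}

for vertex in sorted(toy):
    print(vertex, nearest_neighbour_degree(toy, vertex))
print(average_nearest_neighbour_degree(toy))
```

In the toy network the low-degree vertex d has the largest $k_{nn,i}$ because its only neighbor is the hub a.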

The tendency to form cliques in the neighborhood of the network vertices is referred to as its clustering. The clustering coefficient $c_i$ of a given vertex $i$ is defined as the ratio of the number of links between the neighbors of $i$ to the maximum number of such links: $c_i = [\sum_{j,l} a_{ij} a_{jl} a_{li}] / [k_i (k_i - 1)]$. According to this definition, $c_i$ lies in the range $0 \le c_i \le 1$. It will be zero for $k_i \le 1$. Now, the average clustering coefficient of a network can be written as

\[ c = \frac{1}{N_n} \sum_{i=1}^{N_n} c_i. \tag{A.4} \]
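The clustering coefficient can be sketched as below. This is an illustrative example rather than the authors' code: the adjacency-set representation, the function names and the toy network are assumptions, and the network is assumed to have no self-loops. The double sum over $j$ and $l$ counts ordered pairs of linked neighbors, which matches the $k_i(k_i - 1)$ ordered pairs in the denominator.

```python
def clustering_coefficient(adj, i):
    """c_i: linked ordered neighbour pairs of i divided by k_i (k_i - 1)."""
    neighbours = adj[i]
    k_i = len(neighbours)
    if k_i <= 1:
        return 0.0  # c_i is zero for k_i <= 1
    # Count ordered pairs (j, l) of neighbours of i that share a link.
    linked = sum(1 for j in neighbours for l in adj[j] if l in neighbours)
    return linked / (k_i * (k_i - 1))


def average_clustering(adj):
    """c: average of c_i over all vertices, as in Eq. (A.4)."""
    return sum(clustering_coefficient(adj, i) for i in adj) / len(adj)


# Hypothetical toy network with links a-b, a-c, a-d and b-c.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}

for vertex in sorted(toy):
    print(vertex, clustering_coefficient(toy, vertex))
print(average_clustering(toy))
```

Vertex a has three neighbors of which only one pair (b, c) is linked, giving $c_a = 2/6$, while b and c each sit in a closed triangle and have $c_i = 1$.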

Appendix B. Tsallis statistics for text mining

From the physical viewpoint, a text can be considered as a one-dimensional complex many-body system partitioned into a finite number of sections, each of which contains a single element. The elements of the text ($N_t$ words) are arranged in a specific order to convey a specific concept. Traditional statistical mechanics is unable to provide a useful description of this type of system, with finite size and long range correlation, due to its nonextensivity [30]. Tsallis proposed a novel formalism for nonextensive statistics, as a generalization of the standard Boltzmann–Gibbs–Shannon (BGS) statistics, to explain the anomalous behaviors of such nonextensive systems [31,32]. In this context, for the discrete degree distribution of vertices in the co-occurrence network of a text, the Tsallis nonextensive entropy, the core of his formalism, can be written in the following way:

\[ S_q \approx \frac{1 - \sum_{k=k_{\min}}^{k_{\max}} p_k^q}{q - 1} \tag{B.1} \]

where $k_{\min}$, $k_{\max}$ and $p_k$ stand for the minimum degree, the maximum degree and the escort probability of finding a vertex with $k$ links. The parameter $q$ is determined by the microscopic dynamics of the system of interest and characterizes its degree of nonextensivity. In the limit $q \to 1$, the extensive BGS entropy is recovered. The stationary probability distribution, $p_k$, is obtained by maximizing the Tsallis entropy under appropriate constraints. In this case, the probability normalization and the expectation value of degree, in the undirected, unweighted co-occurrence network constructed for the text, are imposed on the system:

\[ \sum_{k=k_{\min}}^{k_{\max}} p_k = 1 \tag{B.2} \]

\[ \sum_{k=k_{\min}}^{k_{\max}} k\, p_k = \frac{2 N_l}{N_n}. \tag{B.3} \]
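The entropy of Eq. (B.1) and the constraints (B.2) and (B.3) can be evaluated for an empirical degree sequence, as in the minimal sketch below. Two assumptions are made here for illustration: $p_k$ is taken to be the plain empirical frequency of degree $k$ (the escort construction is not reproduced), and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def degree_probabilities(degree_list):
    """Empirical p_k: fraction of vertices with exactly k links (assumed form)."""
    n = len(degree_list)
    return {k: c / n for k, c in sorted(Counter(degree_list).items())}


def tsallis_entropy(p, q):
    """S_q = (1 - sum_k p_k^q) / (q - 1); the BGS entropy is used at q = 1."""
    probs = [pk for pk in p.values() if pk > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(pk * math.log(pk) for pk in probs)
    return (1.0 - sum(pk ** q for pk in probs)) / (q - 1.0)


# Hypothetical toy network: 4 vertices, 4 links, degrees 3, 2, 2, 1.
degree_list = [3, 2, 2, 1]
n_links = 4

p = degree_probabilities(degree_list)
mean_degree = sum(k * pk for k, pk in p.items())

print(sum(p.values()))                              # normalization, Eq. (B.2)
print(mean_degree, 2 * n_links / len(degree_list))  # both sides of Eq. (B.3)
print(tsallis_entropy(p, 2.0))
print(tsallis_entropy(p, 1.0))
```

Evaluating `tsallis_entropy` at a value of `q` very close to 1 gives nearly the same number as the `q = 1` branch, which is a quick numerical check of the BGS limit stated above.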

Under these assumptions, we obtain a well known modification of the Zipf–Mandelbrot power law distribution, which is called the q-exponential distribution in the context of nonextensive statistical mechanics:

\[ p_k \approx \left[ 1 - (1 - q) \beta' k \right]^{\frac{1}{1-q}} \tag{B.4} \]

where $\beta'$ is the Lagrange parameter associated with the expectation value constraint. The q-exponential function tends to the ordinary exponential function in the $q \to 1$ limit. In real cases, step-like noise is often seen in the tail of the probability distribution, because rare events share the same probability value. When empirical data are fitted with a model function, the noisy tail leads to unreliable results. To avoid the irregularities, instead of working with escort probability of a certain


state of the system under consideration, we focus on the probability that an outcome is larger than a certain state, $P_k$, which is called the cumulative distribution:

\[ P_k = \sum_{k'=k}^{k_{\max}} p_{k'} = \left[ 1 - (1 - q) \beta' k \right]^{\frac{1}{1-q} + 1}. \tag{B.5} \]

The exact result of the summation in Eq. (B.5) is not a differentiable function. Therefore, we use its continuum limit approximation. The cumulative distribution of the escort probability distribution still has the q-exponential form. Degree distribution data are fitted better by the q-exponential function than by the ordinary power law function. Accordingly, we have used the nonextensivity measure in the authorship attribution process.
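To make the cumulative form concrete, the sketch below builds an empirical cumulative distribution from a degree sequence and evaluates the model form of Eq. (B.5). It is a hedged illustration, not the authors' fitting procedure: the function names, the toy degree sequence and the parameter values are hypothetical, the normalization prefactor is omitted as in Eq. (B.5), and the cut-off to zero for a non-positive base (reachable only for $q < 1$) is an assumed convention.

```python
import math
from collections import Counter


def empirical_cumulative(degree_list):
    """P_k: fraction of vertices whose degree is at least k, for each observed k."""
    n = len(degree_list)
    counts = Counter(degree_list)
    cumulative = {}
    running = 0
    for k in sorted(counts, reverse=True):
        running += counts[k]
        cumulative[k] = running / n
    return dict(sorted(cumulative.items()))


def q_exp_cumulative(k, q, beta):
    """Model form of Eq. (B.5): [1 - (1 - q) beta k]^(1/(1 - q) + 1)."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(-beta * k)  # ordinary exponential in the q -> 1 limit
    base = 1.0 - (1.0 - q) * beta * k
    if base <= 0.0:
        return 0.0  # assumed cut-off; only reachable for q < 1
    return base ** (1.0 / (1.0 - q) + 1.0)


# Hypothetical degree sequence and illustrative (not fitted) parameters.
degree_list = [3, 2, 2, 1]
print(empirical_cumulative(degree_list))
for k in (1, 2, 3):
    print(k, q_exp_cumulative(k, q=1.3, beta=0.5))
```

Because the cumulative sum runs from the tail inward, repeated probabilities of rare high-degree vertices are accumulated rather than plotted individually, which is what smooths the step-like tail. A fit of `q` and `beta` to the empirical values could then be done with any least-squares routine.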

References

[1] S. Bornholdt, H.G. Schuster (Eds.), Handbook of Graphs and Networks, Wiley-VCH, Weinheim, 2003.
[2] N. Boccara, Modeling Complex Systems, Springer, New York, 2004.
[3] G.K. Zipf, Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology, Addison-Wesley, Cambridge, 1949.
[4] H.S. Heaps, Information Retrieval: Computational and Theoretical Aspects, Academic Press, New York, 1978.
[5] D.R. Radev, R. Mihalcea, AI Mag. (Fall 2008) 16.
[6] L. Antiqueira, O.N. Oliveira Jr., L. da F. Costa, M. das G.V. Nunes, Inform. Sci. 179 (2009) 584.
[7] J. Otterbacher, G. Erkan, D.R. Radev, Inf. Process. Manage. 45 (2009) 42.
[8] J.T. Stevanak, D.M. Larue, L.D. Carr, 2010. arXiv:1007.3254v2.
[9] L. Antiqueira, M. das G.V. Nunes, O.N. Oliveira Jr., L. da F. Costa, Physica A 373 (2007) 811.
[10] A. Mehri, A.H. Darooneh, Phys. Rev. E 83 (2011) 056106.
[11] A. Mehri, A.H. Darooneh, Physica A 390 (2011) 3157.
[12] F. Mosteller, D.L. Wallace, J. Amer. Statist. Assoc. 58 (1963) 275.
[13] P. Juola, Foundations and Trends in Information Retrieval 1 (2006) 233.
[14] H. Love, Attributing Authorship: An Introduction, Cambridge University Press, Cambridge, 2002.
[15] G.R. McMenamin, Forensic Linguistics: Advances in Forensic Stylistics, CRC Press, Boca Raton, 2002.
[16] P. Craiger, S. Shenoi (Eds.), Advances in Digital Forensics III, Springer, New York, 2007.
[17] S. Havlin, Physica A 216 (1995) 148.
[18] D. Benedetto, E. Caglioti, V. Loreto, Phys. Rev. Lett. 88 (2002) 048702.
[19] A.C.-C. Yang, C.-K. Peng, H.-W. Yien, A.L. Goldberger, Physica A 329 (2003) 473.
[20] T.J. Putniņš, D.J. Signoriello, S. Jain, M.J. Berryman, D. Abbott, Proc. SPIE 6039 (2005) 163.
[21] O.A. Rosso, H. Craig, P. Moscato, Physica A 388 (2009) 916.
[22] R. Zheng, J. Li, H. Chen, Z. Huang, JASIST 57 (2006) 378.
[23] M. Koppel, J. Schler, S. Argamon, JASIST 60 (2009) 9.
[24] E. Stamatatos, JASIST 60 (2009) 538.
[25] L. Antiqueira, T.A.S. Pardo, M. das G.V. Nunes, O.N. Oliveira Jr., Inteligencia Artificial 11 (2007) 51.
[26] http://ganjoor.net/.
[27] C.D. Manning, H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, Cambridge, 1999.
[28] D.L. Olson, D. Delen, Advanced Data Mining Techniques, Springer-Verlag, Berlin, 2008.
[29] A.H. Darooneh, B. Rahmani, Eur. Phys. J. B 70 (2009) 287.
[30] M. Gell-Mann, C. Tsallis (Eds.), Nonextensive Entropy: Interdisciplinary Applications, Oxford University Press, New York, 2004.
[31] http://tsallis.cat.cbpf.br/biblio.htm/.
[32] C. Tsallis, Introduction to Nonextensive Statistical Mechanics, Springer, Berlin, 2009.