Automatic clustering of construction project documents based on textual similarity




Automation in Construction 42 (2014) 36–49



Mohammed Al Qady ⁎, Amr Kandil 1

School of Civil Engineering, Purdue University, West Lafayette, IN 47907-2051, United States

⁎ Corresponding author at: 2775 Windwood Dr. #178, Ann Arbor, MI 48105. Tel.: +1 217 4196419.

E-mail addresses: [email protected] (M. Al Qady), [email protected] (A. Kandil).

1 Tel.: +1 765 494 2246.

http://dx.doi.org/10.1016/j.autcon.2014.02.006
0926-5805/© 2014 Elsevier B.V. All rights reserved.

Article info
Article history: Accepted 8 February 2014. Available online xxxx.
Keywords: Document management; Single pass clustering; Supervised/unsupervised learning methods

Abstract
Text classifiers, as supervised learning methods, require a comprehensive training set that covers all classes in order to classify new instances. This limits the use of text classifiers for organizing construction project documents since it is not guaranteed that sufficient samples are available for all possible document categories. To overcome the restriction imposed by the all-inclusive requirement, an unsupervised learning method was used to automatically cluster documents together based on textual similarities. Repeated evaluations using different randomizations of the dataset revealed a region of threshold/dimensionality values of consistently high precision values and average recall values. Accordingly, a hybrid approach was proposed which initially uses an unsupervised method to develop core clusters and then trains a text classifier on the core clusters to classify outlier documents in a consequent refinement step. Evaluation of the hybrid approach demonstrated a significant improvement in recall values, resulting in an overall increase in F-measure scores.


1. Introduction

Automatic classification of documents as a supervised learning method requires a set of class labels and samples of each class in order to conduct the learning process before being able to perform predictions for new document instances. Usually, the classification procedure assumes that the classes are all inclusive (that they make a complete set of all the possible outcomes for any new instance) and that they are mutually exclusive (any new instance can belong to one and only one class). Where classes are static and predefined, the use of text classifiers for automatically organizing documents is appropriate. Documents are traditionally organized in construction projects according to fixed, abstract categories based on document metadata [1]. Examples of studies investigating the use of automatic text classification of construction documents include identifying the corresponding project division for minutes of meeting items [2] and classifying product documents to their relevant division in a construction information classification system [3].



While traditional methods of organizing construction project documents are simple and easy to use, they are not very useful for information retrieval unless the information seeker has thorough knowledge of the document body [1]. Information regarding a researched knowledge topic is almost always distributed over multiple categories, thus requiring understanding of document content, not just metadata, to determine the relevancy of a document to the researched topic; a time-consuming task that entails the application of human semantic capabilities. Also, the above-mentioned restrictions that constrain the use of classifiers do not apply with unsupervised methods: unsupervised methods do not require previous identification of all possible classes, nor are they trained from sample data. The objective of this study is to evaluate the performance of an unsupervised learning text analysis technique in organizing project documents into groups of semantically similar documents, each group defined by its relation to a specific searchable knowledge topic. It is hypothesized that textual similarity between project documents accurately reflects semantic relationships between the documents and, when applied in document management and information retrieval tasks, can achieve results comparable to what humans recognize using their semantic capabilities. In the next section, the text analysis technique used in the study is presented along with several of its applications in previous works. Then the methodology implemented for the evaluation is presented, followed by a detailed analysis of the results. The study is concluded with a summary of the main results and a discussion on practical uses and limitations of implementing the proposed technique.



2. Clustering

Research on clustering methods for information retrieval dates back to the second half of the twentieth century. The main objective of clustering is to provide structure to a large dataset by organizing similar data together, thus facilitating search and retrieval tasks. Clustering methods can be categorized according to the structure they generate into flat clustering and hierarchical clustering [4]. With flat (or non-hierarchical) clustering, the dataset is divided into a number of subsets of highly similar elements, dissimilar from elements in other clusters, with no relationship between the different clusters. The main advantage of this simple structure is low computational complexity in comparison with the more sophisticated hierarchical clustering methods. With hierarchical clustering, a complex structure of nested clusters is produced from the dataset. This is done either using a bottom-up approach, in which clusters start as individual items and pairs of similar items are joined together to form clusters, which are then joined together in successive steps until a single hierarchy is formed of the complete dataset. This approach is called agglomerative hierarchical clustering and is more popular than the top-down approach, in which the whole dataset is considered one cluster and is successively broken down into pairwise clusters until the level of the individual items is reached (also referred to as divisive hierarchical clustering). Flat clustering techniques include K-means and single pass clustering, while agglomerative hierarchical clustering techniques include single-link, complete-link and group-average. In terms of the exclusivity of cluster membership, clustering algorithms can be divided into hard clustering and soft clustering algorithms. In the former, membership of the items is limited to only one cluster. In the latter, the degree of association of each item to each cluster formed is determined [5].

Clustering was used in applications in many fields, including organizing patient data in the medical field, classification of species in biological taxonomies and studying census and survey responses [4]. Clustering has a wide range of applications for data management in civil engineering. In the field of structural system identification, Saitta et al. [6] used K-means clustering to narrow down the number of candidate structural models in order to identify the best model that reflects actual sensor measurements of a structure. Principal component analysis was used to enable visualization of the various possible model clusters based on the most relevant model parameters. Cheng and Teizer [7] implemented clustering to identify objects from point cloud data of a laser scanner in order to enhance visibility for tower crane operators for safer hoisting operations. The DBSCAN algorithm was used for clustering. Similar to single pass clustering, DBSCAN starts with a randomly selected data point and successively forms clusters based on two user-defined parameters: maximum allowable distance from the chosen point and minimum cluster size.

In data mining of databases, Ng et al. [8] used K-means clustering to automatically group similar facility condition assessment reports of university facilities to investigate the relationship between reported deficiencies and facility types. A qualitative evaluation was used to verify the results of the investigation. Raz et al. [9] investigated the use of multiple techniques, including clustering, for developing models of good quality truck weigh-in-motion traffic data in order to facilitate identification of data anomalies. Two clustering techniques were investigated, K-means and Rectmix, a soft clustering algorithm. Implementation of the proposed mechanism by a domain expert was used to evaluate the accuracy and usefulness of the mechanism.

Clustering techniques were applied for defect detection from images in several studies, including the detection of potentially defective regions in wastewater pipelines [10] and the detection of rust in steel bridges to support decisions regarding bridge painting activities [11]. In the former study, region-growing segmentation – an application of single pass clustering to image data – was implemented for detecting defects in pipes using image analysis. Evaluation was performed based on the comparison of the results of the proposed technique with the inspection reports of a certified inspector using the following metrics: accuracy, recall and false alarm rate. In the latter study, the researchers highlight the limitations of K-means clustering for detecting rust in grayscale images of bridge members, namely: irregular illumination of images, low-contrast images that obscure rust areas, and debris on bridge members that creates noise in the image analysis.

Clustering was widely used in image and video identification/processing. Brilakis et al. [12] developed a framework for managing digital images of construction sites. The framework divides an image into clusters that represent different construction materials in the image and uses the cluster features to identify the material from a database of material signatures. Evaluation was performed by testing the correctness of identification of five different construction materials in terms of precision, recall and effectiveness. The researchers describe the high accuracy of the bottom-up clustering technique implemented in the method. In video image processing, several studies utilized clustering to develop a codebook – or dictionary – of actions and/or poses used for comparing, identifying and classifying motions of workers in a construction activity [13,14]. In both studies, K-means clustering was used to limit the multitude of possible actions into a fixed set of poses. For evaluation, a supervised learning algorithm was applied to classify the motions of workers on a test video based on the developed codebook, and performance was determined based on accuracy of classification.

Several observations are noted from the above review. The majority of the studies utilized a flat, hard clustering approach. Generally, the required application dictates the choice of an appropriate clustering method: when the number of resulting clusters is known – or can be reasonably inferred – K-means clustering is an appropriate method (as in the case of detecting dark colored defect areas in gray-scale images), while when multiple associations are feasible, a soft clustering approach is warranted. For evaluation of a clustering method and validation of the outcome, expert review was used in a number of the studies. In [4], the authors note the difficulty of evaluating clustering methods, and report that comparison between an outcome and the clusters developed by domain experts is a common method for measuring performance.

For the purpose of this study a flat, hard clustering approach is deemed appropriate, for the reasons explained in the Methodology section. Clustering functions on the same basic assumption as classification: that similar documents form clusters that do not overlap with other non-similar document clusters (also referred to as the contiguity hypothesis). However, clustering aims at identifying such document clusters without any external help from previously labeled instances (thus the unsupervised nature of the method). This is usually executed in an iterative process in which a specific procedure is repeated until a predefined condition is satisfied. Two main flat clustering techniques are reviewed below.

2.1. K-means

In K-means clustering, a number of K centroids is defined by the user and all instances in the dataset are assigned to the closest centroid (determined by Euclidean distance or cosine similarity). Then, the centroids of all K clusters are recalculated according to this assignment, resulting in new centroid positions. All instances in the dataset are re-assigned to the new centroids, and this iterative process is continued until the cluster centroids remain constant, implying that the optimal centroid positions are identified (those that minimize the distance between each instance in a specific cluster and the cluster's centroid).
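The assign-and-recalculate loop described above can be sketched as follows. This is a minimal illustrative implementation in numpy, not the tool used in the study; the function name and the choice of random instances as initial centroids are our own assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means on the row vectors of X using Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen instances (an illustrative choice).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every instance to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculate the centroids; stop when they no longer move.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Two well-separated 2-D blobs should split into two clusters.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, _ = kmeans(X, k=2)
```

Note that the outcome depends on the initial centroids, which is exactly the sensitivity discussed in the Methodology section.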



2.2. Single pass clustering

Single pass clustering, also called incremental clustering, generates one cluster at a time using a predefined threshold value. The threshold represents the user's perception of acceptable proximity, e.g. the minimum acceptable cosine similarity between instances and the cluster centroid, or the maximum acceptable Euclidean distance between the instances and the cluster centroid. Starting with a random instance, the closest instance in the dataset that satisfies the threshold is identified and added to the cluster, and the cluster centroid is recalculated. The process is repeated with the new cluster centroid until no instances remain that satisfy the threshold, thereby finalizing the first cluster. An unclustered instance is then selected at random and the process is repeated with the remaining instances in the dataset to sequentially create new clusters until no unclustered instances remain (or until only unclustered instances remain that do not meet the threshold standard with any of the formed clusters, thereby forming single-instance clusters).
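The procedure above can be sketched as a short Python function; this is an illustrative minimal version (helper names are ours, not the study's code), using cosine similarity as the proximity measure:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def single_pass(docs, threshold):
    """Single pass clustering of document vectors (rows of `docs`).

    Each cluster is seeded with the first unclustered document; the most
    similar unclustered document that satisfies the threshold is added and
    the centroid is recalculated, until no candidate qualifies.
    """
    unclustered = list(range(len(docs)))
    clusters = []
    while unclustered:
        seed = unclustered.pop(0)          # seed a new cluster
        members = [seed]
        centroid = docs[seed].astype(float)
        while True:
            # Find the most similar remaining document to the centroid.
            best, best_sim = None, -1.0
            for j in unclustered:
                s = cosine(centroid, docs[j])
                if s > best_sim:
                    best, best_sim = j, s
            if best is None or best_sim < threshold:
                break                       # finalize this cluster
            members.append(best)
            unclustered.remove(best)
            centroid = np.mean([docs[m] for m in members], axis=0)
        clusters.append(members)
    return clusters

# Two directions in 2-D term space separate cleanly at threshold 0.8.
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
clusters = single_pass(docs, threshold=0.8)
```

A stricter (higher) threshold would fragment these clusters, illustrating the threshold/cluster-size trade-off discussed later in the Methodology.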

3. Methodology

Since the objective is to organize construction project documents into semantically related groups, a hierarchical clustering structure is not warranted, especially given the associated computational complexity of agglomerative clustering. For the current task, flat clustering is more suitable and economical. The use of K-means requires predefining the number of clusters (cardinality) before implementing the algorithm. It is up to the users to judge cardinality based on their knowledge of the domain topic of the dataset. In reality, the number of clusters has a significant impact on the results. Leaving this decision subject to the user's judgment detracts from the automated nature of the task and adds a high degree of subjectivity to the process. In addition to cardinality, the choice of the initial centroids greatly impacts the clustering results. K-means essentially tries out various clustering outcomes looking for the optimal outcome. While it is highly unlikely that all possible outcomes will be tested, the fact remains that a misguided selection of the number and position of the initial centroids may unnecessarily prolong the process or, even more critically, result in a locally optimal clustering outcome instead of the global optimum. On the other hand, single pass clustering does not require definition of cardinality by the user, but requires determination of a threshold defining the boundary of similarity between documents and cluster centroids. Single pass clustering has been criticized for producing varying cluster outcomes depending on which instances are selected to initiate the clusters and for the tendency to produce large clusters in the first pass. In this study, single pass clustering was used to automatically cluster project documents. In order to overcome the limitations imposed by single pass clustering, several factors were evaluated to assess their effect on clustering performance. The first factor evaluated was the effect of the threshold value on clustering accuracy. The predefined value of the threshold has a significant impact on the clustering result: a stricter threshold decreases cluster size (thereby increasing the total number of clusters) while a less strict threshold results in fewer clusters of larger sizes.

Predefining a threshold value must be viewed within the context of a specific dataset. Success of single pass clustering is understandably dependent on the extent to which a specific dataset satisfies the contiguity hypothesis. Groups of relatively similar instances that are disjoint from other groups of instances make defining a threshold that highlights these groups possible. Overlapping groups of instances defy any attempt at accurate clustering, regardless of the value of the threshold. Accordingly, it is the ability to magnify the similarity between same-class instances and the dissimilarity between different-class instances that ultimately contributes to clustering performance. One way to achieve this is by using a term weighting method that best depicts this attribute in the dataset; accordingly, two different weighting methods were evaluated as explained below. Another way is to experiment with dimensionality reduction of the dataset's term–document matrix (t–d matrix), relying on the ability of latent semantic analysis (LSA) to reveal the hidden similarities among the dataset's instances.

In [15], a successive evaluation approach that implements LSA was used to automatically classify documents of a small dataset of 17 documents made up of two classes. The results showed that the difference between the average similarities of same-class documents and the average similarities of different-class documents significantly increased when dimensionality reduction was applied using the optimum dimensionality factor. This suggests a polarizing effect for LSA which can be used to improve clustering results. The use of LSA implies specifying a certain dimensionality factor for the reduction step, and as such the optimum dimensionality factor (l_opt) is defined in this study as the one that results in the highest clustering accuracy. A thorough discussion of LSA is found in [16] and a simple example demonstrating LSA's potential in text analysis is given in Appendix A.
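The reduction step itself is a truncated singular value decomposition: the t–d matrix is reconstructed from its l largest singular values. A minimal numpy sketch (the function name is ours, not the paper's):

```python
import numpy as np

def lsa_reduce(td, l):
    """Rank-l LSA reconstruction of a term-document matrix.

    td : terms x documents matrix; l : dimensionality factor.
    Returns a matrix of the same shape in which only the l strongest
    latent dimensions are retained, emphasizing hidden similarities
    between documents.
    """
    U, s, Vt = np.linalg.svd(td, full_matrices=False)
    return U[:, :l] @ np.diag(s[:l]) @ Vt[:l, :]

# A 6-term x 4-document matrix reduced to 2 latent dimensions.
td = np.random.default_rng(1).random((6, 4))
reduced = lsa_reduce(td, 2)
```

Cosine similarities computed on the columns of the reduced matrix are what feed the clustering step.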

The methodology follows four main steps: 1) collecting the dataset, 2) randomizing and pre-processing, 3) developing the t–d matrix, and 4) clustering and evaluation.

3.1. Collecting the dataset

Seventy-seven project documents related to eight construction claims make up the dataset for evaluating the developed technique. All eight claims originated from one project for the construction of an international airport with a total value of work exceeding $50 million. The majority of the documents are correspondences between the main contractor and the project engineer, detailing the factual events related to each claim. Collected and organized by the contract administrator of the project, the supporting documents for each claim are a representation of a group of semantically similar documents, related together by their association to a specific searchable claim topic. The evaluation aims at quantitatively identifying the performance of the proposed technique in organizing the complete dataset into the correct document groups, or clusters, without implementation of the learning step that characterizes supervised learning techniques. Individual cluster size in the dataset varies from a maximum of 22 documents to a minimum of five, and each document belongs to only one cluster.

3.2. Randomizing and pre-processing

This step includes the tasks of tokenizing, removal of stop-words and frequency calculation. The outcome of this step is to represent the documents in the dataset as vectors of varying sizes corresponding to the features – or terms – in each document and the frequency of occurrence of each feature. The order of the documents in the dataset is randomized from the start to measure the consistency of the clustering outcomes.

3.3. Developing the t–d matrix

The term–document matrix (t–d matrix) is the input required by the clustering algorithm in the final step of the methodology. It is a compilation of all the document vectors into one matrix in which the columns represent the documents in the dataset and the rows represent the vocabulary of the dataset. First, the vocabulary is compiled from the document vectors and the frequency of each term across all documents is recorded. Then the t–d matrix is developed based on the randomized document order, in which matrix elements are calculated according to the specified term weighting method. In this study, two popular term weighting methods were studied for evaluation of clustering performance:

• Term frequency (tf): the elements of the matrix represent the frequency of occurrence of the term – identified by the matrix row – in the specific document, identified by the matrix column.



• Term frequency inverse document frequency (tf–idf): modifies term frequency based on the assumption that high occurrence terms across the dataset are poor indicators of clusters. Term frequency inverse document frequency is calculated according to Eq. (1), where n is the number of documents in the dataset and d is the number of documents containing the specific term being evaluated:

tf–idf = tf × log(n / d).  (1)
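The two weighting schemes can be illustrated with a short sketch that builds a t–d matrix from tokenized documents, applying Eq. (1) for tf–idf. The helper is hypothetical, not the study's implementation:

```python
import math
from collections import Counter

def build_td_matrix(docs, weighting="tf-idf"):
    """Build a term-document matrix from tokenized documents.

    docs : list of token lists. Rows correspond to vocabulary terms,
    columns to documents. With "tf-idf", each cell is tf * log(n / d)
    as in Eq. (1), where n is the number of documents and d is the
    number of documents containing the term.
    """
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency d for every term.
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    matrix = []
    for t in vocab:
        if weighting == "tf-idf":
            row = [c[t] * math.log(n / df[t]) for c in counts]
        else:  # plain term frequency
            row = [c[t] for c in counts]
        matrix.append(row)
    return vocab, matrix

docs = [["claim", "delay"], ["claim", "cost"], ["delay", "cost"]]
vocab, m = build_td_matrix(docs)
```

A term occurring in every document gets weight log(n/n) = 0, reflecting the assumption that such terms carry no clustering signal.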

3.4. Clustering and evaluation

To evaluate the effects of the dimensionality factor and the threshold value, single pass clustering is performed on the randomized dataset using varying combinations of both factors. Since the dataset is 77 documents, the dimensionality factor (l) ranges from a minimum of three dimensions (to minimize computational cost) to a maximum of 77 (l_max is the special case constituting the original t–d matrix, i.e. dimensionality reduction is not applied). The threshold value (h) represents the minimum acceptable similarity limit between a document and a cluster centroid that makes the document a candidate for inclusion in the cluster. Similarity between a cluster centroid (Cc) and a document (d) is calculated by cosine similarity using Eq. (2). Similarity theoretically ranges from a minimum of zero (signifying complete dissimilarity) to

Fig. 1. Clustering and evaluation for a certain weighting method. [Flowchart: for each dimensionality factor l (from l_min to l_max) and each threshold value h (from h_min to h_max), the reduced t–d matrix X̂ is reconstructed; each cluster is grown from a seed document by repeatedly adding the unclustered document d_j most similar to the cluster centroid Cc that satisfies the threshold, recalculating Cc after each addition; the resulting clustering is then evaluated.]

a maximum of one (signifying complete similarity). The threshold value was varied over the range [0.05, 0.95] with a step of 0.01. Maximum and minimum values for the factors were set based on experimentation, to minimize unnecessary computational cost without overlooking significant results.

sim(d, Cc) = (d · Cc) / (|d| |Cc|).  (2)

For a certain dimensionality/threshold combination, clustering commences by considering the first document in the reconstructed t–d matrix as the centroid of the first cluster, identifying the closest document to the cluster that satisfies the threshold, recalculating the centroid and repeating the process. When no documents satisfying the condition remain, a new cluster is initiated using the first unclustered document in the dataset as the centroid of the new cluster, and the process is repeated until all documents are either assigned to a cluster or cannot be assigned to any cluster and consequently form a separate single-document cluster. Clustering is illustrated in Fig. 1. A t–d matrix (X̂) developed using a specific weighting method from a randomized dataset undergoes clustering 6825 times, corresponding to all possible dimensionality/threshold combinations, and clustering accuracy is calculated after each run to determine the best performance.
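The exhaustive sweep over dimensionality/threshold combinations can be sketched as follows, with the clustering and evaluation callables passed in as stand-ins for the single pass clustering routine and the F-measure computation (all names here are illustrative assumptions, not the study's tool):

```python
import numpy as np

def sweep(td, cluster_once, evaluate, l_min=3, h_step=0.01):
    """Grid search over the dimensionality factor l and the threshold h.

    td           : terms x documents matrix
    cluster_once : callable(reduced_matrix, threshold) -> clustering outcome
    evaluate     : callable(outcome) -> accuracy score (e.g. F-measure)
    Returns the best (score, l, h) triple found.
    """
    n_docs = td.shape[1]
    U, s, Vt = np.linalg.svd(td, full_matrices=False)
    best = (-1.0, None, None)
    for l in range(l_min, n_docs + 1):
        # LSA reconstruction at dimensionality l (l = n_docs: no reduction).
        reduced = U[:, :l] @ np.diag(s[:l]) @ Vt[:l, :]
        for h in np.arange(0.05, 0.95 + 1e-9, h_step):
            score = evaluate(cluster_once(reduced, h))
            if score > best[0]:
                best = (score, l, round(float(h), 2))
    return best

# Toy demonstration with stand-in callables: the "score" is simply the
# threshold itself, so the best combination must sit at h = 0.95.
td = np.eye(4)                       # 4 terms x 4 documents
calls = []
def cluster_stub(matrix, h):         # stands in for single pass clustering
    calls.append(h)
    return h
best_score, best_l, best_h = sweep(td, cluster_stub, lambda c: c)
```

For the study's 77-document dataset this grid is 75 dimensionality values × 91 threshold values = 6825 clustering runs, matching the count given above.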




3.5. Clustering measures

Several clustering measures (methods for evaluating the clustering outcomes) are presented in [5]. A simple measure is purity, calculated by Eq. (3), where u_i represents a specific cluster from an outcome of i clusters, c_j represents a specific class from a number of j classes, N is the total number of instances in the dataset and count(u_i, c_j) is the number of instances belonging to class c_j in cluster u_i. Purity is the summation across all clusters of the number of instances of the class with the highest representation in each individual cluster, divided by the total number of instances in the dataset.

Purity = (1/N) Σ_i max_j [count(u_i, c_j)].  (3)

Purity has a range of (0, 1], where poor clustering results in low purity values and good clustering results in unity. One drawback of purity is that an outcome of fragmented clusters containing same-class instances will also result in a perfect purity score. For example, for an extreme result where each instance in the dataset is defined as a single-instance cluster, the result will be a perfect purity measure. The evaluation metric must fairly balance between the number of resulting clusters and the performance rating. This is particularly important for the current dataset, which exhibits large variations in the size of the different classes.
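Eq. (3) and its fragmentation drawback can be illustrated with a short sketch (hypothetical helper, not the study's code):

```python
from collections import Counter

def purity(clusters, labels):
    """Purity per Eq. (3): for each cluster, count the instances of its
    best-represented class, sum over all clusters and divide by N.

    clusters : list of lists of instance indices
    labels   : true class label of each instance
    """
    n = sum(len(c) for c in clusters)
    majority = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
                   for c in clusters)
    return majority / n

labels = ["A", "A", "A", "B", "B"]
# One pure cluster and one mixed cluster: (2 + 2) / 5.
mixed = purity([[0, 1], [2, 3, 4]], labels)
# A fully fragmented clustering is trivially "perfect".
fragmented = purity([[0], [1], [2], [3], [4]], labels)
```

The second call returns 1.0 even though the clustering is useless, which is exactly why F-measure is preferred below.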

The measure used for evaluating the clustering outcome is the F-measure. Clusters must first be decomposed into binary associations of cluster members that are indicative of the cluster's composition. This is performed for the outcome generated from the clustering process, which is then compared, using precision (P), recall (R) and F-measure, with the binary associations generated from the true clusters. The range for the above evaluation method is also [0, 1]. What distinguishes this method from purity is the balancing effect it provides as a result of combining precision and recall. The number of pairwise relationships resulting from a cluster made up of n instances is equal to n(n − 1) / 2. Accordingly, an outcome of a few large clusters generates a larger number of relationships than an outcome of many small clusters. In the extreme case

Fig. 2. Automatic document clus

where all instances in the dataset are grouped in one cluster, 100% recallis achieved (since such clustering contains all possible pairwise combi-nations) however precision greatly deteriorates from an excess of false-positive combinations. At the other end of the spectrum in case of anoverly fragmented clustering outcome, precision is boosted if the clus-ters contain same-class instances, however recall is negatively affectedas a result of a large number of missed (false-negative) combinations.If the clusters are mainly composed of different-class instances, thenboth precision and recall values are low. In all these scenarios, thecombined F-measure score represents a balanced evaluation of theclustering outcome.
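The pairwise decomposition described above can be sketched as follows (a minimal illustration with invented document ids; it reproduces the two extreme cases just discussed):

```python
from itertools import combinations

def pair_set(clusters):
    """All unordered same-cluster pairs; a cluster of n members yields
    n(n - 1)/2 pairwise relationships."""
    return {frozenset(p) for members in clusters for p in combinations(members, 2)}

def pairwise_prf(predicted, true):
    """Precision, recall and F-measure over the binary co-membership
    associations of the predicted clustering vs. the true clustering."""
    pred, gold = pair_set(predicted), pair_set(true)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

true = [["a1", "a2", "a3"], ["b1", "b2"]]
# One big cluster: recall is 1.0 (every true pair recovered), precision suffers.
print(pairwise_prf([["a1", "a2", "a3", "b1", "b2"]], true))
# Fully fragmented outcome: no pairs at all, so recall (and F) collapse to 0.
print(pairwise_prf([["a1"], ["a2"], ["a3"], ["b1"], ["b2"]], true))
```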

3.6. Evaluation tool

Fig. 2 illustrates the evaluation tool developed for performing clustering and evaluating the clustering outcomes. The user defines the location of the dataset and the documents are retrieved and preprocessed as explained before. The evaluation tool allows the user to control the randomization of the dataset according to a user-defined seed in order to investigate the effect of document sequence on clustering outcomes. The user also has the ability to specify the following clustering options: dimensionality factor, threshold value and weighting method. At the end of a clustering run, details of the most accurate clustering outcome are displayed in a separate window and detailed results showing clustering performance at various combinations of threshold values and dimensionality factors are generated in a separate file.

4. Results and analysis

A better understanding of the clustering performance is achieved by adopting a baseline to compare the results with. A baseline gives perspective to the results by representing the lower boundary below which results are considered meaningless and unacceptable. The probability of a random correct result is a common criterion used in classification evaluations for specifying a baseline. However, using the random approach for evaluating clustering performance will grossly underestimate the baseline. The number of possible cluster outcomes for n


[Figure: F-measure vs. threshold value (h), one panel per weighting method (tf and tf–idf), with optimum and baseline curves.]

Fig. 3. F-measure scores for lmax and lopt—average over ten trial runs.

M. Al Qady, A. Kandil / Automation in Construction 42 (2014) 36–49

instances grouped into K clusters is a Stirling number of the second kind S(n, K) [5], calculated using Eq. (4). The number of possible outcomes in case the number of clusters is unknown is therefore $\sum_{K=1}^{n} S(n, K)$, the summation of the Stirling number for all possible values of K, where K ranges from one (in case all instances are grouped into one group) to n (in case each instance is grouped alone in a single-instance cluster).

$$S(n, K) = \frac{1}{K!}\sum_{j=0}^{K}(-1)^{j}\binom{K}{j}(K-j)^{n} \qquad (4)$$
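Eq. (4) is easy to check numerically with exact integer arithmetic; the sketch below reproduces the order of magnitude quoted in the next paragraph for organizing 77 documents into eight groups:

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind, Eq. (4):
    S(n, k) = (1/k!) * sum_{j=0}^{k} (-1)^j * C(k, j) * (k - j)^n."""
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)  # the sum is always divisible by k!

# 77 documents into 8 groups: on the order of 8.6e64 possible outcomes.
print(f"{stirling2(77, 8):.2e}")
```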

For the current dataset, the total number of possible outcomes is extremely large, making the possibility of a random correct cluster null. Even if the problem is simplified by assuming that the correct number of classes is known, the number of possible outcomes for organizing 77 objects into eight groups is 8.6 × 10^64, which is still very large. The random assumption accordingly defies the purpose of using a baseline. The baseline adopted for this task is the clustering results achieved

[Figure: intensity grids of F-measure scores (threshold value h vs. dimensionality factor l), one grid per weighting method (tf and tf–idf).]

Fig. 4. Intensity grid of F-measure scores—average over ten trial runs.

using lmax, i.e. without any dimensionality reduction. A comparison between the clustering performance and the baseline highlights the improvement in performance resulting from applying LSA to single pass clustering. If results consistently fall below the adopted baseline, that does not necessarily indicate that they are meaningless, but that the proposed procedure does not offer a positive contribution to clustering.
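As a rough illustration of the single pass procedure evaluated here, the sketch below assigns each document to the most similar existing cluster if its cosine similarity to the cluster centroid reaches the threshold h, and otherwise seeds a new cluster. The exact assignment rule of the original tool is not restated in this section, so treat the centroid comparison as an assumption of the sketch.

```python
import numpy as np

def single_pass_cluster(vectors, h):
    """Single pass clustering sketch: each document (row of `vectors`) is
    compared by cosine similarity against the centroid of every existing
    cluster; it joins the most similar cluster if similarity >= h,
    otherwise it seeds a new cluster. The outcome is order-dependent."""
    clusters = []  # list of lists of row indices
    for i, v in enumerate(vectors):
        best, best_sim = None, h
        for c, members in enumerate(clusters):
            centroid = vectors[members].mean(axis=0)
            sim = v @ centroid / (np.linalg.norm(v) * np.linalg.norm(centroid) + 1e-12)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
        else:
            clusters[best].append(i)
    return clusters

rng = np.random.default_rng(0)
X = rng.random((6, 4))
print(single_pass_cluster(X, h=0.9))
```

Shuffling the rows of `X` with different seeds and re-running illustrates the order sensitivity discussed next.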

Single pass clustering is prone to inconsistent outcomes depending on the order of documents used in the clustering step. To ensure a representative value for clustering performance, the document order was randomized using different seed values and the clustering performance was evaluated for the different document sequences. Over ten trial runs, the highest average F-measure score achieved using the tf weighting method was 0.68 at an optimum dimensionality factor of 13 and a threshold of 0.69, while the highest average F-measure score achieved using the tf–idf weighting method was 0.75 at lopt = 56 and h = 0.24. Fig. 3 presents the variation of average F-measure scores across all threshold values for two specific dimensionality factors: lmax (the baseline condition) and lopt (the highest average F-measure score achieved using the respective weighting method). For both weighting methods,



the baseline's performance is better at the small threshold values but gradually declines after the peak and is eventually surpassed by the optimum's performance. For the tf weighting method, this shift occurs midway through the range of threshold values, while for tf–idf it occurs at the low threshold value of 0.2.

For the tf method, the optimum dimensionality factor records an average improvement over the baseline of 7.6%, and an 11.5% improvement in peak performance. For the tf–idf method, the average improvement is 7% with a 5% improvement of the peak performance. These results highlight the importance of identifying the optimum dimensionality factor in order to utilize LSA for improving clustering performance.
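The two weighting methods compared throughout this section can be sketched as follows. The paper's exact idf formulation is not restated here, so the natural-log variant below is an assumption of the sketch; plain tf weighting simply keeps the raw counts.

```python
import math
from collections import Counter

def tf_idf(docs):
    """tf-idf weighting sketch: w(t, d) = tf(t, d) * log(N / df(t)),
    where df(t) is the number of documents containing term t.
    (The log base and the absence of smoothing are assumptions;
    the paper does not spell out its exact idf variant.)"""
    n = len(docs)
    tfs = [Counter(doc) for doc in docs]
    df = Counter(t for tf in tfs for t in tf)
    return [{t: c * math.log(n / df[t]) for t, c in tf.items()} for tf in tfs]

docs = [["claim", "delay", "delay"], ["claim", "schedule"], ["claim", "rfi"]]
weights = tf_idf(docs)
# A term occurring in every document (df = N) gets zero weight,
# which is why tf-idf de-emphasizes boilerplate vocabulary.
print(weights[0])
```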

Fig. 4 presents intensity grids of the average F-measure scores for ten trials using both weighting methods. As can be seen from the figure, a stretch of high F-measure values (indicated by the dashed lines) is observed spanning from high l/low h values to low l/high h values (lower left corner of the grids to upper right corner of the grids). In case of the tf weighting method, this high performance front extends to the mid-threshold region, while for tf–idf it spans across the limits of both factors. These results reveal the indirect relationship between dimensionality factor and threshold values. At high dimensionality levels (where little or no reduction is applied) high clustering performance requires relatively relaxed threshold values. Reducing the number of dimensions results in improved class separation, allowing the use of a stricter similarity definition (i.e. higher threshold values), which attests to the contribution of dimensionality reduction in polarizing same-class instances in the dataset.
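The dimensionality reduction applied here is the standard LSA construction: a rank-l approximation of the term-document matrix via truncated singular value decomposition. A minimal sketch (the toy matrix is invented for the example):

```python
import numpy as np

def lsa_reduce(X, l):
    """Rank-l approximation of a term-document matrix via truncated SVD,
    as used by LSA: keep the l largest singular values and rebuild X-hat.
    Documents are then compared (e.g. by cosine similarity) on the
    columns of the reduced matrix."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :l] * s[:l] @ Vt[:l, :]

X = np.array([[2., 0., 1.],   # toy term-document matrix:
              [0., 3., 1.],   # rows = terms, columns = documents
              [1., 1., 1.]])
print(np.round(lsa_reduce(X, 2), 2))  # rank-2 approximation
```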

The regions of high F-measure scores were more prominent with the tf–idf weighting method, suggesting the superiority of this method over the tf weighting method. This observation is attributed to the method's accurate identification of relative term weights that better reflect similarities and therefore result in improved clustering performance. No specific region had an average F-measure score higher than 0.7 using the tf weighting method. For the tf–idf weighting method, two prominent regions of highest F-measure scores are apparent, one in the fifties range of dimensionality factors around a threshold of 0.25, and the other within the threshold range of [0.60, 0.70] at a dimensionality factor of 15. Table 1 identifies the maximum F-measure results achieved, and

the corresponding l and h factors, for multiple trial runs of the proposed technique. While the absolute maximum of each individual trial run varies, in general F-measure scores for the regions mentioned above were consistently high over the trial runs.

For the tf weighting method, while the combination of a dimensionality factor of 13 with a threshold of 0.69 was prevalent in most trials, the values of the F-measure score for such combinations varied significantly from a minimum of 0.69 to a maximum of 0.83. This suggests inconsistencies in the clustering results. Such inconsistencies are not apparent in the top prevalent factor combinations of the tf–idf weighting method. The most common combination is a dimensionality factor in the range [54, 57] with a threshold of 0.24, for which the F-measure scores were approximately 0.78. Another common (l, h) combination is the (15, 0.69) combination, which achieved a constant F-measure score close to 0.75.

In order to accurately check consistency of the clustering results, the actual clusters created by the different trial runs were examined. Fig. 5 displays the clusters for the highest and lowest F-measure scores achieved in the trial runs using the tf weighting method. Noting that the true number of classes is eight and the smallest class contains five documents, the resulting clusters are considered fragmented. Discrepancies are observed between the two cases of the tf weighting method in terms of the number and composition of clusters. In addition, cluster impurity is evident, not only for the low F-measure case (clusters 1, 4, 5 and 10) but also for the high score case (cluster 1). Fig. 6 displays the clusters for the highest and lowest F-measure scores achieved in the trial runs using the tf–idf weighting method. Both results are highly fragmented with a different number of clusters in each case. However, although not completely identical, cluster composition is similar, impurity is limited and clusters make a good representation of the true classes.

Examination of the precision and recall values behind the F-measure results in Table 1 offers an explanation for this observation. Average precision over all trial runs was higher for the tf–idf method, while average recall was higher for the tf method. For the tf method, values for precision and recall for each separate trial run were comparatively close, while values of precision were significantly higher than recall for the tf–idf method. This discrepancy between the two methods explains

Table 1
Results of various clustering trials.

            Term frequency                               Term frequency–inverse document frequency
Trial run   l     h      Precision  Recall  F-measure    l           h          Precision  Recall  F-measure
0           10    0.77   0.694      0.667   0.681        15          0.69       0.905      0.627   0.741
1           8     0.86   0.621      0.600   0.610        15          0.69       0.905      0.627   0.741
2           13    0.69   0.699      0.611   0.652        54–57       0.24       0.977      0.658   0.786
3           13    0.69   0.850      0.805   0.827        15          0.68–0.69  0.905      0.627   0.741
4           9     0.8    0.656      0.713   0.683        15          0.69       0.905      0.627   0.741
5           13    0.69   0.721      0.708   0.715        69–76       0.2        0.933      0.751   0.832
6           6     0.88   0.607      0.801   0.691        54–57       0.24       0.983      0.654   0.785
7           13    0.69   0.730      0.760   0.745        66–77       0.18       0.859      0.747   0.799
8           13    0.69   0.691      0.688   0.689        15          0.68–0.69  0.922      0.640   0.756
9           13    0.69   0.769      0.747   0.758        61–64, 67   0.2        0.898      0.733   0.807
10          13    0.69   0.698      0.738   0.717        66–77       0.18       0.859      0.747   0.799
11          13    0.69   0.850      0.810   0.830        54–57       0.24       0.950      0.645   0.768
12          10    0.76   0.682      0.753   0.716        53–57       0.24       0.977      0.658   0.786
13          13    0.69   0.744      0.735   0.739        54–57       0.24       0.983      0.654   0.785
14          40    0.49   0.637      0.670   0.653        54–57       0.24       0.950      0.645   0.768
15          13    0.69   0.850      0.810   0.830        54–57       0.24       0.950      0.645   0.768
16          13    0.69   0.740      0.613   0.671        69–73       0.2        0.979      0.738   0.841
17          13    0.69   0.761      0.758   0.760        54–57       0.24       0.977      0.658   0.786
18          13    0.69   0.769      0.747   0.758        61–64, 67   0.2        0.843      0.756   0.797
19          9     0.8    0.656      0.713   0.683        15          0.68–0.69  0.922      0.640   0.756
20          9     0.81   0.643      0.591   0.616        69–76       0.2        0.933      0.751   0.832
Mean        N/A   N/A    0.718      0.716   0.715        N/A         N/A        0.929      0.677   0.782
St. dev.    N/A   N/A    0.073      0.070   0.064        N/A         N/A        0.043      0.051   0.031


[Figure: cluster listings for two tf–idf outcomes; F-measure = 0.84 at (l, h) = (69, 0.20), Seed(576168), and F-measure = 0.74 at (l, h) = (15, 0.69), Seed(8).]

Fig. 6. Clustering results using tf–idf weighting method.

[Figure: cluster listings for two tf outcomes; F-measure = 0.83 at (l, h) = (13, 0.69), Seed(572639), and F-measure = 0.61 at (l, h) = (8, 0.86), Seed(8).]

Fig. 5. Clustering results using tf weighting method.

[Figure: two cluster listings; tf: P = 0.744, R = 0.735, F-measure = 0.740; tf–idf: P = 0.905, R = 0.627, F-measure = 0.741.]

Fig. 7. Two clustering outcomes similar in F-measure scores and varying in precision and recall.

[Figure: flowchart of the refinement process. Notation: ts = set of test documents (outlier documents); tr = set of training documents (clusters); n = number of outlier documents; ts(i) = a specific document in the test set; Xtr,ts(i) = t–d matrix based on training set tr and test document ts(i); l = dimensionality factor. For i = 0 to n − 1: add ts(i) to tr; determine Xtr,ts(i); reconstruct X̂ at dimensionality l; classify ts(i); remove ts(i) from tr. Finally, evaluate the final clustering.]

Fig. 8. Cluster refinement process.


[Figure: F-measure vs. trial runs for four series: Optimum–Refined, Baseline–Refined, Optimum–Original, Baseline–Original.]

Fig. 9. Comparison of F-measure results across trial runs.

Table 3
Matrix of average precision scores for baseline and optimum cases, before and after refinement.

Category     Baseline   Optimum   Difference
Original     0.774      0.929     0.155
Refined      0.637      0.819     0.182
Difference   −0.137     −0.110    0.045


the clustering outcomes illustrated in the above figures. As discussed above, a small number of clusters in an outcome produces a high recall result, but also increases cluster impurity, especially if the number of resulting clusters is less than the number of true classes. Conversely, a high precision value is generated if the outcome contains a large number of clusters, provided that the clusters are made up of same-class instances (i.e. the case of class fragmentation). These results suggest that the optimum outcome of a trial run using the tf–idf method tends to have a relatively high precision result and a moderately high recall result.

5. Clustering using a hybrid approach

Fig. 7 displays two different clustering outcomes with an almost identical F-measure score, one for each weighting method. The general characteristics of fragmentation and impurity discussed in the previous section apply to both cases. If the small clusters – the group of outliers – in the tf–idf outcome are ignored, the remaining large clusters with minimal impurity can still make an acceptable representation of every true class in the dataset. For example, class A is represented by cluster 10, class B by cluster 7, class C by cluster 4, etc. The same cannot be said for the tf outcome due to the high impurity of cluster 5 (a combination of classes D and G) and cluster 6 (composed mainly of classes C and E, with a couple of instances from other classes). The tf–idf clustering outcome can therefore be reformulated as a classification problem, by splitting the outcome into a test set made up of the outlier cases and a training set consisting of the remaining clusters. Refinement of a high-precision average-recall clustering outcome is possible by a secondary classification step in which each outlier is classified to one of the large clusters. This hybrid approach therefore combines an unsupervised learning method (single pass clustering) with a supervised learning method (text classification) with the objective of improving clustering performance by reducing fragmentation.

Fig. 8 outlines the process used for refining cluster outcomes and evaluating the technique. The process is preceded by performing single pass clustering on the dataset and defining a specific outcome which the process aims at improving. The first step in the refinement process is to define the training and testing sets for the classifier. Outlier instances

Table 2
Matrix of average F-measure scores for baseline and optimum cases, before and after refinement.

Category     Baseline   Optimum   Difference
Original     0.716      0.782     0.066
Refined      0.735      0.844     0.109
Difference   0.019      0.062     0.128

are defined based on a minimum cluster size (smin). Members of any cluster in the original outcome that fails to satisfy the minimum are considered outliers and included in the test set. Accordingly, the larger the minimum limit the smaller the number of clusters in the final outcome. Selecting the minimum cluster size is judgmental, based primarily on knowledge of the dataset and whether or not large clusters are expected. Outliers are extracted and the remaining clusters form the training set and are considered the classes used for classification. The more these clusters correlate with the true classes in the dataset (i.e. the lower their impurity and the better they represent each of the true classes) the better the chances of an improved clustering outcome after refinement.

Having identified both sets, each individual test document is added to the training set in order to be classified to one of the clusters. With each addition, the t–d matrix is developed and then reduced to a dimensionality level that exposes similarities between documents in the set to facilitate the classification step. Based on the results of the evaluation of different text classifiers at varying dimensionality factors in [17], a Rocchio classifier was implemented for the hybrid approach using a dimensionality level of approximately 67% of the available dimensions. Finally, each outlier is classified and grouped with the closest cluster and the refined outcome is evaluated using F-measure to enable comparison between the original clustering outcome and the refined outcome. The outcomes from the trial runs previously performed were used to evaluate the refinement process to obtain a representative estimate of the process's effect on clustering.
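The refinement step can be sketched as follows. This is a simplified illustration only: the per-outlier t–d matrix rebuild and LSA reduction of Fig. 8 are omitted for brevity, and a nearest-centroid assignment stands in for the Rocchio classifier (Rocchio with a single positive centroid per class reduces to this form); the document vectors and cluster sizes are invented for the example.

```python
import numpy as np

def refine(clusters, vectors, s_min):
    """Hybrid refinement sketch: clusters smaller than s_min are broken up
    into outliers (the test set); each outlier is then assigned to the
    surviving core cluster (the training set) whose centroid is most
    similar by cosine similarity, reducing fragmentation."""
    core = [list(c) for c in clusters if len(c) >= s_min]
    outliers = [i for c in clusters if len(c) < s_min for i in c]
    for i in outliers:
        v = vectors[i]
        sims = []
        for members in core:
            centroid = vectors[members].mean(axis=0)
            sims.append(centroid @ v /
                        (np.linalg.norm(centroid) * np.linalg.norm(v) + 1e-12))
        core[int(np.argmax(sims))].append(i)  # join the closest core cluster
    return core

clusters = [[0, 1, 2, 3], [4, 5, 6, 7], [8], [9]]  # two core clusters, two outliers
rng = np.random.default_rng(1)
vectors = rng.random((10, 5))
print(refine(clusters, vectors, s_min=4))
```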

Fig. 9 displays the results of evaluating the hybrid clustering approach using a minimum cluster size (smin) of four. The same baseline as before was used after considering the optimum threshold for each case (i.e. the highest result achieved using lmax across the range of threshold values versus the highest result achieved using lopt). Table 2 is a matrix of the average F-measure scores of the trial runs for the different combinations of original/refined, baseline/optimum. Table 3 and Table 4 are the equivalent matrices for precision and recall.

In general, the optimum cases displayed better precision and F-measure scores than the baseline cases. This indicates LSA's contribution to improved clustering results, but also highlights the importance of identifying the appropriate dimensionality factor in order to achieve such improvements. The optimum cases demonstrated a slight deterioration in recall from the baseline, but not significant enough to prevent an improvement in F-measure scores due to a high increase in the optimum's precision.

A surge in recall and a drop in precision are observed between the original and refined states. The increase in recall is attributed to a reduction in the total number of clusters in the final outcome as a result of

Table 4
Matrix of average recall scores for baseline and optimum cases, before and after refinement.

Category     Baseline   Optimum   Difference
Original     0.678      0.677     −0.001
Refined      0.880      0.875     −0.005
Difference   0.201      0.197     0.197


[Figure: refinement example. The original outcome, (l, h) = (57, 0.24), Seed(23), F-measure = 0.786, P = 0.977, R = 0.658, is split into a training set of core clusters and a test set of outliers; the refined outcome achieves F-measure = 0.893, P = 0.879, R = 0.907.]

Fig. 10. Cluster outcome refinement example.

[Figure: intensity grid of average F(0.5) scores (threshold h vs. dimensionality factor l), with two high-performance regions marked as Region 1 and Region 2.]

Fig. 11. Intensity grid of average F(0.5) scores of ten trial runs.


classification of the outliers, which is expected since the objective of the refinement process is to improve recall by reducing fragmentation. The decrease in precision is a result of impurity, not only from the misclassification of the outliers, but also from the original pre-refined clusters. This tradeoff between the change in precision and recall before and after refinement resulted in an increase in F-measure scores for both the baseline and optimum cases of 1.9% and 6.2%, respectively. Overall, all three metrics experienced an increase from the original-baseline averages to the refined-optimum averages of 4.5%, 19.7% and 12.8% for precision, recall and F-measure, respectively.

A closer look at an actual refined outcome will give substance to the above results. Fig. 10 illustrates the clustering refinement process for a sample outcome. Only one cluster in the original outcome (Cluster 3) contains a misplaced document. This explains the very high precision value. The outcome is also highly fragmented, which explains the medium recall value: fragmentation increases the number of false-negative pairwise relationships and consequently reduces recall. Separation of the outliers using a minimum cluster size of four results in eight remaining clusters used as the training set for the classification step that are very low in impurity and that make a good representation of the true classes of the dataset. Four of the 13 documents in the test set were classified incorrectly, reducing the precision value for the refined outcome. However, accurate classification of the majority of the outliers resulted in a large gain in recall (due to the decrease in the number of false-negative pairwise relationships), ultimately causing a 10.7% increase in the F-measure score.

The success of the proposed hybrid clustering technique thereforerelies on:

• creating ‘good clusters’ using the single pass clustering step—low impurity clusters that match the true classes in the dataset and consequently ensure easier classification of the outlier documents in the next step, and

• selecting the appropriate minimum cluster size that best distinguishes between clusters and outliers (training set and test set) for the classification step.

5.1. Selection of dimensionality and threshold values

Precision of the clustering outcome after the initial step of single pass clustering is a good indicator of the degree of impurity of the generated clusters; the higher the precision the less impure. However, while precision is more important, completely neglecting recall will single out for the classification step an extreme result of a completely fragmented outcome that has a perfect precision value but is composed of a large number of single-instance or low-size clusters—a result that is unsuitable for the classification step. A moderate recall value is required to cause the necessary balance between low impurity and fragmentation.

As demonstrated above, the tf–idf method's optimum clustering outcomes produced fairly consistent results over many trial runs and were generally characterized by high precision values and moderately high recall values (a mean of 0.93 for precision with a standard deviation of 0.04, and a mean of 0.68 for recall with a standard deviation of 0.05). ‘Good clusters’ result from a good choice of factors for the single pass clustering step – dimensionality and threshold – that consistently generate high clustering performance. Fig. 4 above identifies two main regions of high F-measure scores for the tf–idf weighting method based on the average scores of multiple trial runs: a low-threshold/high-dimensionality region, and a high-threshold/low-dimensionality region. So far, the F-measure used in all calculations is the F(1) score; a balanced F-measure that gives equal weights to precision and recall (calculated using F(β) = (β² + 1)PR/(β²P + R) and setting β = 1). Using an unbalanced F-measure, which gives a small advantage


[Figure: F(0.5), precision and recall over 100 trial runs for the two factor combinations; Region 1: dimensionality factor = 16, threshold value = 0.67; Region 2: dimensionality factor = 56, threshold value = 0.24.]

Fig. 12. Consistency of maximum performance factor combinations over 100 trial runs.


to precision over recall, gives a better picture of the range of factor values that are more likely to produce good clusters. Fig. 11 represents the intensity grid of the average F(0.5) score for the same trial runs. A prominent area of high clustering performance appears within the [10, 20] dimensionality range and the [0.7, 0.85] threshold range (region 1). The highest average F(0.5) was 0.85 at a dimensionality factor of 16 and a threshold of 0.67. A smaller region of high performance appears within the [0.20, 0.25] threshold range and the fifties dimensionality range (region 2). The highest score achieved in this region was 0.84 at a threshold of 0.24 and the dimensionality factors 56 and 57.
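The F(β) family used above is a one-liner; the sketch below shows how β = 0.5 rewards the high-precision/moderate-recall profile typical of the tf–idf outcomes (the P and R values are illustrative, close to the reported means):

```python
def f_beta(p, r, beta=1.0):
    """F(beta) = (beta^2 + 1) * P * R / (beta^2 * P + R);
    beta < 1 weights precision more heavily, beta > 1 favours recall,
    beta = 1 is the balanced F-measure."""
    return (beta**2 + 1) * p * r / (beta**2 * p + r) if p + r else 0.0

p, r = 0.93, 0.68  # illustrative high-precision / moderate-recall outcome
print(round(f_beta(p, r, 1.0), 3), round(f_beta(p, r, 0.5), 3))  # 0.786 0.866
```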

To test consistency of results, the factor combination with the maximum result in each region was tested for 100 trials after randomizing

[Figure: cluster listings for the three unique outcomes (R1A, R1B, R1C) generated by the region 1 factors over 100 trial runs.]

Fig. 13. Clustering outcomes for region 1 factors.

the sequence of documents. Fig. 12 displays the variation of the evaluation metrics across the trial runs. The average F(0.5) score for both regions across the 100 trials was the same (0.85); however, the results from region 1 were more consistent. The standard deviation for all three evaluation metrics in region 1 was 0.01, while the standard deviations for precision, recall and F(0.5) in region 2 were 0.07, 0.01 and 0.03, respectively. Moreover, whereas 37 trial runs resulted in a precision value less than 0.9 for the region 2 factors, the lowest precision value for a trial run using the region 1 factors was 0.91.

A look at the actual clusters formed by the factors of each region gives a good indication of consistency of results. Over the 100 trial runs, region 1 factors generated three unique outcomes, illustrated in



Fig. 14. Change in average clustering performance between original and refined outcomes (% change in precision, recall and F-measure against smin values of 2 to 5).

M. Al Qady, A. Kandil / Automation in Construction 42 (2014) 36–49

Fig. 13. Outcome R1A occurred 76 times, while outcomes R1B and R1C occurred 16 and 8 times, respectively. The three outcomes are identical in the number of clusters formed and the composition of each cluster, except for a disagreement over the clusters for documents F2 and E6. On the other hand, while the highest F(0.5) score in all 100 trials for both regions was based on an outcome using region 2 factors, such factors generated 10 unique outcomes ranging in F(0.5) scores from a minimum of 0.62 to a maximum of 0.89.

While factor combinations from region 2 have the potential of producing outcomes that have higher F-measure scores, results vary depending on the order of the documents used during single pass clustering. This can be attributed to the low threshold value and high dimensionality factor of region 2. With high dimensionality, optimal separation of similar instances is not achieved, and a lower threshold is required to achieve good clustering performance. Under these conditions, the number of candidates that satisfy the similarity limit for a forming cluster increases, thereby increasing the competition between clusters over the instances. Since clusters are formed one at a time based on the order of the instances, an early forming cluster develops with a larger pool of candidate instances and is thus given priority over a late forming cluster. The outcome is therefore susceptible to such order. Conversely, for region 1, the polarizing effect of a low dimensionality factor results in effective separation of same-class instances, thus allowing the use of a relatively high threshold value. With limited competition between clusters over the documents as a result of the stricter similarity threshold, the outcomes of clustering are fairly consistent regardless of the sequence of documents used in the process.
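The order sensitivity described above follows directly from how single pass clustering assigns each document to the first sufficiently similar cluster as it arrives. The following minimal Python sketch illustrates the mechanism; it is not the authors' implementation, and the centroid update rule, function names and toy vectors are illustrative assumptions:

```python
import numpy as np

def single_pass_cluster(vectors, threshold):
    """Single pass clustering: each document, taken in order, joins the
    existing cluster whose centroid it is most similar to (cosine), if
    that similarity meets the threshold; otherwise it seeds a new
    cluster. The outcome depends on document order by construction."""
    clusters, centroids = [], []
    for i, v in enumerate(vectors):
        v = v / np.linalg.norm(v)
        best, best_sim = None, threshold
        for c, cen in enumerate(centroids):
            sim = float(v @ (cen / np.linalg.norm(cen)))
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])         # seed a new cluster
            centroids.append(v.copy())
        else:
            clusters[best].append(i)     # join the best matching cluster
            centroids[best] += v
    return clusters

# toy vectors: two well separated groups of two documents each
docs = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
print(single_pass_cluster(docs, threshold=0.9))  # → [[0, 1], [2, 3]]
```

Lowering the threshold widens the pool of candidate clusters competing for each document, which is why the low-threshold region 2 factors make the outcome sensitive to document order.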

5.2. Choice of minimum cluster size

The choice of the minimum cluster size has an impact on the refined outcome's final F-measure score. In Fig. 10, if three is used as the

Fig. 15. Effect of minimum cluster size on cluster refinement: refined outcomes of the same trial run at smin = 2 (F-measure = 0.855; P = 0.953; R = 0.776) and smin = 5 (F-measure = 0.700; P = 0.595; R = 0.851).

minimum cluster size, then cluster 13 would be included in the training set, thereby splitting class A in the refined outcome. In this case the increase in recall will not be the same as for the case of using a minimum cluster size of four, and accordingly a lower final F-measure would be expected. Evaluating with a minimum cluster size of three for the above example yielded P = 0.86, R = 0.778 and F-measure = 0.817; only a 3.1% improvement over the original clustering outcome. To measure the effect of the minimum cluster size on the refined outcome, the above evaluation was repeated for different values of smin ranging between five (the size of the smallest class in the dataset) and two (where only single-instance clusters in the original outcomes are defined as outliers). Fig. 14 illustrates the average difference in precision, recall and F-measure scores between the original and refined outcomes at different minimum cluster size values. For all values of smin, the gain in recall after the refinement process overcomes the loss in precision, resulting in a positive increase in F-measure scores, except at a minimum cluster size of 5, for which a loss in performance occurs after refinement. The highest gain in F-measure scores was at a minimum cluster size of four.
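The outlier definition used in this evaluation (documents belonging to clusters smaller than smin) amounts to a simple partition of the clustering outcome. A minimal sketch, with function and variable names that are illustrative rather than from the paper:

```python
def split_core_outliers(clusters, s_min):
    """Clusters of size >= s_min become the core clusters used as the
    training set; documents in smaller clusters are flagged as outliers
    to be reassigned by a classifier in the refinement step."""
    core = [c for c in clusters if len(c) >= s_min]
    outliers = [d for c in clusters if len(c) < s_min for d in c]
    return core, outliers

# toy outcome: one large cluster, one pair, one singleton
clusters = [[0, 1, 2, 3], [4, 5], [6]]
core, outliers = split_core_outliers(clusters, s_min=3)
print(core, outliers)  # → [[0, 1, 2, 3]] [4, 5, 6]
```

Raising s_min moves whole small clusters into the outlier pool, which is why a value of five (the size of the smallest class) strips entire classes from the training set.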

A comparison of the final refined outcome at both extremes of the smin range reveals the consequences of selecting a specific minimum cluster size. Fig. 15 displays two outcomes of the same trial run based on a minimum cluster size of five and two. At the high end, the number of clusters used for classification tends to be low compared to the number of true classes in the dataset, and the number of outliers tends to be high. Since whole classes are missing from the training set, classification accuracy is expected to be very low. Accordingly, even if such clusters initially have low impurity, classification quickly erodes this advantage and the gain in recall is not sufficient to make any positive impact on the final F-measure score. This case is impractical for information retrieval purposes as the composition of the resulting clusters is too diverse to allow any reasonable assessment of the clusters' content.



Table A.1
Original t–d matrix for example.

D1 D2 D3 D4 D5 G1 G2 G3 G4 G5 G6 G7

Adjacent 0 0 0 0 0 0 0 1 0 0 0 0
Airport 0 0 0 0 0 0 0 1 0 0 0 0
Approval 0 0 0 0 0 0 1 0 0 0 0 0
Area 0 1 0 1 0 0 0 0 0 0 0 0
Continuation 0 0 0 0 0 0 0 0 0 1 1 1
East 0 0 0 0 0 0 0 0 0 1 0 0
Extension 0 0 0 0 0 1 0 0 1 0 0 0
Fence 0 0 0 0 0 1 1 1 1 1 1 1
Gate 0 0 0 0 0 0 1 0 0 0 0 0
Mobilization 0 1 0 1 0 0 0 0 0 0 0 0
North 0 0 0 0 0 0 0 0 0 1 0 0
Office 1 0 1 0 1 0 0 0 0 0 0 0
Old 0 0 0 0 0 0 0 1 0 0 0 0
Re-mobilization 1 0 0 0 0 0 0 0 0 0 0 0
Relocation 0 0 1 0 1 0 0 0 0 0 0 0
Site 0 0 1 0 1 1 0 0 1 0 1 1
Stop 0 0 0 0 0 0 0 1 0 0 0 0
Temporary 0 0 0 0 0 0 1 0 0 0 0 0
Work 0 0 0 0 0 0 0 1 0 0 0 0


The other end of the spectrum (smin = 2) is the case where outliers are only single-instance clusters. For this case, the number of clusters in the training set tends to be high in comparison with the number of true classes in the dataset, and the number of test documents tends to be low. Due to fragmentation of the original outcome, classes may be represented by more than one cluster in the training set. As such, there is a better chance of grouping outliers with similar-class documents, thereby increasing recall and limiting any decrease in precision. However, with multiple classes split over more than one cluster, the final result is still highly fragmented. This case could be considered a very conservative clustering approach and, for practical purposes, can be used as an initial step for simplifying a large dataset into smaller groups of very similar documents.

6. Summary and conclusion

When the project document corpus is complete and appropriately organized (e.g. for previously completed projects), the use of text classifiers for document retrieval is suitable. However, in many cases the document corpus is gradually and continuously developing (as in an ongoing project) and the classes required for training in a supervised learning method are not readily available. Particularly when classes are not predetermined and do not cover the whole spectrum of possible categories, the application of text classification is not straightforward. In this study, an unsupervised learning method was adapted and evaluated for the task of clustering documents, based on textual similarity, into sets of documents that are semantically related. The single pass clustering algorithm was adopted instead of the popular K-means clustering algorithm to avoid the requirement for a predetermined, user-defined cardinality (number of resulting clusters) associated with the latter. However, single pass clustering requires definition of a minimum threshold similarity measure that indicates during the clustering process whether a specific instance belongs to a specific cluster. In addition, single pass clustering is prone to variable clustering outcomes depending on the sequence of the instances used in the clustering process. Single pass clustering was performed on the same dataset under varying threshold values and dimensionality factors to evaluate the ability to identify the correct clusters within the dataset. Results indicate an inverse relationship between threshold and dimensionality: low dimensionality factors require high threshold values to achieve good clustering results and vice versa. For the current dataset, a low dimensionality factor and a high threshold value demonstrated the best performance in terms of precision and consistency, resulting in an average F-measure score of 0.782, a 6.6% increase over the baseline.
To boost recall, single pass clustering was followed by a cluster refinement step in which the resulting clusters were used to train a text classifier for classifying outliers. The average F-measure score after refinement was 0.844, a 6.2% improvement over the unrefined result (a 12.8% improvement over the original baseline). The results were based on repeated trials of different randomizations of the dataset in order to obtain representative values of the performance. In general, it can be concluded that results improved when some level of dimensionality reduction was applied. However, the evaluation showed that the relationship between dimensionality factor and threshold value is not constant, i.e. a misguided choice of dimensionality reduction can result in performance deterioration.
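The refinement step described above can be sketched as follows. The sketch substitutes a simple nearest-centroid (Rocchio-style) classifier for the text classifier used in the study; the function names and toy vectors are assumptions for illustration:

```python
import numpy as np

def refine(vectors, core_clusters, outliers):
    """Refinement step: treat each core cluster as a class, compute its
    centroid from the member document vectors, and assign every outlier
    document to the class with the most similar (cosine) centroid."""
    cents = []
    for c in core_clusters:
        cen = np.mean([vectors[d] for d in c], axis=0)
        cents.append(cen / np.linalg.norm(cen))
    refined = [list(c) for c in core_clusters]
    for d in outliers:
        v = vectors[d] / np.linalg.norm(vectors[d])
        best = max(range(len(cents)), key=lambda i: float(v @ cents[i]))
        refined[best].append(d)
    return refined

# toy example: two core clusters and one outlier document (index 4)
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.6, 0.5]])
print(refine(vecs, [[0, 1], [2, 3]], outliers=[4]))  # → [[0, 1, 4], [2, 3]]
```

Because every outlier is forced into some core cluster, recall rises, while precision can only fall, which matches the trade-off reported for the refined outcomes.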

Results of the evaluation show that textual similarities can be used to reveal semantic relations between documents in the dataset. For document management, this can be used to organize an unorganized document corpus (whether of an ongoing project or a previous project's unclassified corpus) into semantically related groups. The advantage of doing so is realized at the document retrieval stage: a search of the documents, whether by keywords and/or metadata, not only returns the relevant documents (those satisfying the user-defined keywords and/or metadata), but also returns other related documents in the cluster even if their similarity with the keywords is low or if they do not satisfy the metadata criteria [5]. This ensures high recall and guarantees access to the relevant information in the documents. While the proposed approach overcomes the all-inclusive class limitation of text classifiers, the assumption of mutually exclusive clusters remains a limitation of the approach. In practice, project documents may belong to discourses of multiple knowledge topics, and assigning a document to one and only one may cause knowledge gaps in others. Theoretically, the technique may be modified to adopt an any-of approach instead of the current one-of approach; however, evaluation will require a different dataset since all classes in the current dataset are mutually exclusive.
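The cluster-based expansion of search results described here can be sketched with a small hypothetical helper (not from the paper): any cluster containing at least one hit contributes all of its documents to the expanded result set.

```python
def expand_results(hits, clusters):
    """Expand a keyword-search result set to every document that shares
    a cluster with a hit, trading some precision for higher recall."""
    expanded = set(hits)
    for c in clusters:
        if expanded & set(c):   # cluster contains at least one hit
            expanded |= set(c)  # include all of its documents
    return sorted(expanded)

# toy example: a search matching only document 1 returns its whole cluster
clusters = [[0, 1, 2], [3, 4], [5]]
print(expand_results({1}, clusters))  # → [0, 1, 2]
```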

Another limitation is dictated by the size of the dataset used for evaluating the proposed methodology. The impact of the size of the dataset on the results is arguable. On one hand, a large dataset produces a large t–d matrix, which complicates matrix operations and increases computational cost. A large dataset also increases the chance of noisy data, which adversely affects the performance of the text analysis techniques. On the other hand, a small dataset, while easier to manipulate, offers a smaller feature set. Scarcity of features (the evidence used to perform the required text analysis task) can undermine the performance of the evaluated classification or clustering technique. Accordingly, caution should be exercised in extrapolating the results to other datasets. The dataset and the resulting vocabulary are relatively small, making any generalization of the results unjustifiable absent further experimentation on other datasets.

Appendix A

The following example illustrates the application of LSA. The sample is made up of 12 documents organized into two classes, classes D and G. Only the documents' subject headers are used in the analysis (as opposed to the full document body) in order to limit the size of the t–d matrix. The original t–d matrix (based on term frequency) and the reduced t–d matrix (after applying a dimensionality factor of 4) are presented in Tables A.1 and A.2, respectively. On the documents' side, the average pairwise similarity between document vectors of classes D and G increases after applying LSA from 0.28 and 0.44 to 0.39 and 0.72, respectively. On the terms' side, the similarity between the vectors of the terms 'remobilization' and 'relocation' (which were used interchangeably) increased from 0 to 0.95 after dimensionality reduction. Similarly, the similarity between the terms 'fence' and 'gate' increased from 0.38 to 0.87.
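The dimensionality reduction step of LSA can be reproduced with a truncated SVD. The sketch below uses a toy t–d matrix (not the paper's data) to show the effect reported above: two terms that never co-occur, used interchangeably across documents of the same class, become similar after rank reduction.

```python
import numpy as np

def lsa_reduce(td, k):
    """Rank-k reconstruction of a term-document matrix via truncated SVD
    (the dimensionality reduction step of LSA)."""
    U, s, Vt = np.linalg.svd(td, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy t-d matrix: a shared term occurs in all four documents, while the
# two interchangeable terms each occur in a different pair of documents
td = np.array([[1.0, 1.0, 1.0, 1.0],   # shared term (e.g. 'site')
               [1.0, 1.0, 0.0, 0.0],   # e.g. 'remobilization'
               [0.0, 0.0, 1.0, 1.0]])  # e.g. 'relocation'
print(cosine(td[1], td[2]))              # 0.0: the terms never co-occur
red = lsa_reduce(td, k=1)
print(round(cosine(red[1], red[2]), 3))  # ≈ 1.0 after rank-1 reduction
```

The shared term links the two document pairs, so the dominant singular direction assigns the interchangeable terms nearly identical reduced vectors, mirroring the jump from 0 to 0.95 in the example above.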


Table A.2
Reduced t–d matrix for example.

D1 D2 D3 D4 D5 G1 G2 G3 G4 G5 G6 G7

Adjacent 0.044 0.000 –0.006 0.000 –0.006 0.028 0.110 0.980 0.028 –0.042 –0.032 –0.032

Airport 0.044 0.000 –0.006 0.000 –0.006 0.028 0.110 0.980 0.028 –0.042 –0.032 –0.032

Approval –0.067 0.000 –0.105 0.000 –0.105 0.086 0.127 0.110 0.086 0.171 0.119 0.119

Area 0.000 1.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

Continuation –0.147 0.000 –0.056 0.000 –0.056 0.493 0.409 –0.107 0.493 0.661 0.625 0.625

East –0.107 0.000 –0.145 0.000 –0.145 0.134 0.171 –0.042 0.134 0.270 0.196 0.196

Extension 0.028 0.000 0.196 0.000 0.196 0.326 0.173 0.055 0.326 0.267 0.359 0.359

Fence –0.143 0.000 0.028 0.000 0.028 0.933 0.819 1.038 0.933 1.057 1.071 1.071

Gate –0.067 0.000 –0.105 0.000 –0.105 0.086 0.127 0.110 0.086 0.171 0.119 0.119

Mobilization 0.000 1.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

North –0.107 0.000 –0.145 0.000 –0.145 0.134 0.171 –0.042 0.134 0.270 0.196 0.196

Office 0.440 0.000 0.954 0.000 0.954 0.210 –0.278 0.031 0.210 –0.397 0.069 0.069

Old 0.044 0.000 –0.006 0.000 –0.006 0.028 0.110 0.980 0.028 –0.042 –0.032 –0.032

Re–mobilization 0.089 0.000 0.176 0.000 0.176 0.014 –0.067 0.044 0.014 –0.107 –0.020 –0.020

Relocation 0.351 0.000 0.779 0.000 0.779 0.196 –0.211 –0.013 0.196 –0.290 0.089 0.089

Site 0.339 0.000 1.063 0.000 1.063 0.881 0.200 –0.022 0.881 0.368 0.877 0.877

Stop 0.044 0.000 –0.006 0.000 –0.006 0.028 0.110 0.980 0.028 –0.042 –0.032 –0.032

Temporary –0.067 0.000 –0.105 0.000 –0.105 0.086 0.127 0.110 0.086 0.171 0.119 0.119

Work 0.044 0.000 –0.006 0.000 –0.006 0.028 0.110 0.980 0.028 –0.042 –0.032 –0.032


References

[1] M. Al Qady, A. Kandil, Document management in construction: practices and opinions, Journal of Construction Engineering and Management 139 (10) (2013) 06013002-1–06013002-7.

[2] C.H. Caldas, L. Soibelman, J. Han, Automated classification of construction project documents, Journal of Computing in Civil Engineering 16 (4) (2002) 234–243.

[3] C.H. Caldas, L. Soibelman, Automating hierarchical document classification for construction management information systems, Automation in Construction 12 (4) (2003) 395–406.

[4] W.B. Frakes, R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.

[5] C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, New York, 2008.

[6] S. Saitta, P. Kripakaran, B. Raphael, I.F. Smith, Improving system identification using clustering, Journal of Computing in Civil Engineering 22 (5) (2008) 292–302.

[7] T. Cheng, J. Teizer, Modeling tower crane operator visibility to minimize the risk of limited situational awareness, Journal of Computing in Civil Engineering (Dec. 14, 2012), http://dx.doi.org/10.1061/(ASCE)CP.1943-5487.0000282 (Epub).

[8] H.S. Ng, A. Toukourou, L. Soibelman, Knowledge discovery in a facility condition assessment database using text clustering, Journal of Infrastructure Systems 12 (1) (2006) 50–59.

[9] O. Raz, R. Buchheit, M. Shaw, P. Koopman, C. Faloutsos, Detecting semantic anomalies in truck weigh-in-motion traffic data using data mining, Journal of Computing in Civil Engineering 18 (4) (2004) 291–300.

[10] W. Guo, L. Soibelman, J.H. Garrett Jr., Visual pattern recognition supporting defect reporting and condition assessment of wastewater collection systems, Journal of Computing in Civil Engineering 23 (3) (2009) 160–169.

[11] S. Lee, L. Chang, Digital image processing methods for assessing bridge painting rust defects and their limitations, Proc. of the International Conference on Computing in Civil Engineering, American Society of Civil Engineers, Cancun, Mexico, 2005.

[12] I. Brilakis, L. Soibelman, Y. Shinagawa, Material-based construction site image retrieval, Journal of Computing in Civil Engineering 19 (4) (2005) 341–355.

[13] J. Gong, C.H. Caldas, Learning and classifying motions of construction workers and equipment using bag of video feature words and Bayesian learning methods, Proc. of the International Workshop on Computing in Civil Engineering, American Society of Civil Engineers, Miami, Florida, United States, 2011.

[14] V. Escorcia, M. Dávila, M. Golparvar-Fard, J. Niebles, Automated vision-based recognition of construction worker actions for building interior construction operations using RGBD cameras, Proc. of the Construction Research Congress 2012, American Society of Civil Engineers, West Lafayette, Indiana, United States, 2012.

[15] M. Al Qady, A. Kandil, Automatic document classification using a successively evolving dataset, Proc. of the 2011 3rd International/9th Construction Specialty Conference, Curran Associates, Inc., Ottawa, Ontario, Canada, 2011.

[16] T.K. Landauer, P.W. Foltz, D. Laham, Introduction to latent semantic analysis, Discourse Processes 25 (2&3) (1998) 259–284.

[17] M. Al Qady, A. Kandil, Automatic classification of project documents based on text content, Journal of Computing in Civil Engineering (June 20, 2013), http://dx.doi.org/10.1061/(ASCE)CP.1943-5487.0000338 (Epub).