
Research Article

On the Use of Self-Organizing Map for Text Clustering in Engineering Change Process Analysis: A Case Study

Massimo Pacella, Antonio Grieco, and Marzia Blaco

Dipartimento di Ingegneria dell'Innovazione, Università del Salento, 73100 Lecce, Italy

Correspondence should be addressed to Marzia Blaco; marzia.blaco@unisalento.it

Received 17 March 2016; Accepted 30 October 2016

Academic Editor: Jens Christian Claussen

Copyright © 2016 Massimo Pacella et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In modern industry, the development of complex products involves engineering changes that frequently require redesigning or altering the products or their components. In an engineering change process, engineering change requests (ECRs) are documents (forms) with parts written in natural language describing a suggested enhancement or a problem with a product or a component. ECRs initiate the change process and promote discussions within an organization that help to determine the impact of a change and the best possible solution. Although ECRs can contain important details, that is, recurring problems or examples of good practice repeated across a number of projects, they are often stored but not consulted, missing important opportunities to learn from previous projects. This paper explores the use of the Self-Organizing Map (SOM) for the problem of unsupervised clustering of ECR texts. A case study is presented in which ECRs collected during the engineering change process of a railway industry are analyzed. The results show that SOM text clustering has good potential to improve overall knowledge reuse and exploitation.

1. Introduction

The development of complex products, such as trains or automobiles, involves engineering changes that frequently require redesigning or altering the products and their components. As defined by Jarratt et al. [1], "engineering change is an alteration made to parts, drawings or software that have already been released during the design process. The change can be of any size or type, can involve any number of people and can take any length of time." A change may encompass any modification to the form, fit, and/or function of the product as a whole or in part, or to its materials, and may alter the interactions and dependencies of the constituent elements of the product. A change can be needed to solve quality problems or to meet new customer requirements. Although engineering change management was historically seen as a typical design and manufacturing research field, several contributions highlighted the effect of engineering change on other business processes such as material requirement planning [2] and enterprise resource planning [3, 4]. An overview of the engineering change process and a big picture of the literature on engineering change management are provided, respectively, by Jarratt et al. [5] and Hamraz et al. [6].

The engineering change request (ECR) is the document which initiates the engineering change process. An ECR is used to describe a required change or a problem which may exist in a given product. After the ECR, the impact of a change is discussed among the involved stakeholders and the best possible solution is identified.

Once the implementation of a change is completed, too often ECRs are no longer consulted by those who could benefit from them. However, reviewing the ECR documents could offer a chance to improve both the design of a product and the engineering change process. A change may be a chance to both improve the product and do things "better next time" [9]. ECRs are documents containing structured and unstructured data which, if analyzed, may be useful to discover information relating to recurring problems and solutions adopted in the past.

As described in Hamraz et al. [6], a lot of the literature concerns the prechange stage of the process and proposes methods to prevent or to ease the implementation of engineering changes before they occur. In contrast, the

Hindawi Publishing Corporation, Computational Intelligence and Neuroscience, Volume 2016, Article ID 5139574, 11 pages, http://dx.doi.org/10.1155/2016/5139574


Figure 1: SOM training algorithm. Each neuron has a prototype vector m_i, which corresponds to a point in the multidimensional input space. An input vector x_j selects the neuron with the closest m_i (the winner node, or BMU, of index c(x_j), with neighborhood N_c) in the bidimensional output space. The weight vector of the winning output neuron and its neighbors is adjusted through the quantity Δm_i(t). Adapted from Ritter et al. [7].

postchange stage involves fewer publications and deals with the ex post facto exploration of the effect of implemented engineering changes. The analysis of the engineering change process belongs to the postchange stage, and there are only few approaches concerning the analysis of engineering change data in the complex products industry. In this context, one of the main challenges is dealing with the free-form text contained in engineering change documents, which makes the data more difficult to query, search, and extract. This paper focuses on the unstructured data contained in ECRs and proposes text clustering for the postchange analysis of the engineering change process.

Text clustering is an unsupervised learning method where similar documents are grouped into clusters. The goal is to create clusters that are coherent internally but clearly different from each other. Among the clustering methods proposed in the literature, the Self-Organizing Map (SOM) has attracted many researchers in recent years. SOM is a neural-network model and algorithm that implements a characteristic nonlinear projection from the high-dimensional space of input signals onto a low-dimensional regular grid, which can be effectively utilized to visualize and explore properties of the data [10]. With respect to other text clustering methods, SOM allows visualizing the similarity between documents within the low-dimensional grid. Hence, similar documents may be found in neighboring regions of the grid.

In the literature, text mining methods have been proposed in support of the engineering change process by Sharafi et al. [11], Elezi et al. [12], and Sharafi [13]. In particular, Sharafi et al. [11] focused on the causes of changes contained in ECRs and calculated term occurrences for all ECRs in order to analyze occurrences of the keywords in different projects and to find patterns in the data. Elezi et al. [12] employed a semiautomatic text mining process to classify the causes of iteration in engineering changes. As a result, cost and technical categories of causes were identified as the main reasons for the occurrence of iterations. Sharafi [13] applied Knowledge Discovery in Databases methods to analyze historical engineering change data in order to gain insights in the form of patterns within the database. In detail, a part of the study concerned the application of K-means, K-Medoids, DBSCAN, and Support Vector Clustering methods to cluster ECR documents of an automotive manufacturer.

This paper explores the use of SOM for the problem of unsupervised clustering of ECR documents. A case study is presented in which ECRs collected during the engineering change process of a railway industry are analyzed. The results show that SOM text clustering has great potential to improve overall knowledge reuse and exploitation in an engineering change process.

The remainder of the paper is organized as follows. In Section 2, the basic concepts of the SOM theory are introduced. In Section 3, the SOM-based text clustering method is described. In Section 4, the engineering change process in industry is described. In Section 5, the case study and the experimental results are both discussed. In Section 6, conclusions are given.

2. The SOM Algorithm

The SOM, originally proposed by Kohonen [14], is based on the idea that systems can be designed to emulate the collective cooperation of the neurons in the human brain. It is an unsupervised machine learning method widely used in data mining, visualization of complex data, image processing, speech recognition, process control, diagnostics in industry and medicine, and natural language processing [15].

The SOM algorithm maps M-dimensional input vectors x_j to a two-dimensional array of neurons (the map) according to their characteristic features. It reduces the dimensions of the data to a map, helps to understand high-dimensional data, and groups similar data together. A simple SOM consists of two layers: the first includes the nodes in the input space, and the second the nodes in the output space. A representation of a SOM with output nodes in a two-dimensional grid view is provided in Figure 1. The SOM consists of P units; each unit of index i is associated with an M-dimensional prototype vector m_i in the input space and a position vector r_i on a low-dimensional regular grid in the output space. The steps of the SOM learning algorithm are as follows.

(1) Initialization. Start with initial values of the prototype vectors m_i. In the absence of any prior information, the values of the prototype vectors m_i can be random or linear, and they are adjusted while the network learns.

(2) Sampling. Select a vector x_j from the training input space. The selection of x_j can be random.

(3) Matching. Determine the Best Matching Unit (BMU). Vector x_j is compared with all the prototype vectors, and the index c(x_j) of the BMU, that is, the prototype vector m_c which is closest to x_j, is chosen according to the smallest Euclidean distance as follows:

‖x_j − m_c‖ = min_i ‖x_j − m_i‖  (1)

(4) Updating. Update the BMU and its neighbors. The prototype vectors of the winning output neuron and its neighbors are adjusted as

m_i(t + 1) = m_i(t) + Δm_i(t)  (2)

where t = 0, 1, 2, … is an index of time. The value of Δm_i(t) in (2) is computed as follows:

Δm_i(t) = α(t) h_ci(t) (x_j(t) − m_i(t))  (3)

where α(t) is the learning-rate factor and h_ci(t) the neighborhood function. In particular, the learning-rate factor α(t) lies in [0, 1] and is monotonically decreasing during the learning phase. The neighborhood function h_ci(t) depends on the distance between the nodes of indexes c and i in the output layer grid. A widely applied neighborhood kernel can be written in terms of the Gaussian function:

h_ci(t) = exp(−‖r_c − r_i‖² / (2σ²(t)))  (4)

where r_c and r_i are the position vectors of nodes c and i, and the parameter σ(t) defines the width of the kernel, which corresponds to the radius of the neighborhood N_c(t). N_c(t) refers to a neighborhood set of array points around the node of index c (Figure 1). The value h_ci(t) decreases during learning, from an initial value often comparable to the dimension of the output layer grid to a value equal to one.

During the learning of the SOM, phases (2)–(4) are repeated for a number of successive iterations, until the prototype vectors m_i represent, as much as possible, the input patterns x_j that are closer to the neurons in the two-dimensional map. After initialization, the SOM can be trained in a sequential or batch manner [8]. Sequential training is repetitive, as batch training is, but instead of sending all data vectors to the map for weight adjustment, one data vector at a time is sent to the network. Once the SOM is trained, each input vector is mapped to one neuron of the map, reducing the high-dimensional input space to a low-dimensional output space. The map size depends on the type of application: a bigger map reveals more details of information, whereas a smaller map is chosen to guarantee greater generalization capability.

Before application, the SOM method requires predefining the size and structure of the network, the neighborhood function, and the learning function. These parameters are generally selected on the basis of heuristic information [7, 8, 16, 17].
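The sequential learning loop in steps (1)–(4) can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not the SOM Toolbox code used later in the paper; the function name, the random initialization, and the linear decay schedules for α(t) and σ(t) are assumptions:

```python
import numpy as np

def train_som(X, grid_h=7, grid_w=5, n_iter=1000, seed=0):
    """Toy sequential SOM. X is (N, M); returns the (P, M) prototype
    vectors m_i and their (P, 2) grid positions r_i, P = grid_h * grid_w."""
    rng = np.random.default_rng(seed)
    P, M = grid_h * grid_w, X.shape[1]
    # (1) Initialization: random prototype vectors in the input space.
    m = rng.normal(size=(P, M))
    # Fixed position vectors r_i on the 2-D output grid (rectangular here).
    r = np.array([(a, b) for a in range(grid_h) for b in range(grid_w)], float)
    for t in range(n_iter):
        # Monotonically decreasing learning rate and kernel width (assumed schedules).
        alpha = 0.5 * (1 - t / n_iter)
        sigma = max(grid_h, grid_w) / 2 * (1 - t / n_iter) + 0.5
        # (2) Sampling: pick one input vector x_j at random.
        x = X[rng.integers(len(X))]
        # (3) Matching: BMU index c minimizes the Euclidean distance, eq. (1).
        c = np.argmin(np.linalg.norm(x - m, axis=1))
        # (4) Updating: Gaussian neighborhood h_ci, eq. (4), then eqs. (2)-(3).
        h = np.exp(-np.sum((r[c] - r) ** 2, axis=1) / (2 * sigma ** 2))
        m += alpha * h[:, None] * (x - m)
    return m, r
```

After training, each document vector is mapped to its BMU, so inputs that are close in the M-dimensional space land on neighboring grid units.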

2.1. SOM Cluster Visualization. The SOM is an extremely versatile tool for visualizing high-dimensional data in low dimension. For visualization of the SOM, both the unified-distance matrix (U-matrix) [18] and Component Planes [19] are used. The U-matrix calculates distances between neighboring map units, and these distances can be visualized to represent clusters using a color scale on the map.

The U-matrix technique is a single plot that shows cluster borders according to dissimilarities between neighboring units. The distance ranges of the U-matrix visualized on the map are represented by different colors (or grey shades). Red colors correspond to large distances, that is, large gaps exist between the prototype vector values in the input space; blue colors correspond to small distances, that is, map units are strongly clustered together. U-matrices are useful tools for visualizing clusters in input data without having any a priori information about the clusters.
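The idea can be sketched as follows. This is a simplified U-matrix that averages each unit's prototype distance to its four rectangular-grid neighbors, rather than the hexagonal-lattice version produced by the SOM Toolbox:

```python
import numpy as np

def u_matrix(prototypes, grid_h, grid_w):
    """Mean distance of each unit's prototype to its grid neighbors.
    prototypes: (grid_h * grid_w, M) array, row-major over the grid."""
    m = prototypes.reshape(grid_h, grid_w, -1)
    U = np.zeros((grid_h, grid_w))
    for i in range(grid_h):
        for j in range(grid_w):
            d = [np.linalg.norm(m[i, j] - m[a, b])
                 for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                 if 0 <= a < grid_h and 0 <= b < grid_w]
            U[i, j] = np.mean(d)  # large value = cluster border
    return U
```

Plotting U with a red-to-blue color scale gives the kind of display shown later in Figure 3: high-valued (red) ridges separate low-valued (blue) clustered regions.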

Another important visualization tool is the Component Plane, that is, a grid whose cells contain the value of the p-th dimension of a prototype vector, displayed by variation of color. It helps to analyze the contribution of each variable to the cluster structures and the correlation between the different variables in the dataset.

2.2. SOM Clustering Using K-Means Algorithm. One of the drawbacks of SOM analysis is that, unlike other cluster methods, the SOM has no distinct cluster boundaries. When datasets become more complex, it is not easy to distinguish the clusters by pure visualization. As described in Vesanto and Alhoniemi [10], in SOM the prototype nodes can be used for clustering instead of the full input dataset. Let C_1, …, C_k, …, C_K denote a cluster partition composed of K clusters. The choice of the best clustering can be determined by applying the well-known K-means algorithm [20]. This algorithm minimizes an error function computed on the sum of squared distances of each data point in each cluster. The algorithm iteratively computes a partitioning for the data and updates cluster centers based on the error function. In this approach, the number of clusters K has to be fixed a priori. Therefore, the K-means algorithm is run multiple times for each K ∈ [2, √N], where N is the number of samples. The best number of clusters K* can be selected based on the Davies-Bouldin Index (DBI) [21]. This index is based on a ratio of within-cluster and between-cluster distances and is calculated as

DBI(K) = (1/K) Σ_{k=1..K} max_{k≠l} [(Δ(C_k) + Δ(C_l)) / δ(C_k, C_l)]  (5)


where K is the number of clusters, and Δ(C_k), Δ(C_l), and δ(C_k, C_l) are the within-cluster and between-cluster distances, respectively. The optimum number of clusters K* corresponds to the minimum value of DBI(K). A SOM neural network combined with other clustering algorithms was used in Yorek et al. [22] for visualization of students' cognitive structural models.
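A minimal sketch of eq. (5), assuming the common centroid-based choices for the two distances (the paper does not state its exact definitions of Δ and δ): Δ(C_k) as the mean distance of cluster members to their centroid, and δ(C_k, C_l) as the distance between centroids:

```python
import numpy as np

def davies_bouldin(X, labels):
    """DBI per eq. (5). X: (N, M) data; labels: (N,) cluster assignments."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    # Delta(C_k): within-cluster spread around the centroid.
    spread = np.array([np.mean(np.linalg.norm(X[labels == k] - cents[i], axis=1))
                       for i, k in enumerate(ks)])
    K = len(ks)
    total = 0.0
    for i in range(K):
        # Worst-case ratio against every other cluster, as in eq. (5).
        ratios = [(spread[i] + spread[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(K) if j != i]
        total += max(ratios)
    return total / K
```

In the procedure described above, one would run K-means on the SOM prototype vectors for each K in [2, √N] and keep the partition with the smallest DBI.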

3. SOM-Based Text Clustering

Text clustering is an unsupervised process used to separate a document collection into clusters on the basis of the similarity relationships between documents in the collection [17]. Let C = {d_1, …, d_j, …, d_N} be a collection of N documents to be clustered. The purpose of text clustering is to divide C into clusters C_1, …, C_k, …, C_K, with C_1 ∪ ⋯ ∪ C_k ∪ ⋯ ∪ C_K = C.

SOM text clustering can be divided into two main phases [23, 24]. The first phase is document preprocessing, which consists in using the Vector Space Model (VSM) to generate output document vectors from the input text documents. The second one is document clustering, which applies the SOM to the generated document vectors to obtain the output clusters.

3.1. Document Preprocessing. An important preprocessing step for text clustering is to consider how the text content can be represented in the form of a mathematical expression for further analysis and processing.

By means of the VSM, each document d_j (j = 1, …, N) can be represented as a vector in an M-dimensional space. In detail, each document d_j can be represented by a numerical feature vector x_j:

x_j = [w_1j, …, w_pj, …, w_Mj]  (6)

Each element w_pj of the vector usually represents a word (or a group of words) of the document collection, that is, the size of the vector is defined by the number of words (or groups of words) of the complete document collection.

The simplest approach is to assign to each document the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme [25, 26]. The TF-IDF weighting scheme assigns to each term p in the j-th document a weight w_pj computed as

w_pj = tf_pj × log(N / df_p)  (7)

where tf_pj is the term frequency, that is, the number of times that term p appears in document j, and df_p is the number of documents in the collection which contain term p.

According to the TF-IDF weighting scheme, w_pj is:

(1) higher when the term p occurs many times within a small number of documents (thus lending high discriminating power to those documents);

(2) lower when the term p occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);

(3) lower when the term p occurs in virtually all documents.
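The scheme in eq. (7) can be sketched as follows; the toy corpus and whitespace tokenization are illustrative, and the natural logarithm is used (the log base only rescales all weights uniformly):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per
    document, with w_pj = tf_pj * log(N / df_p) as in eq. (7)."""
    N = len(docs)
    # df_p: number of documents containing term p.
    df = Counter(term for doc in docs for term in set(doc))
    return [{term: tf * math.log(N / df[term])
             for term, tf in Counter(doc).items()} for doc in docs]

docs = [["antenna", "broken", "antenna"],
        ["panel", "broken"],
        ["panel", "leak"]]
weights = tfidf(docs)
# "antenna" occurs twice in doc 0 and in no other document: high weight.
# "broken" occurs once per document in two of three documents: lower weight.
```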

Before applying the TF-IDF weighting scheme, the size of the list of terms created from the documents can be reduced using stop-word removal and stemming [23, 24].

In text-based documents, in fact, there are a great number of noninformative words, such as articles, prepositions, and conjunctions, called stop words. A stop-list is usually built with words that should be filtered out in the document representation process. The words to be included in the stop-list are language and task dependent; however, a set of general words can be considered stop words for almost all tasks, such as "and" and "or". Words that appear in very few documents are also filtered out.

Another common preprocessing phase is stemming, where the word stem is derived from the occurrence of a word by removing case and inflection information. For example, "computes", "computing", and "computer" are all mapped to the same stem "comput". Stemming does not significantly alter the information included in the document representation, but it does avoid feature expansion.
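The two reduction steps can be sketched together as follows. The stop-list and the crude suffix-stripping rules are illustrative stand-ins; a real system would use a task-specific stop-list and a Porter-style stemmer:

```python
STOP_WORDS = {"and", "or", "the", "a", "of", "in"}  # language/task dependent

def crude_stem(word):
    """Toy suffix stripping; only a rough sketch of what a stemmer does."""
    for suffix in ("ing", "es", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:len(word) - len(suffix)]
    return word

def preprocess(text):
    # Stop-word removal first, then stemming of the surviving tokens.
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    return [crude_stem(w) for w in tokens]
```

On the document's own example, "computes", "computing", and "computer" all reduce to the stem "comput".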

3.2. SOM Text Clustering. Once the feature vector x_j in (6) associated with each text d_j is obtained, the SOM algorithm described in Section 2 can be applied for text clustering. The text clustering method explained above is known as "SOM plus VSM"; other variants of it have been proposed by Liu et al. [27, 28]. An overview of the application of SOM in text clustering is provided by Liu et al. [17]. This kind of clustering method has been employed in domains such as patents [29], financial services [30], and public policy analysis [31].

4. The Engineering Change Process in Complex Products Industry

For complex products, such as trains, automobiles, or aircraft, engineering changes are unavoidable, and products or components have to be redesigned and retrofitted to accommodate the new changes to new installations and products. In these environments, an engineering change can involve the risk of due time delay. Huang et al. [32] carried out a survey on the effects of engineering changes in four manufacturing industries and found that the time invested in processing an engineering change varies from 2 to 36 person-days. In Angers [33], it is estimated that more than 35% of today's manufacturing resources are just devoted to managing changes to engineering drawings, manufacturing plans, and scheduling requirements. Engineering change processes in complex environments, such as the automobile, train, and aviation industries, were also studied by Leng et al. [3] and Subrahmanian et al. [34].

The phases of a real engineering change process in the complex products industry can be summarized as follows (Figure 2).

(1) A request for an engineering change is made and sent to an engineering manager. In this phase, standard ECR forms are used, outlining the reason for the change, the type of the change, which components or systems are likely to be affected, the person and the department making the request, and so forth.

Figure 2: The real engineering change process in complex products environments. Adapted from Jarratt et al. [5].

(2) Potential solutions to the request for change are identified.

(3) Technical evaluation of the change is carried out. In this phase, the technical impact of implementing each solution is assessed. Various factors are considered, for example, the impact upon design and product requirements, the production schedule, and the resources to be devoted.

(4) Economic evaluation of the change is performed. The economic risk of implementing each solution is assessed. In this phase, costs related to extra production times, replacements of materials, penalties for missed due dates, and so forth are estimated.

(5) Once a particular solution is selected, it is approved or rejected. The change is reviewed and a cost-benefit analysis is carried out. When a solution is approved, the engineering change order is prepared and issued.

(6) The engineering change is implemented, and the documents, such as drawings, that are to be updated are identified.

(7) Update of the as-built documents occurs. As-built documents are usually the original design documents, revised to reflect any changes made during the process, that is, design changes, material changes, and so forth.

Iterations of the process occur, for example, when a particular solution has a negative impact on product requirements or is too risky to be implemented, so the process returns to phase (2) and another solution is identified. Another iteration is possible when the costs of a solution are too high or more risk analysis is required, or when the proposed solution is completely refused.

As shown in Figure 2, no review process of similar changes faced in the past is carried out during the process or at its end. This aspect is emphasized by Jarratt et al. [5], who highlight that, after a period of time, a change should be reviewed to verify whether it achieved what was initially intended and what lessons can be learned for future change processes. Various factors can discourage examining the solutions adopted in the past for a particular change. First of all, there is the lack of appropriate methods to analyze the documents collected during the process, that is, the ECRs. ECRs are often forms containing parts written in natural language. Analyzing these kinds of documents in the design phase of a product or a component, or when a new change request occurs, could be very time consuming without an appropriate solution.

In this context, SOM text clustering can improve the process. When a new ECR occurs, in fact, ECRs managed in the past that are similar to the current request could be analyzed in order to evaluate the best solution and to avoid repeating mistakes made in the past. In order to explore the existence of similarity between different ECR texts, the first step is to verify the potential clustering present in the analyzed dataset. The application of SOM text clustering to ECR documents is explored in the next section.

5. The Use of SOM Text Clustering for ECRs Analysis

In order to test SOM text clustering, we used a dataset of N = 54 ECRs representing engineering changes managed during the engineering change process of a railway company. The dataset included the natural-language descriptions of the causes of changes contained in the ECR forms. The documents were written in the Italian language, and the VSM in Section 3 was used to generate output vectors from the input text documents. The number of terms, that is, the dimension of each vector x_j associated with each document in the dataset after the stop-word removal and stemming processes, was equal to M = 361.

Figure 3: U-matrix of the 54 ECR texts preprocessed through the VSM. The color scale is related to distances between map units. Red colors represent large distances and blue colors represent small distances.

In our work, we used the MATLAB software. Specifically, the Text to Matrix Generator (TMG) toolbox [35] for document preprocessing and the SOM Toolbox [8] for SOM text clustering were employed.

The map size of the SOM is computed through the heuristic formula in Vesanto et al. [8]. In detail, the number of neurons is computed as 5√N, where N is the number of training samples. The map shape is a rectangular grid with a hexagonal lattice. The neighborhood function is Gaussian, and the map is trained using the batch version of SOM. After the map is trained, each data vector is mapped to the most similar prototype vector in the map, that is, the BMU, which results from the matching step in the SOM algorithm. In our case, the network structure is a 2D lattice of 7 × 5 hexagonal units.
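The arithmetic behind the chosen grid can be checked directly. With N = 54 documents, the heuristic gives 5√54 ≈ 36.7 units, and a 7 × 5 grid (35 units) is the nearby rectangular shape; the simple rounding below is an assumption, since the SOM Toolbox also adjusts the side ratio from the data:

```python
import math

N = 54                      # number of training samples (ECR documents)
munits = 5 * math.sqrt(N)   # heuristic number of map units, about 36.7
grid = (7, 5)               # chosen rectangular grid: 35 hexagonal units
assert abs(grid[0] * grid[1] - munits) < 2  # grid size close to the heuristic
```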

A first step in SOM-based cluster analysis is visual inspection through the U-matrix and Component Planes.

The U-matrix obtained from the application of SOM to the ECR dataset is shown in Figure 3. In order to represent additional information (i.e., distances), the SOM map size is augmented by inserting an additional neuron between each pair of neurons and reporting the distance between the two neurons. The U-matrix and the different colors linked to the distance between neurons on the map show that five clusters are present in the dataset. Each cluster has been highlighted by a circle.

Besides looking at the overall differences in the U-matrix, it is interesting as well to look at the differences between each component present in the input vectors, meaning that we look at the differences regarding each component associated with a single "term" in the input dataset. The total number of Component Planes obtained from the SOM corresponds to the total number of terms in our dataset, that is, M = 361. For illustrative purposes, Figure 4 shows two Component Planes chosen as an example. The first Component Plane (Figure 4(a)) is associated with the term of index p = 48, that is, the term "Antenna"; the second one (Figure 4(b)) is related to the term of index p = 195, that is, the term "Metal-Sheet". The difference between the two terms in the dataset can be represented by considering, for example, the map unit in the top left corner of the two figures. This map unit has high values for the variable "Term p = 48" and low values for the variable "Term p = 195". From the observation of the Component Planes, we can conclude that there is no correlation between the two terms. As a matter of fact, these two terms were never used together in the same documents.

Table 1: Class label and number of ECR texts.

Class                     Label   Number of ECR texts
(1) Metal-Sheet           MS      12
(2) Carter                CR      8
(3) Antenna               AN      8
(4) Semi-Finished Round   SR      7
(5) Hydraulic Panel       HP      9
(6) Pneumatic System      PS      10

As shown above, the U-matrix and the Component Planes allow obtaining a rough cluster structure of the ECR dataset. To get a finer clustering result, the prototype vectors from the map were clustered using the K-means algorithm. The best number of clusters in the SOM map grid can be determined by using the DBI values. Figure 5(a) shows the DBI values obtained by varying the number of clusters K in [2, 7]. The elbow point in the graph shows that the optimum number of clusters K*, corresponding to the minimum value of DBI(K), is equal to 5. The clustering result of SOM obtained by using K-means with K* = 5 clusters is shown in Figure 5(b). The BMUs belonging to each cluster are represented with a different color, and in each hexagon the number of documents associated with each BMU is provided.
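A minimal sketch of this model-selection step is given below: plain Lloyd's K-means with deterministic farthest-point seeding and a hand-rolled Davies-Bouldin Index, run on synthetic stand-ins for the prototype vectors (the paper uses the SOM Toolbox implementations, so names and schedules here are illustrative assumptions).

```python
import numpy as np

def kmeans(X, k, iters=50):
    # Deterministic farthest-point seeding, then plain Lloyd iterations.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def davies_bouldin(X, labels, centers):
    # DBI: average, over clusters, of the worst (scatter_i + scatter_j) /
    # separation(i, j) ratio; lower is better.
    k = len(centers)
    s = np.array([np.linalg.norm(X[labels == j] - centers[j], axis=1).mean()
                  for j in range(k)])
    worst = [max((s[i] + s[j]) / np.linalg.norm(centers[i] - centers[j])
                 for j in range(k) if j != i) for i in range(k)]
    return float(np.mean(worst))

# Toy stand-in for the SOM prototype vectors: three separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, size=(12, 2))
               for c in ([0, 0], [5, 0], [0, 5])])
dbi = {k: davies_bouldin(X, *kmeans(X, k)) for k in range(2, 8)}
best_k = min(dbi, key=dbi.get)   # analogous to K* at the DBI minimum
```

On this toy data the DBI minimum correctly recovers the three underlying groups, mirroring how K* = 5 is read off Figure 5(a).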

5.1. External Validation of SOM Text Clustering. In the reference case study, the true classification of each ECR in the dataset was provided by process operators. Therefore, this information was used in our study in order to perform an external validation of the SOM text clustering. In particular, each ECR text was analyzed and classified with reference to the main component involved in the engineering change (namely, "Metal-Sheet", "Carter", "Antenna", "Semi-Finished Round", "Hydraulic Panel", and "Pneumatic System"). Table 1 reports the number of ECRs related to each component, along with the labels used to classify the ECR documents (namely, "MS", "CR", "AN", "SR", "HP", and "PS", respectively). Although a classification is available in the specific case study of this paper, it is worth noting that often such information may be unavailable.

By superimposing the classification information, the map grid resulting from the training of the SOM is as in Figure 6, where each hexagon reports the classification label of the documents sharing a given BMU (within brackets, the number of associated documents). From Figure 6, it can be observed that the unsupervised clustering produced by the SOM algorithm is quite coherent with the actual classification

Computational Intelligence and Neuroscience 7

Figure 4: Component Planes of two variables chosen as examples. (a) Component Plane related to the variable of index p = 48, corresponding to the term "Antenna". (b) Component Plane of the variable of index p = 195, corresponding to the term "Metal-Sheet". (a) and (b) are linked by position: in each figure, the hexagon in a certain position corresponds to the same map unit. Each Component Plane shows the values of the variable in each map unit using color-coding. The letter "d" means that denormalized values are used, that is, the values in the original data context [8].

Figure 5: (a) Davies-Bouldin Index (DBI) values for number of clusters K in [2, 7]. (b) Clustering result of SOM using K-means with K* = 5 clusters. In each hexagon, the number of documents sharing the same BMU is provided.

given by process operators: in fact, ECR documents sharing the same classification label are included in BMUs belonging to the same cluster. It is worth noting that the actual label is assigned to each document after the SOM unsupervised training has been carried out. From Figure 6, it can also be noted that the ECRs classified either as "PS" or as "HP" are all included in a unique cluster. Furthermore, two documents out of twelve with label "MS" are assigned to clusters that include different labels, namely, "CR", "PS", and "HP". We investigated this misclassification, and we found that the causes


Figure 6: Labeled map grid of SOM. In each hexagon, labels and total number of documents sharing the same BMU. Coloring refers to K-means unsupervised clustering of the SOM.

Table 2: Contingency matrix N.

        MS    CR    AN    SR    HP    PS
        T1    T2    T3    T4    T5    T6     n_k
C1       0     0     8     0     0     0       8
C2       1     0     0     0     9    10      20
C3       1     8     0     0     0     0       9
C4       0     0     0     7     0     0       7
C5      10     0     0     0     0     0      10
m_r     12     8     8     7     9    10    N = 54

of the changes described in these two "MS" documents are quite similar to those contained in documents labeled as "CR", "PS", and "HP".

Given the actual classification, the quality of the obtained clustering can be evaluated by computing four indices: purity, precision, recall, and F-measure [36].

Let T = {T_1, ..., T_r, ..., T_R} be the true partitioning given by the process operators, where the partition T_r consists of all the documents with label r. Let m_r = |T_r| denote the number of documents in true partition T_r. Also, let C = {C_1, ..., C_k, ..., C_K} denote the clustering obtained via the SOM text clustering algorithm, and let n_k = |C_k| denote the number of documents in cluster C_k. The K × R contingency matrix N induced by clustering C and the true partitioning T can be obtained by computing the elements N(k, r) = n_kr = |C_k ∩ T_r|, where n_kr denotes the number of documents that are common to cluster C_k and true partition T_r. The contingency matrix for SOM text clustering of ECRs is reported in Table 2.
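A sketch of how such a contingency matrix can be assembled from the two label assignments (the function and argument names here are illustrative, not taken from the paper's code):

```python
import numpy as np

def contingency_matrix(cluster_of_doc, label_of_doc, clusters, labels):
    # N[k, r] = n_kr = number of documents in cluster k with true label r.
    N = np.zeros((len(clusters), len(labels)), dtype=int)
    for c, t in zip(cluster_of_doc, label_of_doc):
        N[clusters.index(c), labels.index(t)] += 1
    return N

# Tiny illustrative example with two clusters and two true labels.
N = contingency_matrix(
    cluster_of_doc=["C1", "C1", "C2", "C2", "C2"],
    label_of_doc=["AN", "AN", "MS", "MS", "AN"],
    clusters=["C1", "C2"], labels=["MS", "AN"])
# Row sums give the cluster sizes n_k; column sums the partition sizes m_r.
```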

Starting from Table 2, the following indices are computed.

(i) Purity Index. The cluster-specific purity index of cluster C_k is defined as

$$\text{purity}_k = \frac{1}{n_k} \max_{r=1}^{R} n_{kr}. \quad (8)$$

The overall purity index of the entire clustering C is computed as

$$\text{purity} = \sum_{k=1}^{K} \frac{n_k}{N}\,\text{purity}_k = \frac{1}{N} \sum_{k=1}^{K} \max_{r=1}^{R} n_{kr}. \quad (9)$$

As shown in Table 3, clusters C1, C4, and C5 have purity index equal to 1; that is, they contain entities from only one partition. Clusters C2 and C3 gather entities from different partitions, that is, T1, T5, and T6 for cluster C2 and T1, T2 for cluster C3. The overall purity index is equal to 0.79.

(ii) Precision Index. Given a cluster C_k, let T_{r_k} denote the majority partition that contains the maximum number of documents from C_k, that is, r_k = arg max_{r=1,...,R} n_{kr}. The precision index of cluster C_k is given by

$$\text{prec}_k = \frac{1}{n_k} \max_{r=1}^{R} n_{kr} = \frac{n_{k r_k}}{n_k}. \quad (10)$$

For the clustering in Table 2, the majority partitions are T_3 (for cluster C1), T_6 (C2), T_2 (C3), T_4 (C4), and T_1 (C5). The precision indices show that all documents gathered in clusters C1, C4, and C5 belong to the corresponding majority partitions T_3, T_4, and T_1. For cluster C2, 50% of the documents belong to T_6, and, finally, 88% of the documents in cluster C3 belong to T_2.

(iii) Recall Index. Given a cluster C_k, it is defined as

$$\text{recall}_k = \frac{n_{k r_k}}{|T_{r_k}|} = \frac{n_{k r_k}}{m_{r_k}}, \quad (11)$$

where m_{r_k} = |T_{r_k}|. It measures the fraction of documents in partition T_{r_k} shared in common with cluster C_k. The recall indices reported in Table 3 show that clusters C1, C2, C3, and C4 share 100% of the documents in their majority partitions T_3, T_6, T_2, and T_4, respectively. Cluster C5 shares 83% of the documents in T_1.

(iv) F-Measure Index. It is the harmonic mean of the precision and recall values for each cluster. The F-measure for cluster C_k is therefore given as

$$F_k = \frac{2 \cdot \text{prec}_k \cdot \text{recall}_k}{\text{prec}_k + \text{recall}_k} = \frac{2\, n_{k r_k}}{n_k + m_{r_k}}. \quad (12)$$

The overall F-measure for the clustering C is the mean of the clusterwise F-measure values:

$$F = \frac{1}{K} \sum_{k=1}^{K} F_k. \quad (13)$$

Table 3 shows that the F-measure of clusters C1 and C4 is equal to 1, while the other values are less than 1. The low values of the F-measure for clusters C2, C3, and C5 depend on a low precision index for clusters C2 and C3 and on a low recall index for cluster C5. Consequently, the overall F-measure is equal to 0.90.
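Starting from the contingency matrix of Table 2, equations (8)-(13) can be evaluated in a few lines of NumPy; this sketch reproduces the cluster-wise values of Table 3 (the overall purity evaluates to 43/54 ≈ 0.796, reported as 0.79 in the paper).

```python
import numpy as np

# Contingency matrix of Table 2: rows C1..C5, columns MS, CR, AN, SR, HP, PS.
N = np.array([[ 0, 0, 8, 0, 0,  0],
              [ 1, 0, 0, 0, 9, 10],
              [ 1, 8, 0, 0, 0,  0],
              [ 0, 0, 0, 7, 0,  0],
              [10, 0, 0, 0, 0,  0]])

n_k = N.sum(axis=1)                  # cluster sizes n_k
m_r = N.sum(axis=0)                  # partition sizes m_r
r_k = N.argmax(axis=1)               # index of the majority partition per cluster

purity_k = N.max(axis=1) / n_k                 # Eq. (8)
purity = N.max(axis=1).sum() / N.sum()         # Eq. (9)
prec_k = purity_k                              # Eq. (10): same maximum ratio
recall_k = N.max(axis=1) / m_r[r_k]            # Eq. (11)
F_k = 2 * N.max(axis=1) / (n_k + m_r[r_k])     # Eq. (12)
F = F_k.mean()                                 # Eq. (13)
```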


Table 3: Clustering validation.

Cluster-specific and overall indices
Purity:     purity_1 = 1, purity_2 = 0.5, purity_3 = 0.89, purity_4 = 1, purity_5 = 1; purity = 0.79
Precision:  prec_1 = 1, prec_2 = 0.5, prec_3 = 0.89, prec_4 = 1, prec_5 = 1
Recall:     recall_1 = 1, recall_2 = 1, recall_3 = 1, recall_4 = 1, recall_5 = 0.83
F-measure:  F_1 = 1, F_2 = 0.66, F_3 = 0.94, F_4 = 1, F_5 = 0.91; F = 0.90

Table 4: Results of SOM-based classification. The first and second columns report the labels and the number of ECR texts of the actual classification. The third column reports the number of ECRs correctly classified through the label of the first associated BMU. The fourth column reports the number of documents associated with an empty first BMU but correctly classified by considering the label of the documents sharing the second associated BMU.

Labels   Number of ECRs   Classified through the 1st BMU   Classified through the 2nd BMU
MS             12                      10                                2
CR              8                       6                                2
AN              8                       7                                1
SR              7                       6                                1
HP              9                       6                                3
PS             10                      10                                -

Given the actual classification, the SOM can be further validated through a leave-one-out cross validation technique in order to check its classification ability. In particular, N − 1 ECR documents are used for training and the remaining one for testing (iterating until each ECR text in the dataset has been used for testing).

At each iteration, once the SOM has been trained on N − 1 ECRs, the testing sample is presented as input and a BMU is selected in the matching step of the SOM algorithm. The label of the training documents associated with that BMU is considered. In the case of an empty BMU, that is, one not associated with any training documents, the closest unit associated with at least one training document is considered instead, while in the case of a BMU associated with training documents having more than one label, the label with the greater number of documents is considered.

Table 4 shows the results of the leave-one-out cross validation. For each row, that is, for a given ECR label, the second column reports the total number of documents in the dataset, while the last two columns report the number of testing ECRs correctly classified by the SOM. In particular, the third column reports the number of testing ECRs correctly classified because they were connected to a first BMU associated with training documents carrying the same label. The last column refers to the number of ECRs associated with an empty first BMU that nevertheless resulted closest to a second BMU related to documents belonging to the same class as the testing sample. The cross validation study also demonstrates that the labels given by the SOM are coherent with the actual classification and confirms the ability of the SOM as a classification tool.
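The BMU-labeling rule used in the cross validation can be sketched as follows. This is a simplified stand-in: here each unit's label is assumed to be precomputed (majority vote of the training documents mapped to it, or None for an empty unit), and the fallback to the nearest labeled unit follows the description above.

```python
import numpy as np

def classify_by_bmu(prototypes, unit_labels, x):
    # Visit units from closest to farthest; the first unit that is not
    # "empty" (i.e., labeled by at least one training document) provides
    # the predicted class, as described in the text.
    order = np.argsort(np.linalg.norm(prototypes - x, axis=1))
    for i in order:
        if unit_labels[i] is not None:
            return unit_labels[i]
    return None

# Illustrative 3-unit map: the middle unit is empty (no training docs).
prototypes = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
unit_labels = ["MS", None, "AN"]
# The BMU of this test vector is the empty unit, so the prediction falls
# back to the nearest labeled unit.
pred = classify_by_bmu(prototypes, unit_labels, np.array([0.9, 0.1]))
```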

6. Conclusions

In this paper, a real case study regarding the engineering change process in the complex products industry was conducted. The study concerned the postchange stage of the engineering change process, in which past engineering change data are analyzed to discover information exploitable in new engineering changes. In particular, SOM was used for clustering the natural language texts produced during the engineering change process. The analyzed texts included the descriptions of the causes of changes contained in the ECR forms. Firstly, the SOM algorithm was used as a clustering tool to find relationships between the ECR texts and to cluster them accordingly. Subsequently, SOM was tested as a classification tool, and the results were validated through a leave-one-out cross validation technique.

The results of the real case study showed that SOM text clustering can be an effective tool for improving the analysis of the engineering change process. In particular, some of the advantages highlighted in this study are as follows.

(1) Text mining methods allow analyzing unstructured data and deriving high-quality information. The main difficulty in ECR analysis consisted in analyzing natural language texts.

(2) Clustering analysis of the past ECRs stored in the company allows automatically gathering ECRs on the basis of the similarity between documents. When a new change is triggered, the company can quickly focus on


the cluster of interest. Clustering can support the company in knowing whether a similar change was already managed in the past, in analyzing the best solution adopted, and in learning to avoid the mistakes made in the past.

(3) The use of SOM for ECR text clustering allows automatically organizing large document collections. With respect to other clustering algorithms, the main advantage of SOM text clustering is that the similarity of the texts is preserved in the spatial organization of the neurons. The distance among prototypes in the SOM map can therefore be considered as an estimate of the similarity between documents belonging to different clusters. In addition, the SOM can first be computed using a representative subset of old input data. New input can be mapped straight onto the most similar model without recomputing the whole mapping.

Nevertheless, the study showed some limitations of the application of SOM text clustering and classification. A first limitation is linked to natural language texts: the terms contained in different texts may be similar even if the engineering change requests concern different products. The similarity of terms may influence the performance of SOM-based clustering. A second limitation is linked to the use of SOM as a classification method. Classification indeed requires the labeling of a training dataset. This activity requires a deep knowledge of the different kinds of ECRs managed during the engineering change process and may be difficult and time consuming.

In summary, the use of SOM text clustering in engineering change process analysis appears to be a promising direction for further research. A future development of this work will consider the use of SOM text clustering on a larger dataset of ECRs, comparing SOM with other clustering algorithms such as K-means or hierarchical clustering methods. Another direction of future research concerns the analysis of SOM robustness to parameter selection (i.e., the size and structure of the map, and the kinds of learning and neighborhood functions).

Symbols

α(t): Scalar-valued learning-rate factor
N: Contingency matrix
δ(C_k, C_l): Between-cluster distance
Δ(C_k): Within-cluster distance
Δm_i(t): Adjustment of the prototype vector m_i(t)
C: Collection of documents
C_k: Cluster of documents of index k
T: True partition of documents given by process operators
T_r: True partition of documents with label r
T_{r_k}: Majority partition containing the maximum number of documents from C_k
σ(t): Width of the kernel, which corresponds to the radius of N_c(t)
c(x_j): Index of the BMU associated with x_j
d_j: Document of index j
DBI(K): Davies-Bouldin Index for K clusters
df_p: Number of documents in the collection C which contain term p
h_ci(t): Neighborhood function
K: Total number of clusters
K*: Optimum number of clusters
M: Total number of terms in the document collection
m_r: Number of documents in partition T_r
N: Total number of processed documents
N_c(t): Neighborhood around node c
n_k: Number of documents in cluster C_k
n_kr: Number of documents common to cluster C_k and partition T_r
P: Total number of units of SOM
t: Index of time
tf_pj: Term frequency of term p in the jth document
w_pj: Weight associated with term p in the jth document
m_c: Prototype vector associated with the BMU
m_i: Prototype vector of index i
r_c: Position vector of index c
r_i: Position vector of index i
x_j: Numerical feature vector associated with the jth document

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research has been funded by MIUR, ITIA CNR, and Cluster Fabbrica Intelligente (CFI) Sustainable Manufacturing project (CTN01 00163 148175) and the Vis4Factory project (Sistemi Informativi Visuali per i processi di fabbrica nel settore dei trasporti, PON02 00634 3551288). The authors are thankful to the technical and managerial staff of MerMec SpA (Italy) for providing the industrial case study of this research.

References

[1] T. Jarratt, J. Clarkson, and C. E. Eckert, Design Process Improvement, Springer, New York, NY, USA, 2005.

[2] C. Wanstrom and P. Jonsson, "The impact of engineering changes on materials planning," Journal of Manufacturing Technology Management, vol. 17, no. 5, pp. 561-584, 2006.

[3] S. Leng, L. Wang, G. Chen, and D. Tang, "Engineering change information propagation in aviation industrial manufacturing execution processes," The International Journal of Advanced Manufacturing Technology, vol. 83, no. 1, pp. 575-585, 2016.

[4] W.-H. Wu, L.-C. Fang, W.-Y. Wang, M.-C. Yu, and H.-Y. Kao, "An advanced CMII-based engineering change management framework: the integration of PLM and ERP perspectives," International Journal of Production Research, vol. 52, no. 20, pp. 6092-6109, 2014.

[5] T. A. W. Jarratt, C. M. Eckert, N. H. M. Caldwell, and P. J. Clarkson, "Engineering change: an overview and perspective on the literature," Research in Engineering Design, vol. 22, no. 2, pp. 103-124, 2011.

[6] B. Hamraz, N. H. M. Caldwell, and P. J. Clarkson, "A holistic categorization framework for literature on engineering change management," Systems Engineering, vol. 16, no. 4, pp. 473-505, 2013.

[7] H. Ritter, T. Martinetz, and K. Schulten, Neural Computation and Self-Organizing Maps: An Introduction, Addison-Wesley, New York, NY, USA, 1992.

[8] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas, SOM Toolbox for Matlab 5, 2005, http://www.cis.hut.fi/somtoolbox.

[9] E. Fricke, B. Gebhard, H. Negele, and E. Igenbergs, "Coping with changes: causes, findings, and strategies," Systems Engineering, vol. 3, no. 4, pp. 169-179, 2000.

[10] J. Vesanto and E. Alhoniemi, "Clustering of the self-organizing map," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 586-600, 2000.

[11] A. Sharafi, P. Wolf, and H. Krcmar, "Knowledge discovery in databases on the example of engineering change management," in Proceedings of the Industrial Conference on Data Mining, Poster and Industry Proceedings, pp. 9-16, IBaI, July 2010.

[12] F. Elezi, A. Sharafi, A. Mirson, P. Wolf, H. Krcmar, and U. Lindemann, "A Knowledge Discovery in Databases (KDD) approach for extracting causes of iterations in Engineering Change Orders," in Proceedings of the ASME International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pp. 1401-1410, 2011.

[13] A. Sharafi, Knowledge Discovery in Databases: An Analysis of Change Management in Product Development, Springer, 2013.

[14] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, no. 1, pp. 59-69, 1982.

[15] T. Kohonen, "Essentials of the self-organizing map," Neural Networks, vol. 37, pp. 52-65, 2013.

[16] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480, 1990.

[17] Y.-C. Liu, M. Liu, and X.-L. Wang, "Application of self-organizing maps in text clustering: a review," in Applications of Self-Organizing Maps, M. Johnsson, Ed., chapter 10, InTech, Rijeka, Croatia, 2012.

[18] A. Ultsch and H. P. Siemon, "Kohonen's self organizing feature maps for exploratory data analysis," in Proceedings of the International Neural Networks Conference (INNC '90), pp. 305-308, Kluwer, Paris, France, 1990.

[19] J. Vesanto, "SOM-based data visualization methods," Intelligent Data Analysis, vol. 3, no. 2, pp. 111-126, 1999.

[20] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297, 1967.

[21] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224-227, 1979.

[22] N. Yorek, I. Ugulu, and H. Aydin, "Using self-organizing neural network map combined with Ward's clustering algorithm for visualization of students' cognitive structural models about aliveness concept," Computational Intelligence and Neuroscience, vol. 2016, Article ID 2476256, 14 pages, 2016.

[23] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "WEBSOM: self-organizing maps of document collections," in Proceedings of the Workshop on Self-Organizing Maps (WSOM '97), pp. 310-315, Espoo, Finland, June 1997.

[24] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM: self-organizing maps of document collections," Neurocomputing, vol. 21, no. 1-3, pp. 101-117, 1998.

[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513-523, 1988.

[26] G. Salton and R. K. Waldstein, "Term relevance weights in on-line information retrieval," Information Processing & Management, vol. 14, no. 1, pp. 29-35, 1978.

[27] Y. C. Liu, X. Wang, and C. Wu, "ConSOM: a conceptional self-organizing map model for text clustering," Neurocomputing, vol. 71, no. 4-6, pp. 857-862, 2008.

[28] Y.-C. Liu, C. Wu, and M. Liu, "Research of fast SOM clustering for text information," Expert Systems with Applications, vol. 38, no. 8, pp. 9325-9333, 2011.

[29] T. Kohonen, S. Kaski, K. Lagus, et al., "Self organization of a massive document collection," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 574-585, 2000.

[30] J. Zhu and S. Liu, "SOM network based clustering analysis of real estate enterprises," American Journal of Industrial and Business Management, vol. 4, no. 3, pp. 167-173, 2014.

[31] B. C. Till, J. Longo, A. R. Dobell, and P. F. Driessen, "Self-organizing maps for latent semantic analysis of free-form text in support of public policy analysis," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 1, pp. 71-86, 2014.

[32] G. Q. Huang, W. Y. Yee, and K. L. Mak, "Current practice of engineering change management in Hong Kong manufacturing industries," Journal of Materials Processing Technology, vol. 139, no. 1-3, pp. 481-487, 2003.

[33] S. Angers, "Changing the rules of the 'Change' game: employees and suppliers work together to transform the 737/757 production system," 2002, http://www.boeing.com/news/frontiers/archive/2002/may/i_ca2.html.

[34] E. Subrahmanian, C. Lee, and H. Granger, "Managing and supporting product life cycle through engineering change management for a complex product," Research in Engineering Design, vol. 26, no. 3, pp. 189-217, 2015.

[35] D. Zeimpekis and E. Gallopoulos, TMG: A Matlab Toolbox for Generating Term-Document Matrices from Text Collections, 2006, http://scgroup20.ceid.upatras.gr:8000/tmg.

[36] M. J. Zaki and W. Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014.




Figure 1: SOM training algorithm. Each neuron has a prototype vector m_i, which corresponds to a point in the multidimensional input space. An input vector x_j selects the neuron with the closest m_i to it, the winner node (BMU) of index c(x_j) in the bidimensional output space. The weight vectors of the winning output neuron and of its neighbors (neighborhood N_c) are adjusted by the quantity Δm_i(t). Adapted from Ritter et al. [7].

postchange stage involves fewer publications and deals with the ex post facto exploration of the effects of implemented engineering changes. The analysis of the engineering change process belongs to the postchange stage, and there are only few approaches concerning the analysis of engineering change data in the complex products industry. In this context, one of the main challenges is dealing with the free-form text contained in engineering change documents, which makes the data more difficult to query, search, and extract. This paper focuses on the unstructured data contained in ECRs and proposes text clustering for the postchange analysis of the engineering change process.

Text clustering is an unsupervised learning method where similar documents are grouped into clusters. The goal is to create clusters that are coherent internally but clearly different from each other. Among the clustering methods proposed in the literature, the Self-Organizing Map (SOM) has attracted many researchers in recent years. SOM is a neural-network model and algorithm that implements a characteristic nonlinear projection from the high-dimensional space of input signals onto a low-dimensional regular grid, which can be effectively utilized to visualize and explore properties of the data [10]. With respect to other text clustering methods, SOM allows visualizing the similarity between documents within the low-dimensional grid. Hence, similar documents may be found in neighboring regions of the grid.

In the literature, text mining methods have been proposed in support of the engineering change process by Sharafi et al. [11], Elezi et al. [12], and Sharafi [13]. In particular, Sharafi et al. [11] focused on the causes of changes contained in ECRs and calculated term occurrences for all ECRs in order to analyze the occurrences of keywords in different projects and to find patterns in the data. Elezi et al. [12] employed a semiautomatic text mining process to classify the causes of iteration in engineering changes. As a result, cost and technical categories of causes were identified as the main reasons for the occurrence of iterations. Sharafi [13] applied Knowledge Discovery in Databases methods to analyze historical engineering change data in order to gain insights in the form of patterns within the database. In detail, a part of the study concerned the application of K-means, K-Medoids, DBSCAN, and Support Vector Clustering methods to cluster the ECR documents of an automotive manufacturer.

This paper explores the use of SOM for the problem of unsupervised clustering of ECR documents. A case study is presented in which ECRs collected during the engineering change process of a railway industry are analyzed. The results show that SOM text clustering has great potential to improve overall knowledge reuse and exploitation in an engineering change process.

The remainder of the paper is organized as follows. In Section 2, the basic concepts of the SOM theory are introduced. In Section 3, the SOM text based clustering method is described. In Section 4, the engineering change process in industry is described. In Section 5, the case study and the experimental results are both discussed. In Section 6, conclusions are given.

2. The SOM Algorithm

The SOM, originally proposed by Kohonen [14], is based on the idea that systems can be designed to emulate the collective cooperation of the neurons in the human brain. It is an unsupervised machine learning method widely used in data mining, visualization of complex data, image processing, speech recognition, process control, diagnostics in industry and medicine, and natural language processing [15].

The algorithm of SOM consists in mapping M-dimensional input vectors x_j to two-dimensional neurons or maps according to their characteristic features. It reduces the dimensions of the data to a map, helps to understand high-dimensional data, and groups similar data together. A simple SOM consists of two layers. The first includes the nodes in the input space and the second the nodes in the output space. A representation of SOM with output nodes in a two-dimensional grid view is provided in Figure 1. The SOM consists of P units; each unit of index i is associated with an M-dimensional prototype vector m_i in the input space and a


position vector r_i on a low-dimensional regular grid in the output space. The steps of the SOM learning algorithm are as follows.

(1) Initialization. Start with initial values of the prototype vectors m_i. In the absence of any prior information, the values of the prototype vectors m_i can be random or linear, and they are adjusted while the network learns.

(2) Sampling. Select a vector x_j from the training input space. The selection of x_j can be random.

(3) Matching. Determine the Best Matching Unit (BMU). Vector x_j is compared with all the prototype vectors, and the index c(x_j) of the BMU, that is, the prototype vector m_c which is closest to x_j, is chosen according to the smallest Euclidean distance as follows:

$$\|x_j - m_c\| = \min_i \|x_j - m_i\|. \quad (1)$$

(4) Updating. Update the BMU and its neighbors. The prototype vectors of the winning output neuron and its neighbors are adjusted as

$$m_i(t+1) = m_i(t) + \Delta m_i(t), \quad (2)$$

where t = 0, 1, 2, ... is an index of the time. The value of Δm_i(t) in (2) is computed as follows:

$$\Delta m_i(t) = \alpha(t)\, h_{ci}(t)\, \left(x_j(t) - m_i(t)\right), \quad (3)$$

where α(t) is the learning-rate factor and h_ci(t) the neighborhood function. In particular, the learning-rate factor α(t) lies in [0, 1] and is monotonically decreasing during the learning phase. The neighborhood function h_ci(t) depends on the distance between the nodes of indexes c and i in the output layer grid. A widely applied neighborhood kernel can be written in terms of the Gaussian function

$$h_{ci}(t) = \exp\left(-\frac{\|r_c - r_i\|^2}{2\sigma^2(t)}\right), \quad (4)$$

where r119888 and r119894 are the position vectors of nodes 119888and 119894 and the parameter 120590(119905) defines the width ofthe kernel which corresponds to the radius of theneighborhood 119873119888(119905) 119873119888(119905) refers to a neighborhoodset of array points around node of index 119888 (Figure 1)The value ℎ119888119894(119905) decreases during learning from aninitial value often comparable to the dimension of theoutput layer grid to a value equal to one

During the learning of the SOM, phases (2)–(4) are repeated for a number of successive iterations, until the prototype vectors m_i represent, as much as possible, the input patterns x_j that are closer to the neurons in the two-dimensional map. After initialization, the SOM can be trained in a sequential or batch manner [8]. Like batch training, sequential training is iterative, but instead of presenting all data vectors to the map for weight adjustment, one data vector at a time is sent to the network. Once the SOM is trained, each input vector is mapped to one neuron of the map, reducing the high-dimensional input space to a low-dimensional output space. The map size depends on the type of application: a bigger map reveals more details of information, whereas a smaller map is chosen to guarantee a better generalization capability.
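The learning steps (1)–(4) above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not the SOM Toolbox used later in the paper: the grid shape, the iteration count, and the linearly decaying schedules for α(t) and σ(t) are assumptions made for the sketch.

```python
import numpy as np

def train_som(X, grid_shape=(7, 5), n_iter=500, seed=0):
    """Sequential SOM learning, steps (1)-(4); schedules are illustrative."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    rows, cols = grid_shape
    # (1) Initialization: random prototype vectors m_i
    m = rng.random((rows * cols, M))
    # Position vectors r_i of the units on the regular output grid
    r = np.array([(a, b) for a in range(rows) for b in range(cols)], dtype=float)
    for t in range(n_iter):
        # Assumed linearly decaying learning rate alpha(t) and kernel width sigma(t)
        alpha = 0.5 * (1.0 - t / n_iter)
        sigma = max(rows, cols) * (1.0 - t / n_iter) + 1.0
        # (2) Sampling: pick a random input vector x_j
        x = X[rng.integers(N)]
        # (3) Matching: BMU index c from the smallest Euclidean distance, eq. (1)
        c = int(np.argmin(np.linalg.norm(x - m, axis=1)))
        # (4) Updating: Gaussian neighborhood of eq. (4), update rule of eqs. (2)-(3)
        h = np.exp(-np.sum((r[c] - r) ** 2, axis=1) / (2.0 * sigma ** 2))
        m += alpha * h[:, None] * (x - m)
    return m, r

# Usage on random stand-in data shaped like the paper's ECR dataset (54 x 361)
X = np.random.default_rng(1).random((54, 361))
m, r = train_som(X)
bmus = np.argmin(np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2), axis=1)
```

After training, `bmus` maps each input vector to the index of its most similar prototype, mirroring the matching step described above.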

Before its application, the SOM method requires predefining the size and structure of the network, the neighborhood function, and the learning function. These parameters are generally selected on the basis of heuristic information [7, 8, 16, 17].

2.1. SOM Cluster Visualization. The SOM is an extremely versatile tool for visualizing high-dimensional data in low dimension. For the visualization of a SOM, both the unified-distance matrix (U-matrix) [18] and the Component Planes [19] are used. The U-matrix calculates the distances between neighboring map units; these distances can be visualized, using a color scale on the map, to represent clusters.

The U-matrix technique is a single plot that shows the cluster borders according to the dissimilarities between neighboring units. The distance ranges of the U-matrix visualized on the map are represented by different colors (or gray shades). Red colors correspond to large distances, that is, large gaps exist between the prototype vector values in the input space; blue colors correspond to small distances, that is, the map units are strongly clustered together. U-matrices are useful tools for visualizing clusters in input data without having any a priori information about the clusters.

Another important visualization tool is the Component Plane, that is, a grid whose cells contain the value of the p-th dimension of a prototype vector, displayed by variation of color. It helps to analyze the contribution of each variable to the cluster structures and the correlation between the different variables in the dataset.
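As an illustration, the distances underlying a U-matrix can be computed directly from the trained prototype vectors. The sketch below is an assumption of this text, not code from the paper: it uses a rectangular 4-neighborhood, whereas SOM software typically works on the hexagonal lattice used later in the case study.

```python
import numpy as np

def u_matrix(m, grid_shape):
    """Mean distance from each unit's prototype to its grid neighbors
    (rectangular 4-neighborhood; hexagonal lattices need a different stencil)."""
    rows, cols = grid_shape
    proto = m.reshape(rows, cols, -1)
    U = np.zeros((rows, cols))
    for a in range(rows):
        for b in range(cols):
            dists = [np.linalg.norm(proto[a, b] - proto[a + da, b + db])
                     for da, db in ((-1, 0), (1, 0), (0, -1), (0, 1))
                     if 0 <= a + da < rows and 0 <= b + db < cols]
            U[a, b] = np.mean(dists)
    return U

# Usage on random stand-in prototypes for a 7 x 5 map
proto = np.random.default_rng(0).random((35, 361))
U = u_matrix(proto, (7, 5))
```

Large values of `U` mark borders between clusters; small values mark units that are strongly clustered together, exactly as the color coding of the U-matrix conveys.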

2.2. SOM Clustering Using K-Means Algorithm. One of the drawbacks of SOM analysis is that, unlike other cluster methods, the SOM has no distinct cluster boundaries. When datasets become more complex, it is not easy to distinguish the clusters by pure visualization. As described in Vesanto and Alhoniemi [10], the SOM prototype nodes can be used for clustering instead of the whole input dataset. Let C_1, ..., C_k, ..., C_K denote a cluster partition composed of K clusters. The choice of the best clustering can be determined by applying the well-known K-means algorithm [20]. This algorithm minimizes an error function computed on the sum of squared distances of each data point in each cluster; it iteratively computes a partitioning of the data and updates the cluster centers based on the error function. In this approach, the number of clusters K has to be fixed a priori. Therefore, the K-means algorithm is run multiple times for each K ∈ [2, √N], where N is the number of samples. The best number of clusters K* can be selected based on the Davies-Bouldin Index (DBI) [21]. This index is based on a ratio of within-cluster and between-cluster distances and is calculated as

DBI(K) = (1/K) Σ_{k=1..K} max_{l≠k} [(Δ(C_k) + Δ(C_l)) / δ(C_k, C_l)],  (5)


where K is the number of clusters, and Δ(C_k), Δ(C_l), and δ(C_k, C_l) are the within-cluster and between-cluster distances, respectively. The optimum number of clusters K* corresponds to the minimum value of DBI(K). A SOM neural network combined with other clustering algorithms was used in Yorek et al. [22] for the visualization of students' cognitive structural models.
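The selection of K* by minimizing DBI(K) over K ∈ [2, √N] can be sketched as follows. Both the plain K-means and the DBI of eq. (5) below are minimal illustrative implementations (the paper itself relies on MATLAB toolboxes), and taking the within-cluster distance as the mean distance of members to their center is an assumption of the sketch.

```python
import numpy as np

def kmeans(X, K, n_iter=50, seed=0):
    """Plain K-means on the prototype vectors (illustrative, fixed iterations)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(K)])
    return labels, centers

def dbi(X, labels, centers):
    """Davies-Bouldin Index, eq. (5), with the within-cluster distance taken
    as the mean distance of cluster members to their center (an assumption)."""
    K = len(centers)
    spread = np.array([np.linalg.norm(X[labels == k] - centers[k], axis=1).mean()
                       if np.any(labels == k) else 0.0 for k in range(K)])
    worst = [max((spread[k] + spread[l]) / np.linalg.norm(centers[k] - centers[l])
                 for l in range(K) if l != k) for k in range(K)]
    return float(np.mean(worst))

# Toy prototype set: three well-separated groups of 12 two-dimensional points
prototypes = np.vstack([np.random.default_rng(k).normal(4 * k, 0.3, (12, 2))
                        for k in range(3)])
N = len(prototypes)
scores = {K: dbi(prototypes, *kmeans(prototypes, K))
          for K in range(2, int(np.sqrt(N)) + 1)}
K_star = min(scores, key=scores.get)  # K* minimizes DBI(K)
```

In the case study below the same scheme is applied to the SOM prototype vectors rather than to the raw documents.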

3. SOM-Based Text Clustering

Text clustering is an unsupervised process used to separate a document collection into clusters on the basis of the similarity relationships between the documents in the collection [17]. Let C = {d_1, ..., d_j, ..., d_N} be a collection of N documents to be clustered. The purpose of text clustering is to divide C into clusters C_1, ..., C_k, ..., C_K, with C_1 ∪ ... ∪ C_k ∪ ... ∪ C_K = C.

SOM text clustering can be divided into two main phases [23, 24]. The first phase is document preprocessing, which consists in using the Vector Space Model (VSM) to generate output document vectors from the input text documents. The second one is document clustering, which applies the SOM to the generated document vectors to obtain the output clusters.

3.1. Document Preprocessing. An important preprocessing aspect for text clustering is to consider how the text content can be represented in the form of a mathematical expression for further analysis and processing.

By means of the VSM, each document d_j (j = 1, ..., N) can be represented as a vector in an M-dimensional space. In detail, each document d_j can be represented by a numerical feature vector x_j:

x_j = [w_{1j}, ..., w_{pj}, ..., w_{Mj}].  (6)

Each element w_{pj} of the vector usually represents a word (or a group of words) of the document collection, that is, the size of the vector is defined by the number of words (or groups of words) of the complete document collection.

The simplest approach is to assign to each document the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme [25, 26]. The TF-IDF weighting scheme assigns to each term p in the j-th document a weight w_{pj} computed as

w_{pj} = tf_{pj} × log(N / df_p),  (7)

where tf_{pj} is the term frequency, that is, the number of times that term p appears in document j, and df_p is the number of documents in the collection which contain term p.

According to the TF-IDF weighting scheme, w_{pj} is

(1) higher when the term p occurs many times within a small number of documents (thus lending high discriminating power to those documents);

(2) lower when the term p occurs fewer times in a document or occurs in many documents (thus offering a less pronounced relevance signal);

(3) lower when the term p occurs in virtually all documents.
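A small worked example of the TF-IDF scheme of eq. (7) is given below; the natural logarithm is used here (the base is not specified in eq. (7)), and the three toy tokenized documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights of eq. (7): w_pj = tf_pj * log(N / df_p) (natural log)."""
    N = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # df_p: number of documents containing term p
    df = {p: sum(1 for doc in docs if p in doc) for p in vocab}
    return vocab, [[Counter(doc)[p] * math.log(N / df[p]) for p in vocab]
                   for doc in docs]

# Three hypothetical tokenized documents
docs = [["antenna", "cable", "antenna"],
        ["metal", "sheet"],
        ["cable", "sheet", "sheet"]]
vocab, X = tfidf(docs)
```

For instance, "antenna" appears twice in the first document and in one document out of three, so its weight there is 2 · log(3), while "cable", which appears in two of the three documents, receives the smaller weight log(3/2).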

Before weighting the documents by the TF-IDF scheme, the size of the list of terms created from the documents can be reduced using stop word removal and stemming [23, 24].

In text-based documents, in fact, there are a great number of noninformative words, such as articles, prepositions, and conjunctions, called stop words. A stop-list is usually built with the words that should be filtered out in the document representation process. The words to be included in the stop-list are language and task dependent; however, a set of general words can be considered stop words for almost all tasks, such as "and" and "or". Words that appear in very few documents are also filtered out.

Another common preprocessing phase is stemming, where the word stem is derived from the occurrence of a word by removing case and inflection information. For example, "computes," "computing," and "computer" are all mapped to the same stem "comput." Stemming does not significantly alter the information included in the document representation, but it does avoid feature expansion.
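The two reduction steps can be illustrated as follows. The stop-list and the suffix list are hypothetical toy choices, and the one-pass suffix stripping is a crude stand-in for a real stemming algorithm such as Porter's (or an Italian stemmer, as the case study would require).

```python
STOP_WORDS = {"the", "a", "of", "and", "or", "to", "in"}  # hypothetical stop-list
SUFFIXES = ("ing", "es", "er", "s")  # toy suffixes; not a real stemming algorithm

def preprocess(text):
    """Tokenize, drop stop words, then crudely strip one suffix per term."""
    stems = []
    for w in text.lower().split():
        if w in STOP_WORDS:
            continue
        for suf in SUFFIXES:
            # Keep a minimum stem length so short words are left intact
            if w.endswith(suf) and len(w) - len(suf) >= 4:
                w = w[: -len(suf)]
                break
        stems.append(w)
    return stems

stems = preprocess("The computer and computing of computes")
```

On this toy sentence the three inflected forms collapse onto the single stem "comput", as in the example above.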

3.2. SOM Text Clustering. Once the feature vector x_j in (6) associated with each text d_j is obtained, the SOM algorithm described in Section 2 can be applied for text clustering. The text clustering method explained above is known as "SOM plus VSM"; other variants of it have been proposed by Liu et al. [27, 28]. An overview of the application of SOM in text clustering is provided by Liu et al. [17]. This kind of clustering method has been employed in domains such as patent analysis [29], financial services [30], and public policy analysis [31].

4. The Engineering Change Process in Complex Products Industry

For complex products, such as trains, automobiles, or aircraft, engineering changes are unavoidable, and products or components have to be redesigned and retrofitted to accommodate the new changes to new installations and products. In these environments, an engineering change can involve the risk of due-date delay. Huang et al. [32] carried out a survey on the effects of engineering changes in four manufacturing industries and found that the time invested in processing an engineering change varies from 2 to 36 person-days. In Angers [33], it is estimated that more than 35% of today's manufacturing resources are devoted just to managing changes to engineering drawings, manufacturing plans, and scheduling requirements. Engineering change processes in complex environments, such as the automobile, train, and aviation industries, were also studied by Leng et al. [3] and Subrahmanian et al. [34].

The phases of a real engineering change process in complex products industry can be summarized as follows (Figure 2):

(1) A request for an engineering change is made and sent to an engineering manager. In this phase, standard ECR forms are used, outlining the reason for the change, the type of the change, which components or systems are likely to be affected, the person and the department making the request, and so forth.

[Figure 2: The real engineering change process in complex products environments: (1) engineering change request (the change trigger), (2) identification of possible solution(s) to the change request, (3) technical evaluation of the change, (4) economic evaluation of the change, (5) selection and approval of a solution, (6) implementation of the solution, and (7) updating of the as-built documents; phases (1)-(4) occur before approval, phase (5) during approval, and phases (6)-(7) after approval. Adapted from Jarratt et al. [5].]

(2) Potential solutions to the request for change are identified.

(3) A technical evaluation of the change is carried out. In this phase, the technical impact of implementing each solution is assessed. Various factors are considered, for example, the impact upon design and product requirements, the production schedule, and the resources to be devoted.

(4) An economic evaluation of the change is performed. The economic risk of implementing each solution is assessed. In this phase, the costs related to extra production times, replacements of materials, penalties for missed due dates, and so forth are estimated.

(5) Once a particular solution is selected, it is approved or rejected. The change is reviewed and a cost-benefit analysis is carried out. When a solution is approved, the engineering change order is prepared and issued.

(6) The engineering change is implemented, and the documents, such as drawings, that are to be updated are identified.

(7) The as-built documents are updated. As-built documents are usually the original design documents, revised to reflect any changes made during the process, that is, design changes, material changes, and so forth.

Iterations of the process occur, for example, when a particular solution has a negative impact on product requirements or is too risky to be implemented, so the process returns to phase 2 and another solution is identified. Another iteration is possible when the costs of a solution are too high or more risk analysis is required, or when the proposed solution is completely refused.

As shown in Figure 2, no review process of similar changes faced in the past is carried out during the process or at the end. This aspect is emphasized by Jarratt et al. [5], who highlight that, after a period of time, a change should be reviewed to verify whether it achieved what was initially intended and what lessons can be learned for future change processes. Various factors can discourage examining the solutions adopted in the past for a particular change. First of all, there is the lack of suitable methods to analyze the documents collected during the process, that is, the ECRs. ECRs are often forms containing parts written in natural language. Analyzing these kinds of documents in the design phase of a product or a component, or when a new change request occurs, could be very time consuming without an appropriate solution.

In this context, the application of SOM text clustering can improve the process. When a new ECR occurs, in fact, the ECRs managed in the past that are similar to the current request could be analyzed in order to evaluate the best solution and to avoid repeating the mistakes made in the past. To explore the existence of similarities between the different ECR texts, the first step is to verify the potential clustering present in the analyzed dataset. The application of SOM text clustering to ECR documents is explored in the next section.

5. The Use of SOM Text Clustering for ECRs Analysis

In order to test SOM text clustering, we used a dataset of N = 54 ECRs representing engineering changes managed during the engineering change process of a railway company. The dataset included the natural language written descriptions of the causes of the changes contained in the ECR forms. The documents were written in Italian, and the VSM described in Section 3 was used to generate output vectors from the input text documents. The number of terms, that is, the dimension of each vector x_j associated with each document in the dataset after the stop word removal and stemming processes, was equal to M = 361.

[Figure 3: U-matrix of the 54 ECR texts preprocessed through the VSM. The color scale is related to the distances between map units: red colors represent large distances and blue colors represent small distances.]

In our work, we used the MATLAB software. Specifically, the Text to Matrix Generator (TMG) toolbox [35] was employed for document preprocessing, and the SOM Toolbox [8] for SOM text clustering.

The map size of the SOM is computed through the heuristic formula in Vesanto et al. [8]. In detail, the number of neurons is computed as 5√N, where N is the number of training samples. The map shape is a rectangular grid with a hexagonal lattice. The neighborhood function is Gaussian, and the map is trained using the batch version of SOM. After the map is trained, each data vector is mapped to the most similar prototype vector in the map, that is, the BMU, which results from the matching step of the SOM algorithm. In our case, the network structure is a 2D lattice of 7 × 5 hexagons.

A first step in SOM-based cluster analysis relies on visual inspection through the U-matrix and the Component Planes.

The U-matrix obtained from the application of SOM to the ECR dataset is shown in Figure 3. In order to represent additional information (i.e., distances), the SOM map size is augmented by inserting an additional neuron between each pair of neurons and reporting the distance between the two neurons. The U-matrix, with the different colors linked to the distance between neurons on the map, shows that five clusters are present in the dataset. Each cluster has been highlighted by a circle.

Table 1: Class label and number of ECR texts.

    Class                     Label   Number of ECR texts
    (1) Metal-Sheet           MS      12
    (2) Carter                CR      8
    (3) Antenna               AN      8
    (4) Semi-Finished Round   SR      7
    (5) Hydraulic Panel       HP      9
    (6) Pneumatic System      PS      10

Besides looking at the overall differences in the U-matrix, it is interesting as well to look at the differences between the individual components of the input vectors, that is, the differences regarding each component associated with a single "term" in the input dataset. The total number of Component Planes obtained from the SOM corresponds to the total number of terms in our dataset, that is, M = 361. For illustrative purposes, Figure 4 shows two Component Planes chosen as an example. The first Component Plane (Figure 4(a)) is associated with the term of index p = 48, that is, the term "Antenna"; the second one (Figure 4(b)) is related to the term of index p = 195, that is, the term "Metal-Sheet". The difference between the two terms in the dataset can be represented by considering, for example, the map unit in the top left corner of the two figures. This map unit has high values for the variable "Term p = 48" and low values for the variable "Term p = 195". From the observation of the Component Planes, we can conclude that there is no correlation between the two terms. As a matter of fact, these two terms were never used together in the same documents.

As shown above, the U-matrix and the Component Planes allow obtaining a rough cluster structure of the ECR dataset. To get a finer clustering result, the prototype vectors from the map were clustered using the K-means algorithm. The best number of clusters in the SOM map grid can be determined by using the DBI values. Figure 5(a) shows the DBI values for the number of clusters K varying in [2, 7]. The elbow point in the graph shows that the optimum number of clusters K*, corresponding to the minimum value of DBI(K), is equal to 5. The clustering result of the SOM obtained by using K-means with K* = 5 clusters is shown in Figure 5(b). The BMUs belonging to each cluster are represented with a different color, and in each hexagon the number of documents associated with each BMU is provided.

5.1. External Validation of SOM Text Clustering. In the reference case study, the true classification of each ECR in the dataset was provided by the process operators. Therefore, this information was used in our study in order to perform an external validation of the SOM text clustering. In particular, each ECR text was analyzed and classified with reference to the main component involved in the engineering change (namely, "Metal-Sheet," "Carter," "Antenna," "Semi-Finished Round," "Hydraulic Panel," and "Pneumatic System"). Table 1 reports the number of ECRs related to each component, along with the labels used in order to classify the ECR documents (namely, "MS," "CR," "AN," "SR," "HP," and "PS," respectively). Although a classification is available in the specific case study of this paper, it is worth noting that often such information may be unavailable.

By superimposing the classification information, the map grid resulting from the training of the SOM is as in Figure 6, where each hexagon reports the classification label of the documents sharing a given BMU (within brackets, the number of associated documents). From Figure 6, it can be observed that the unsupervised clustering produced by the SOM algorithm is quite coherent with the actual classification


[Figure 4: Component Planes of two variables chosen as examples. (a) Component Plane related to the variable of index p = 48, corresponding to the term "Antenna". (b) Component Plane of the variable of index p = 195, corresponding to the term "Metal-Sheet". (a) and (b) are linked by position: in each figure, the hexagon in a certain position corresponds to the same map unit. Each Component Plane shows the values of the variable in each map unit using color coding. The letter "d" means that denormalized values are used, that is, the values in the original data context [8].]

[Figure 5: (a) Davies-Bouldin Index (DBI) values for the number of clusters K in [2, 7]. (b) Clustering result of the SOM using K-means with K* = 5 clusters. In each hexagon, the number of documents sharing the same BMU is provided.]

given by the process operators; in fact, ECR documents sharing the same classification label are included in BMUs belonging to the same cluster. It is worth noting that the actual label is assigned to each document after the SOM unsupervised training has been carried out. From Figure 6, it can also be noted that the ECRs classified either as "PS" or as "HP" are all included in a unique cluster. Furthermore, two documents out of the twelve with label "MS" are assigned to clusters that include different labels, namely, "CR," "PS," and "HP." We investigated this misclassification, and we found that the causes


[Figure 6: Labeled map grid of the SOM. In each hexagon, the labels and the total number of documents sharing the same BMU. Coloring refers to the K-means unsupervised clustering of the SOM.]

Table 2: Contingency matrix N.

          MS    CR    AN    SR    HP    PS
          T1    T2    T3    T4    T5    T6    n_k
    C1     0     0     8     0     0     0     8
    C2     1     0     0     0     9    10    20
    C3     1     8     0     0     0     0     9
    C4     0     0     0     7     0     0     7
    C5    10     0     0     0     0     0    10
    m_r   12     8     8     7     9    10    N = 54

of the changes described in these two "MS" documents are quite similar to those contained in the documents labeled as "CR," "PS," and "HP."

Given the actual classification, the quality of the obtained clustering can be evaluated by computing four indices: purity, precision, recall, and F-measure [36].

Let T = {T_1, ..., T_r, ..., T_R} be the true partitioning given by the process operators, where the partition T_r consists of all the documents with label r. Let m_r = |T_r| denote the number of documents in the true partition T_r. Also, let C = {C_1, ..., C_k, ..., C_K} denote the clustering obtained via the SOM text clustering algorithm, and let n_k = |C_k| denote the number of documents in cluster C_k. The K × R contingency matrix N induced by the clustering C and the true partitioning T can be obtained by computing the elements N(k, r) = n_{kr} = |C_k ∩ T_r|, where n_{kr} denotes the number of documents that are common to cluster C_k and true partition T_r. The contingency matrix for the SOM text clustering of the ECRs is reported in Table 2.

Starting from Table 2, the following indices are computed.

(i) Purity Index. The cluster-specific purity index of cluster C_k is defined as

purity_k = (1/n_k) max_{r=1..R} n_{kr}.  (8)

The overall purity index of the entire clustering C is computed as

purity = Σ_{k=1..K} (n_k / N) purity_k = (1/N) Σ_{k=1..K} max_{r=1..R} n_{kr}.  (9)

As shown in Table 3, clusters C1, C4, and C5 have a purity index equal to 1, that is, they contain entities from only one partition. Clusters C2 and C3 gather entities from different partitions, that is, T1, T5, and T6 for cluster C2, and T1 and T2 for cluster C3. The overall purity index is equal to 0.79.

(ii) Precision Index. Given a cluster C_k, let T_{r_k} denote the majority partition, that is, the partition that contains the maximum number of documents from C_k, so that r_k = arg max_{r=1..R} n_{kr}. The precision index of a cluster C_k is given by

prec_k = (1/n_k) max_{r=1..R} n_{kr} = n_{k r_k} / n_k.  (10)

For the clustering in Table 2, the majority partitions are T3 (for C1), T6 (for C2), T2 (for C3), T4 (for C4), and T1 (for C5). The precision indices show that all the documents gathered in clusters C1, C4, and C5 belong to the corresponding majority partitions T3, T4, and T1. For cluster C2, 50% of the documents belong to T6, and, finally, 88% of the documents in cluster C3 belong to T2.

(iii) Recall Index. Given a cluster C_k, it is defined as

recall_k = n_{k r_k} / |T_{r_k}| = n_{k r_k} / m_{r_k},  (11)

where m_{r_k} = |T_{r_k}|. It measures the fraction of documents in partition T_{r_k} shared in common with cluster C_k. The recall indices reported in Table 3 show that clusters C1, C2, C3, and C4 share in common 100% of the documents in the majority partitions T3, T6, T2, and T4, respectively. Cluster C5 shares 83% of the documents in T1.

(iv) F-Measure Index. It is the harmonic mean of the precision and recall values of each cluster. The F-measure for cluster C_k is therefore given as

F_k = (2 · prec_k · recall_k) / (prec_k + recall_k) = 2 n_{k r_k} / (n_k + m_{r_k}).  (12)

The overall F-measure for the clustering C is the mean of the cluster-wise F-measure values:

F = (1/K) Σ_{k=1..K} F_k.  (13)

Table 3 shows that the F-measure of clusters C1 and C4 is equal to 1, while the other values are less than 1. The low values of the F-measure for clusters C2, C3, and C5 depend on a low precision index for clusters C2 and C3 and on a low recall index for cluster C5. Consequently, the overall F-measure is equal to 0.90.
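Starting from the contingency matrix of Table 2, the four validation indices of eqs. (8)-(13) can be reproduced with a few NumPy reductions (a sketch of this text, not the authors' code):

```python
import numpy as np

# Contingency matrix N from Table 2: rows are clusters C1..C5, columns T1..T6
Nmat = np.array([[0, 0, 8, 0, 0, 0],
                 [1, 0, 0, 0, 9, 10],
                 [1, 8, 0, 0, 0, 0],
                 [0, 0, 0, 7, 0, 0],
                 [10, 0, 0, 0, 0, 0]])

n_k = Nmat.sum(axis=1)       # cluster sizes n_k
m_r = Nmat.sum(axis=0)       # true partition sizes m_r
n_max = Nmat.max(axis=1)     # majority counts n_{k r_k}
r_k = Nmat.argmax(axis=1)    # majority partition index r_k for each cluster

purity = n_max.sum() / Nmat.sum()    # overall purity, eq. (9)
prec = n_max / n_k                   # precision per cluster, eq. (10)
recall = n_max / m_r[r_k]            # recall per cluster, eq. (11)
F_k = 2 * n_max / (n_k + m_r[r_k])   # F-measure per cluster, eq. (12)
F = F_k.mean()                       # overall F-measure, eq. (13)
```

This yields an overall purity of 43/54 ≈ 0.796 (reported as 0.79 in Table 3) and an overall F-measure of about 0.90, matching the values discussed above.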


Table 3: Clustering validation (cluster-specific and overall indices).

    Purity:    purity_1 = 1, purity_2 = 0.5, purity_3 = 0.89, purity_4 = 1, purity_5 = 1; overall purity = 0.79
    Precision: prec_1 = 1, prec_2 = 0.5, prec_3 = 0.89, prec_4 = 1, prec_5 = 1
    Recall:    recall_1 = 1, recall_2 = 1, recall_3 = 1, recall_4 = 1, recall_5 = 0.83
    F-measure: F_1 = 1, F_2 = 0.66, F_3 = 0.94, F_4 = 1, F_5 = 0.91; overall F = 0.90

Table 4: Results of SOM-based classification. The first two columns report the labels and the number of ECR texts of the actual classification; the third column reports the number of ECRs correctly classified through the label of the first associated BMU; the fourth column reports the number of documents associated with an empty first BMU but correctly classified by considering the label of the documents sharing the second associated BMU.

    Labels   Number of ECRs   Classified via 1st BMU   Classified via 2nd BMU
    MS       12               10                       2
    CR       8                6                        2
    AN       8                7                        1
    SR       7                6                        1
    HP       9                6                        3
    PS       10               10                       -

Given the actual classification, the SOM can be further validated through a leave-one-out cross validation technique, in order to check its classification ability. In particular, N − 1 ECR documents are used for training and the remaining one for testing (iterating until each ECR text in the dataset has been used for testing).

At each iteration, once the SOM has been trained on N − 1 ECRs and the testing sample is presented as input, a BMU is selected in the matching step of the SOM algorithm, and the label of the training documents associated with that BMU is considered. In the case of an empty BMU, that is, one which is not associated with any training documents, the closest unit associated with at least one training document is considered instead, while in the case of a BMU associated with training documents carrying more than one label, the label with the greater number of documents is considered.
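The classification rule just described can be sketched as follows. The two-dimensional prototypes and labeled training documents below are hypothetical toy data, and `classify` implements the majority-label rule with the empty-BMU fallback, as an illustration rather than the authors' code:

```python
import numpy as np
from collections import Counter

def bmu(x, m):
    """Index of the Best Matching Unit for input x among prototypes m."""
    return int(np.argmin(np.linalg.norm(x - m, axis=1)))

def classify(x_test, m, X_train, y_train):
    """Majority label of the test sample's BMU; if the BMU is empty,
    fall back to the closest unit associated with training documents."""
    unit_labels = {}
    for x, y in zip(X_train, y_train):
        unit_labels.setdefault(bmu(x, m), []).append(y)
    c = bmu(x_test, m)
    if c not in unit_labels:  # empty BMU: no training document mapped here
        c = min(unit_labels, key=lambda u, c0=c: np.linalg.norm(m[u] - m[c0]))
    return Counter(unit_labels[c]).most_common(1)[0][0]

# Hypothetical toy prototypes and labeled training documents
m = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
X_train = np.array([[0.1, 0.0], [5.0, 5.1]])
y_train = ["A", "B"]
label_direct = classify(np.array([4.8, 5.0]), m, X_train, y_train)    # BMU has documents
label_fallback = classify(np.array([1.1, 1.2]), m, X_train, y_train)  # empty BMU
```

The second call exercises the fallback path: the test sample's BMU holds no training documents, so the label comes from the closest non-empty unit, as in the fourth column of Table 4.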

Table 4 shows the results of the leave-one-out cross validation. For each row, that is, for a given ECR label, the second column reports the total number of documents in the dataset, while the last two columns report the number of testing ECRs correctly classified by the SOM. In particular, the third column reports the number of testing ECRs correctly classified because they were connected to a first BMU associated with training documents carrying the same label. The last column refers to the number of ECRs associated with an empty first BMU that nevertheless resulted closest to a second BMU related to documents belonging to the same class as the testing sample. The cross validation study also demonstrates that the labels given by the SOM are coherent with the actual classification and confirms the ability of SOM as a classification tool.

6. Conclusions

In this paper, a real case study regarding the engineering change process in the complex products industry was conducted. The study concerned the post-change stage of the engineering change process, in which past engineering change data are analyzed to discover information exploitable in new engineering changes. In particular, SOM was used for the clustering of natural language written texts produced during the engineering change process. The analyzed texts included the descriptions of the causes of changes contained in the ECR forms. Firstly, the SOM algorithm was used as a clustering tool to find relationships between the ECR texts and to cluster them accordingly. Subsequently, SOM was tested as a classification tool, and the results were validated through a leave-one-out cross validation technique.

The results of the real case study showed that SOM text clustering can be an effective tool for the improvement of engineering change process analysis. In particular, some of the advantages highlighted in this study are as follows.

(1) Text mining methods allow analyzing unstructured data and deriving high-quality information. The main difficulty in ECR analysis consisted in analyzing natural language written texts.

(2) Clustering analysis of the past ECRs stored in the company allows automatically gathering ECRs on the basis of the similarity between documents. When a new change triggers, the company can quickly focus on the cluster of interest. Clustering can support the company in knowing whether a similar change was already managed in the past, in analyzing the best solution adopted, and in learning to avoid the mistakes made in the past.

(3) The use of SOM for ECR text clustering allows automatically organizing large document collections. With respect to other clustering algorithms, the main advantage of SOM text clustering is that the similarity of the texts is preserved in the spatial organization of the neurons. The distance among prototypes in the SOM map can therefore be considered as an estimate of the similarity between documents belonging to different clusters. In addition, the SOM can first be computed using a representative subset of old input data; new input can then be mapped straight onto the most similar model without recomputing the whole mapping.

Nevertheless, the study showed some limitations of the application of SOM text clustering and classification. A first limitation is linked to natural language written texts: the terms contained in different texts may be similar even if an engineering change request concerns a different product, and this similarity of terms may influence the performance of SOM-based clustering. A second limitation is linked to the use of SOM as a classification method. Classification, indeed, requires the labeling of a training dataset. This activity requires a deep knowledge of the different kinds of ECRs managed during the engineering change process and may be difficult and time consuming.

In summary, the use of SOM text clustering in engineering change process analysis appears to be a promising direction for further research. A future direction of the work will consider the use of SOM text clustering on a larger dataset of ECRs, comparing SOM with other clustering algorithms, such as K-means or hierarchical clustering methods. Another direction of future research concerns the analysis of SOM robustness to parameter selection (i.e., the size and structure of the map, and the parameters and kinds of learning and neighborhood functions).

Symbols

α(t): Scalar-valued learning-rate factor
N: Contingency matrix
δ(C_k, C_l): Between-cluster distance
Δ(C_k): Within-cluster distance
Δm_i(t): Adjustment of the prototype vector m_i(t)
C: Collection of documents
C_k: Cluster of documents of index k
T: True partition of documents given by process operators
T_r: True partition of documents with label r
T_{r_k}: Majority partition containing the maximum number of documents from C_k
σ(t): Width of the kernel, which corresponds to the radius of N_c(t)
c(x_j): Index of the BMU associated with x_j
d_j: Document of index j
DBI(K): Davies-Bouldin Index for K clusters
df_p: Number of documents in the collection C which contain term p
h_{ci}(t): Neighborhood function
K: Total number of clusters
K*: Optimum number of clusters
M: Total number of terms in the documents collection
m_r: Number of documents in partition T_r
N: Total number of processed documents
N_c(t): Neighborhood around node c
n_k: Number of documents in cluster C_k
n_{kr}: Number of documents common to cluster C_k and partition T_r
P: Total number of units of the SOM
t: Index of time
tf_{pj}: Term frequency of term p in the j-th document
w_{pj}: Weight associated with term p in the j-th document
m_c: Prototype vector associated with the BMU
m_i: Prototype vector of index i
r_c: Position vector of index c
r_i: Position vector of index i
x_j: Numerical feature vector associated with the j-th document

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research has been funded by MIUR, ITIA CNR, and Cluster Fabbrica Intelligente (CFI): Sustainable Manufacturing project (CTN01 00163 148175) and Vis4Factory project (Sistemi Informativi Visuali per i processi di fabbrica nel settore dei trasporti, PON02 00634 3551288). The authors are thankful to the technical and managerial staff of MerMec SpA (Italy) for providing the industrial case study of this research.

References

[1] T. Jarratt, J. Clarkson, and C. E. Eckert, Design Process Improvement, Springer, New York, NY, USA, 2005.

[2] C. Wanstrom and P. Jonsson, "The impact of engineering changes on materials planning," Journal of Manufacturing Technology Management, vol. 17, no. 5, pp. 561–584, 2006.

[3] S. Leng, L. Wang, G. Chen, and D. Tang, "Engineering change information propagation in aviation industrial manufacturing execution processes," The International Journal of Advanced Manufacturing Technology, vol. 83, no. 1, pp. 575–585, 2016.

[4] W.-H. Wu, L.-C. Fang, W.-Y. Wang, M.-C. Yu, and H.-Y. Kao, "An advanced CMII-based engineering change management framework: the integration of PLM and ERP perspectives," International Journal of Production Research, vol. 52, no. 20, pp. 6092–6109, 2014.

Computational Intelligence and Neuroscience 11

[5] T. A. W. Jarratt, C. M. Eckert, N. H. M. Caldwell, and P. J. Clarkson, "Engineering change: an overview and perspective on the literature," Research in Engineering Design, vol. 22, no. 2, pp. 103–124, 2011.

[6] B. Hamraz, N. H. M. Caldwell, and P. J. Clarkson, "A holistic categorization framework for literature on engineering change management," Systems Engineering, vol. 16, no. 4, pp. 473–505, 2013.

[7] H. Ritter, T. Martinetz, and K. Schulten, Neural Computation and Self-Organizing Maps: An Introduction, Addison-Wesley, New York, NY, USA, 1992.

[8] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas, SOM Toolbox for Matlab 5, 2005, http://www.cis.hut.fi/somtoolbox.

[9] E. Fricke, B. Gebhard, H. Negele, and E. Igenbergs, "Coping with changes: causes, findings, and strategies," Systems Engineering, vol. 3, no. 4, pp. 169–179, 2000.

[10] J. Vesanto and E. Alhoniemi, "Clustering of the self-organizing map," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 586–600, 2000.

[11] A. Sharafi, P. Wolf, and H. Krcmar, "Knowledge discovery in databases on the example of engineering change management," in Proceedings of the Industrial Conference on Data Mining (Poster and Industry Proceedings), pp. 9–16, IBaI, July 2010.

[12] F. Elezi, A. Sharafi, A. Mirson, P. Wolf, H. Krcmar, and U. Lindemann, "A Knowledge Discovery in Databases (KDD) approach for extracting causes of iterations in Engineering Change Orders," in Proceedings of the ASME International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pp. 1401–1410, 2011.

[13] A. Sharafi, Knowledge Discovery in Databases: An Analysis of Change Management in Product Development, Springer, 2013.

[14] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, no. 1, pp. 59–69, 1982.

[15] T. Kohonen, "Essentials of the self-organizing map," Neural Networks, vol. 37, pp. 52–65, 2013.

[16] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.

[17] Y.-C. Liu, M. Liu, and X.-L. Wang, "Application of self-organizing maps in text clustering: a review," in Applications of Self-Organizing Maps, M. Johnsson, Ed., chapter 10, InTech, Rijeka, Croatia, 2012.

[18] A. Ultsch and H. P. Siemon, "Kohonen's self organizing feature maps for exploratory data analysis," in Proceedings of the International Neural Networks Conference (INNC '90), pp. 305–308, Kluwer, Paris, France, 1990.

[19] J. Vesanto, "SOM-based data visualization methods," Intelligent Data Analysis, vol. 3, no. 2, pp. 111–126, 1999.

[20] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, 1967.

[21] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224–227, 1979.

[22] N. Yorek, I. Ugulu, and H. Aydin, "Using self-organizing neural network map combined with Ward's clustering algorithm for visualization of students' cognitive structural models about aliveness concept," Computational Intelligence and Neuroscience, vol. 2016, Article ID 2476256, 14 pages, 2016.

[23] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "WEBSOM: self-organizing maps of document collections," in Proceedings of the Workshop on Self-Organizing Maps (WSOM '97), pp. 310–315, Espoo, Finland, June 1997.

[24] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM: self-organizing maps of document collections," Neurocomputing, vol. 21, no. 1–3, pp. 101–117, 1998.

[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.

[26] G. Salton and R. K. Waldstein, "Term relevance weights in on-line information retrieval," Information Processing & Management, vol. 14, no. 1, pp. 29–35, 1978.

[27] Y. C. Liu, X. Wang, and C. Wu, "ConSOM: a conceptional self-organizing map model for text clustering," Neurocomputing, vol. 71, no. 4–6, pp. 857–862, 2008.

[28] Y.-C. Liu, C. Wu, and M. Liu, "Research of fast SOM clustering for text information," Expert Systems with Applications, vol. 38, no. 8, pp. 9325–9333, 2011.

[29] T. Kohonen, S. Kaski, K. Lagus et al., "Self organization of a massive document collection," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 574–585, 2000.

[30] J. Zhu and S. Liu, "SOM network based clustering analysis of real estate enterprises," American Journal of Industrial and Business Management, vol. 4, no. 3, pp. 167–173, 2014.

[31] B. C. Till, J. Longo, A. R. Dobell, and P. F. Driessen, "Self-organizing maps for latent semantic analysis of free-form text in support of public policy analysis," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 1, pp. 71–86, 2014.

[32] G. Q. Huang, W. Y. Yee, and K. L. Mak, "Current practice of engineering change management in Hong Kong manufacturing industries," Journal of Materials Processing Technology, vol. 139, no. 1–3, pp. 481–487, 2003.

[33] S. Angers, "Changing the rules of the 'Change' game: employees and suppliers work together to transform the 737/757 production system," 2002, http://www.boeing.com/news/frontiers/archive/2002/may/i_ca2.html.

[34] E. Subrahmanian, C. Lee, and H. Granger, "Managing and supporting product life cycle through engineering change management for a complex product," Research in Engineering Design, vol. 26, no. 3, pp. 189–217, 2015.

[35] D. Zeimpekis and E. Gallopoulos, TMG: A Matlab Toolbox for Generating Term-Document Matrices from Text Collections, 2006, http://scgroup20.ceid.upatras.gr:8000/tmg.

[36] M. J. Zaki and W. Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014.


position vector on a low-dimensional regular grid r_i in the output space. The steps of the SOM learning algorithm are as follows.

(1) Initialization. Start with initial values of the prototype vectors m_i. In the absence of any prior information, the values of the prototype vectors m_i can be random or linear, and they are adjusted while the network learns.

(2) Sampling. Select a vector x_j from the training input space. The selection of x_j can be random.

(3) Matching. Determine the Best Matching Unit (BMU). Vector x_j is compared with all the prototype vectors, and the index c(x_j) of the BMU, that is, the prototype vector m_c which is closest to x_j, is chosen according to the smallest Euclidean distance as follows:

‖x_j − m_c‖ = min_i ‖x_j − m_i‖.  (1)

(4) Updating. Update the BMU and its neighbors. The prototype vectors of the winning output neuron and its neighbors are adjusted as

m_i(t + 1) = m_i(t) + Δm_i(t),  (2)

where t = 0, 1, 2, … is an index of the time. The value of Δm_i(t) in (2) is computed as follows:

Δm_i(t) = α(t) h_ci(t) [x_j(t) − m_i(t)],  (3)

where α(t) is the learning-rate factor and h_ci(t) the neighborhood function. In particular, the learning-rate factor α(t) lies in [0, 1] and is monotonically decreasing during the learning phase. The neighborhood function h_ci(t) depends on the distance between the nodes of indexes c and i in the output layer grid. A widely applied neighborhood kernel can be written in terms of the Gaussian function:

h_ci(t) = exp(−‖r_c − r_i‖² / (2σ²(t))),  (4)

where r_c and r_i are the position vectors of nodes c and i, and the parameter σ(t) defines the width of the kernel, which corresponds to the radius of the neighborhood N_c(t). N_c(t) refers to a neighborhood set of array points around the node of index c (Figure 1). The value h_ci(t) decreases during learning, from an initial value often comparable to the dimension of the output layer grid to a value equal to one.

During the learning of the SOM, phases 2–4 are repeated for a number of successive iterations, until the prototype vectors m_i represent, as much as possible, the input patterns x_j that are closer to the neurons in the two-dimensional map. After initialization, the SOM can be trained in a sequential or batch manner [8]. Sequential training is repetitive like batch training, but instead of sending all data vectors to the map for weight adjustment, one data vector at a time is sent to the network. Once the SOM is trained, each input vector is mapped to one neuron of the map, reducing the high-dimensional input space to a low-dimensional output space. The map size depends on the type of application: a bigger map reveals more details of information, whereas a smaller map is chosen to guarantee the generalization capability.
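The four steps above can be sketched as a minimal sequential SOM trainer. This is an illustrative Python/NumPy sketch, not the MATLAB SOM Toolbox implementation used later in the paper; the linear decay schedules for α(t) and σ(t), the rectangular grid, and all parameter values are assumptions made for the example.

```python
import numpy as np

def train_som(X, rows=7, cols=5, n_iter=2000, alpha0=0.5, sigma0=3.0, seed=0):
    """Sequential SOM training: initialization, sampling, matching, updating."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # (1) Initialization: random prototype vectors m_i
    M = rng.random((rows * cols, d))
    # Fixed position vectors r_i of the units on a 2-D rectangular grid
    R = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(n_iter):
        # (2) Sampling: pick one input vector x_j at random
        x = X[rng.integers(n)]
        # (3) Matching: the BMU minimizes the Euclidean distance, as in (1)
        bmu = int(np.argmin(np.linalg.norm(M - x, axis=1)))
        # alpha(t) and sigma(t) decrease monotonically (linear decay assumed here)
        frac = t / n_iter
        alpha = alpha0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 0.5
        # Gaussian neighborhood h_ci(t) over grid distances ||r_c - r_i||, as in (4)
        h = np.exp(-np.sum((R - R[bmu]) ** 2, axis=1) / (2.0 * sigma ** 2))
        # (4) Updating: m_i(t+1) = m_i(t) + alpha(t) h_ci(t) (x_j - m_i(t)), as in (2)-(3)
        M += alpha * h[:, None] * (x - M)
    return M, R
```

After training, mapping a new vector to its BMU is the same matching step applied once, which is what makes the method cheap to reuse on new inputs.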

Before application, the SOM method requires predefining the size and structure of the network, the neighborhood function, and the learning function. These parameters are generally selected on the basis of heuristic information [7, 8, 16, 17].

2.1. SOM Cluster Visualization. The SOM is an extremely versatile tool for visualizing high-dimensional data in low dimension. For visualization of the SOM, both the unified-distance matrix (U-matrix) [18] and Component Planes [19] are used. The U-matrix calculates distances between neighboring map units, and these distances can be visualized, using a color scale on the map, to represent clusters.

The U-matrix technique is a single plot that shows cluster borders according to dissimilarities between neighboring units. The distance ranges of the U-matrix visualized on the map are represented by different colors (or grey shades). Red colors correspond to large distances, that is, large gaps exist between the prototype vector values in the input space; blue colors correspond to small distances, that is, map units are strongly clustered together. U-matrices are useful tools for visualizing clusters in input data without having any a priori information about the clusters.
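The U-matrix idea can be sketched in a few lines, assuming for simplicity a rectangular (not hexagonal) grid stored row-major: each unit's value is the mean distance between its prototype and those of its grid neighbors, and plotting these values with a color scale reveals the cluster borders.

```python
import numpy as np

def u_matrix(M, rows, cols):
    """Mean distance from each unit's prototype to its grid neighbours.

    M: (rows*cols, d) prototype vectors, row-major over a rectangular grid.
    Large values mark cluster borders; small values mark tightly clustered units.
    """
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            dists = []
            # 4-neighbourhood on the rectangular grid
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(M[i] - M[rr * cols + cc]))
            U[r, c] = np.mean(dists)
    return U
```

On a hexagonal lattice, as used later in the case study, the same computation runs over six neighbors instead of four.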

Another important tool of visualization is the Component Plane, that is, a grid whose cells contain the value of the pth dimension of a prototype vector, displayed by variation of color. It helps to analyze the contribution of each variable to cluster structures and the correlation between the different variables in the dataset.

2.2. SOM Clustering Using the K-Means Algorithm. One of the drawbacks of SOM analysis is that, unlike other cluster methods, the SOM has no distinct cluster boundaries. When datasets become more complex, it is not easy to distinguish the clusters by pure visualization. As described in Vesanto and Alhoniemi [10], in SOM the prototype nodes can be used for clustering instead of the whole input dataset. Let C_1, …, C_k, …, C_K denote a cluster partition composed of K clusters. The choice of the best clustering can be determined by applying the well-known K-means algorithm [20]. This algorithm minimizes an error function computed on the sum of squared distances of each data point to its cluster center. The algorithm iteratively computes a partitioning of the data and updates cluster centers based on the error function. In this approach, the number of clusters K has to be fixed a priori; therefore, the K-means algorithm is run multiple times for each K ∈ [2, √N], where N is the number of samples. The best number of clusters K* can be selected based on the Davies–Bouldin Index (DBI) [21]. This index is based on a ratio of within-cluster and between-cluster distances and is calculated as

DBI(K) = (1/K) Σ_{k=1}^{K} max_{l≠k} { [Δ(C_k) + Δ(C_l)] / δ(C_k, C_l) },  (5)


where K is the number of clusters, and Δ(C_k), Δ(C_l), and δ(C_k, C_l) are the within-cluster and between-cluster distances, respectively. The optimum number of clusters K* corresponds to the minimum value of DBI(K). A SOM neural network combined with other clustering algorithms was used in Yorek et al. [22] for visualization of students' cognitive structural models.
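A sketch of DBI(K) as in (5), taking the within-cluster distance Δ(C_k) as the mean distance of the cluster's points to their centroid and the between-cluster distance δ(C_k, C_l) as the distance between centroids. These are common choices, but the original measure [21] admits several variants.

```python
import numpy as np

def davies_bouldin(X, labels):
    """DBI(K) as in (5): mean over clusters of the worst ratio
    (Delta(C_k) + Delta(C_l)) / delta(C_k, C_l)."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    # Within-cluster distance: mean distance of cluster points to their centroid
    delta = np.array([np.mean(np.linalg.norm(X[labels == k] - cents[i], axis=1))
                      for i, k in enumerate(ks)])
    K = len(ks)
    total = 0.0
    for i in range(K):
        # Between-cluster distance taken as the distance between centroids
        ratios = [(delta[i] + delta[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(K) if j != i]
        total += max(ratios)
    return total / K
```

Running K-means for each candidate K and keeping the partition with the lowest value of this function implements the model selection described above.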

3. SOM-Based Text Clustering

Text clustering is an unsupervised process used to separate a document collection into clusters on the basis of the similarity relationships between documents in the collection [17]. Let C = {d_1, …, d_j, …, d_N} be a collection of N documents to be clustered. The purpose of text clustering is to divide C into clusters C_1, …, C_k, …, C_K, with C_1 ∪ ⋯ ∪ C_k ∪ ⋯ ∪ C_K = C.

SOM text clustering can be divided into two main phases [23, 24]. The first phase is document preprocessing, which consists in using the Vector Space Model (VSM) to generate output document vectors from the input text documents. The second one is document clustering, which applies the SOM to the generated document vectors to obtain the output clusters.

3.1. Document Preprocessing. An important preprocessing aspect for text clustering is to consider how the text content can be represented in the form of a mathematical expression for further analysis and processing.

By means of the VSM, each document d_j (j = 1, …, N) can be represented as a vector in an M-dimensional space. In detail, each document d_j can be represented by a numerical feature vector x_j:

x_j = [w_1j, …, w_pj, …, w_Mj].  (6)

Each element w_pj of the vector usually represents a word (or a group of words) of the document collection, that is, the size of the vector is defined by the number of words (or groups of words) of the complete document collection.

The simplest approach is to assign to each document the Term Frequency and Inverse Document Frequency (TF-IDF) weighting scheme [25, 26]. The TF-IDF weighting scheme assigns to each term p in the jth document a weight w_pj computed as

w_pj = tf_pj × log(N / df_p),  (7)

where tf_pj is the term frequency, that is, the number of times that term p appears in document j, and df_p is the number of documents in the collection which contain term p.

According to the TF-IDF weighting scheme, w_pj is:

(1) higher when the term p occurs many times within a small number of documents (thus lending high discriminating power to those documents);

(2) lower when the term p occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);

(3) lower when the term p occurs in virtually all documents.
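The weighting scheme in (7) can be sketched in a few lines of Python. The natural logarithm is used here (the base only rescales the weights), and tokenization is assumed to have been done already; the toy terms in the usage note are invented for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF weights per (7): w_pj = tf_pj * log(N / df_p), natural log.

    docs: list of token lists. Returns the sorted vocabulary and, per
    document, a dict mapping each term to its TF-IDF weight.
    """
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))              # df_p: number of documents containing term p
    terms = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)                # tf_pj: occurrences of term p in document j
        vectors.append({p: tf[p] * math.log(N / df[p]) for p in tf})
    return terms, vectors
```

For instance, with `docs = [["antenna", "change"], ["metal", "change"]]`, the term "change" appears in every document and so receives weight 0, matching property (3) above.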

Before preprocessing the documents by the TF-IDF weighting scheme, the size of the list of terms created from the documents can be reduced using methods of stop words removal and stemming [23, 24].

In text-based documents, in fact, there are a great number of noninformative words, such as articles, prepositions, and conjunctions, called stop words. A stop-list is usually built with words that should be filtered out in the document representation process. Words to be included in the stop-list are language and task dependent; however, a set of general words can be considered stop words for almost all tasks, such as "and" and "or". Words that appear in very few documents are also filtered.

Another common phase in preprocessing is stemming, where the word stem is derived from the occurrence of a word by removing case and inflection information. For example, "computes", "computing", and "computer" are all mapped to the same stem "comput". Stemming does not significantly alter the information included in the document representation, but it does avoid feature expansion.
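Both steps can be illustrated with a toy pipeline. The stop-list and suffix rules below are invented for the example; a real pipeline, like the TMG toolbox used later in the paper, would rely on a full stop-list and a proper stemmer (e.g., Porter's for English, or a Snowball stemmer for Italian).

```python
STOP_WORDS = {"and", "or", "the", "a", "an", "of", "to", "in"}  # toy stop-list

def naive_stem(word):
    # Crude suffix stripping, for illustration only
    for suffix in ("ation", "ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, stem the remaining tokens."""
    tokens = [w.lower() for w in text.split()]
    return [naive_stem(w) for w in tokens if w not in STOP_WORDS]
```

Even this crude stemmer maps "computes", "computing", and "computer" to the common stem "comput", as in the example above.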

3.2. SOM Text Clustering. Once the feature vector x_j in (6) associated with each text d_j is obtained, the SOM algorithm described in Section 2 can be applied for text clustering. The text clustering method explained above is known as "SOM plus VSM"; other variants of it have been proposed by Liu et al. [27, 28]. An overview of the application of SOM in text clustering is provided by Liu et al. [17]. This kind of clustering method has been employed in domains such as patents [29], financial services [30], and public policy analysis [31].

4. The Engineering Change Process in Complex Products Industry

For complex products, such as trains, automobiles, or aircraft, engineering changes are unavoidable, and products or components have to be redesigned and retrofitted to accommodate the new changes to new installations and products. In these environments, an engineering change can involve the risk of due time delay. Huang et al. [32] carried out a survey on the effects of engineering changes in four manufacturing industries and found that the time invested in processing an engineering change varies from 2 to 36 person-days. In Angers [33], it is estimated that more than 35% of today's manufacturing resources are devoted just to managing changes to engineering drawings, manufacturing plans, and scheduling requirements. Engineering change processes in complex environments, such as the automobile, train, and aviation industries, were also studied by Leng et al. [3] and Subrahmanian et al. [34].

The phases of a real engineering change process in the complex products industry can be summarized as follows (Figure 2):

(1) A request for an engineering change is made and sent to an engineering manager. In this phase, standard ECR forms are used, outlining the reason for the change, the type of the change, which components or systems are likely to be affected, the person and the department making the request, and so forth.

Figure 2: The real engineering change process in complex products environments: (1) engineering change request (change trigger); (2) identification of possible solution(s) to the change request; (3) technical evaluation of the change; (4) economic evaluation of the change; (5) selection and approval of a solution; (6) implementation of the solution; (7) updating of as-built documents. The phases are grouped into before approval, during approval, and after approval. Adapted from Jarratt et al. [5].

(2) Potential solutions to the request for change are identified.

(3) A technical evaluation of the change is carried out. In this phase, the technical impact of implementing each solution is assessed. Various factors are considered, for example, the impact upon design and product requirements, the production schedule, and the resources to be devoted.

(4) An economic evaluation of the change is performed. The economic risk of implementing each solution is assessed. In this phase, the costs related to extra production time, replacement of materials, penalties for missed due dates, and so forth are estimated.

(5) Once a particular solution is selected, it is approved or not approved. The change is reviewed, and a cost-benefit analysis is carried out. When a solution is approved, the engineering change order is prepared and issued.

(6) The engineering change is implemented, and the documents, such as drawings, that are to be updated are identified.

(7) The as-built documents are updated. As-built documents are usually the original design documents, revised to reflect any changes made during the process, that is, design changes, material changes, and so forth.

Iterations of the process occur, for example, when a particular solution has a negative impact on product requirements or is too risky to be implemented; in this case the process returns to phase 2 and another solution is identified. Another iteration is possible when the costs of a solution are too high or more risk analysis is required, or when the proposed solution is completely refused.

As shown in Figure 2, no review process of similar changes faced in the past is carried out during the process or at the end. This aspect is emphasized by Jarratt et al. [5], who highlight that, after a period of time, the change should be reviewed to verify whether it achieved what was initially intended and what lessons can be learned for future change processes. Various factors can discourage examining the solutions adopted in the past for a particular change. First of all, there is the lack of suitable methods to analyze the documents collected during the process, that is, the ECRs. ECRs are often forms containing parts written in natural language. Analyzing these kinds of documents in the design phase of a product or a component, or when a new change request occurs, could be very time consuming without an appropriate solution.

In this context, the application of SOM text clustering can improve the process. When a new ECR occurs, in fact, the ECRs managed in the past that are similar to the current request could be analyzed in order to evaluate the best solution and to avoid repeating mistakes made in the past. In order to explore the existence of similarity between the different ECR texts, the first step is to verify the potential clustering present in the analyzed dataset. The application of SOM text clustering to ECR documents is explored in the next section.

5. The Use of SOM Text Clustering for ECRs Analysis

In order to test SOM text clustering, we used a dataset of N = 54 ECRs representing engineering changes managed during the engineering change process of a railway company. The dataset included the natural language descriptions of the causes of changes contained in the ECR forms. The documents were written in Italian, and the VSM described in Section 3 was used to generate output vectors from the input text documents. The number of terms, that is, the dimension of each vector x_j associated with each document in the dataset, after the stop word removal and stemming processes, was equal to M = 361.

Figure 3: U-matrix of the 54 ECR texts preprocessed through the VSM. The color scale is related to distances between map units: red colors represent large distances and blue colors represent small distances.

In our work, we used the MATLAB software. Specifically, the Text to Matrix Generator (TMG) toolbox [35] was employed for document preprocessing, and the SOM Toolbox [8] for SOM text clustering.

The map size of the SOM is computed through the heuristic formula in Vesanto et al. [8]. In detail, the number of neurons is computed as 5√N, where N is the number of training samples. The map shape is a rectangular grid with a hexagonal lattice. The neighborhood function is Gaussian, and the map is trained using the batch version of SOM. After the map is trained, each data vector is mapped to the most similar prototype vector in the map, that is, the BMU, which results from the matching step of the SOM algorithm. In our case, the network structure is a 2D lattice of 7 × 5 hexagons.

A first step in cluster analysis based on SOM is visual inspection through the U-matrix and Component Planes.

The U-matrix obtained from the application of SOM to the ECR dataset is shown in Figure 3. In order to represent additional information (i.e., distances), the SOM map size is augmented by inserting an additional neuron between each pair of neurons and reporting the distance between the two neurons. The U-matrix, and the different colors linked to the distance between neurons on the map, show that five clusters are present in the dataset. Each cluster has been highlighted by a circle.

Besides looking at the overall differences in the U-matrix, it is interesting as well to look at the differences between the single components of the input vectors, that is, the differences regarding each component associated with a single "term" in the input dataset. The total number of Component Planes obtained from the SOM corresponds to the total number of terms in our dataset, that is, M = 361. For illustrative purposes, Figure 4 shows two Component Planes chosen as an example. The first Component Plane (Figure 4(a)) is associated with the term of index p = 48, that is, the term "Antenna"; the second one (Figure 4(b)) is related to the term of index p = 195, that is, the term "Metal-Sheet". The difference between the two terms in the dataset can be represented by considering, for example, the map unit in the top left corner of the two figures. This map unit has high values for the variable "Term p = 48" and low values for the variable "Term p = 195". From the observation of the Component Planes, we can conclude that there is no correlation between the two terms; as a matter of fact, these two terms were never used together in the same documents.

Table 1: Class label and number of ECR texts.

Class                    Label   Number of ECR texts
(1) Metal-Sheet          MS      12
(2) Carter               CR      8
(3) Antenna              AN      8
(4) Semi-Finished Round  SR      7
(5) Hydraulic Panel      HP      9
(6) Pneumatic System     PS      10

As shown above, the U-matrix and the Component Planes allow obtaining a rough cluster structure of the ECR dataset. To get a finer clustering result, the prototype vectors from the map were clustered using the K-means algorithm. The best number of clusters in the SOM map grid can be determined by using the DBI values. Figure 5(a) shows the DBI values obtained by varying the number of clusters K in [2, 7]. The elbow point in the graph shows that the optimum number of clusters K*, corresponding to the minimum value of DBI(K), is equal to 5. The clustering result of SOM obtained by using K-means with K* = 5 clusters is shown in Figure 5(b). The BMUs belonging to each cluster are represented with a different color, and in each hexagon the number of documents associated with each BMU is provided.

5.1. External Validation of SOM Text Clustering. In the reference case study, the true classification of each ECR in the dataset was provided by process operators; this information was therefore used in our study in order to perform an external validation of the SOM text clustering. In particular, each ECR text was analyzed and classified with reference to the main component involved in the engineering change (namely, "Metal-Sheet", "Carter", "Antenna", "Semi-Finished Round", "Hydraulic Panel", and "Pneumatic System"). Table 1 reports the number of ECRs related to each component, along with the labels used in order to classify the ECR documents (namely, "MS", "CR", "AN", "SR", "HP", and "PS", respectively). Although a classification is available in the specific case study of this paper, it is worth noting that often such information may be unavailable.

By superimposing the classification information, the map grid resulting from the training of the SOM is as in Figure 6, where each hexagon reports the classification labels of documents sharing a given BMU (within brackets, the number of associated documents). From Figure 6 it can be observed that the unsupervised clustering produced by the SOM algorithm is quite coherent with the actual classification


Figure 4: Component Planes of two variables chosen as example: (a) the Component Plane related to the variable of index p = 48, corresponding to the term "Antenna"; (b) the Component Plane of the variable of index p = 195, corresponding to the term "Metal-Sheet". Panels (a) and (b) are linked by position: in each figure, the hexagon in a certain position corresponds to the same map unit. Each Component Plane shows the values of the variable in each map unit using color-coding; the label "d" indicates that denormalized values are used, that is, the values in the original data context [8].


Figure 5: (a) Davies-Bouldin Index (DBI) values for the number of clusters K in [2, 7]. (b) Clustering result of SOM using K-means with K* = 5 clusters. In each hexagon, the number of documents sharing the same BMU is provided.

given by process operators; in fact, ECR documents sharing the same classification label are included in BMUs belonging to the same cluster. It is worth noting that the actual label is assigned to each document after the SOM unsupervised training has been carried out. From Figure 6 it can be also

noted that ECRs classified either as "PS" or as "HP" are all included in a unique cluster. Furthermore, two documents out of twelve with label "MS" are assigned to clusters that include different labels, namely, "CR", "PS", and "HP". We investigated this misclassification and we found that the causes



Figure 6: Labeled map grid of the SOM. In each hexagon, the labels and total number of documents sharing the same BMU. Coloring refers to the K-means unsupervised clustering of the SOM.

Table 2: Contingency matrix N.

       MS   CR   AN   SR   HP   PS
       T1   T2   T3   T4   T5   T6   n_k
C1      0    0    8    0    0    0      8
C2      1    0    0    0    9   10     20
C3      1    8    0    0    0    0      9
C4      0    0    0    7    0    0      7
C5     10    0    0    0    0    0     10
m_r    12    8    8    7    9   10   N = 54

of the changes described in these two "MS" documents are quite similar to those contained in the documents labeled as "CR", "PS", and "HP".

Given the actual classification, the quality of the obtained clustering can be evaluated by computing four indices: purity, precision, recall, and F-measure [36].

Let T = {T_1, ..., T_r, ..., T_R} be the true partitioning given by the process operators, where partition T_r consists of all the documents with label r. Let m_r = |T_r| denote the number of documents in true partition T_r. Also, let C = {C_1, ..., C_k, ..., C_K} denote the clustering obtained via the SOM text clustering algorithm, and let n_k = |C_k| denote the number of documents in cluster C_k. The K × R contingency matrix N induced by clustering C and true partitioning T can be obtained by computing the elements N(k, r) = n_{kr} = |C_k ∩ T_r|, where n_{kr} denotes the number of documents that are common to cluster C_k and true partition T_r. The contingency matrix for SOM text clustering of ECRs is reported in Table 2.
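The construction of such a contingency matrix from per-document assignments is straightforward to script. The following sketch (plain Python with hypothetical toy data, not the actual ECR dataset) shows one way to do it:

```python
from collections import Counter

def contingency(cluster_of, label_of, clusters, labels):
    """Build the K x R matrix with entries n_kr = |C_k intersect T_r|
    from two parallel lists: the cluster and true label of each document."""
    counts = Counter(zip(cluster_of, label_of))
    return [[counts[(k, r)] for r in labels] for k in clusters]

# Toy example with 5 documents, 2 clusters, and 3 labels.
cluster_of = ["C1", "C1", "C2", "C2", "C2"]
label_of = ["AN", "AN", "MS", "HP", "HP"]
N_toy = contingency(cluster_of, label_of, ["C1", "C2"], ["MS", "AN", "HP"])
print(N_toy)  # [[0, 2, 0], [1, 0, 2]]
```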

Starting from Table 2, the following indices are computed:

(i) Purity Index. The cluster-specific purity index of cluster C_k is defined as

\[ \text{purity}_k = \frac{1}{n_k} \max_{r=1}^{R} n_{kr}. \quad (8) \]

The overall purity index of the entire clustering C is computed as

\[ \text{purity} = \sum_{k=1}^{K} \frac{n_k}{N}\, \text{purity}_k = \frac{1}{N} \sum_{k=1}^{K} \max_{r=1}^{R} n_{kr}. \quad (9) \]

As shown in Table 3, clusters C_1, C_4, and C_5 have a purity index equal to 1, that is, they contain entities from only one partition. Clusters C_2 and C_3 gather entities from different partitions, that is, T_1, T_5, and T_6 for cluster C_2, and T_1 and T_2 for cluster C_3. The overall purity index is equal to 0.79.

(ii) Precision Index. Given a cluster C_k, let T_{r_k} denote the majority partition, that is, the partition that contains the maximum number of documents from C_k, where r_k = argmax_{r=1,...,R} n_{kr}. The precision index of a cluster C_k is given by

\[ \text{prec}_k = \frac{1}{n_k} \max_{r=1}^{R} n_{kr} = \frac{n_{k r_k}}{n_k}. \quad (10) \]

For the clustering in Table 2, the majority partitions are T_3 (for C_1), T_6 (for C_2), T_2 (for C_3), T_4 (for C_4), and T_1 (for C_5). The precision indices show that all documents gathered in clusters C_1, C_4, and C_5 belong to the corresponding majority partitions T_3, T_4, and T_1. For cluster C_2, 50% of the documents belong to T_6, and, finally, about 89% of the documents in cluster C_3 (8 out of 9) belong to T_2.

(iii) Recall Index. Given a cluster C_k, it is defined as

\[ \text{recall}_k = \frac{n_{k r_k}}{|T_{r_k}|} = \frac{n_{k r_k}}{m_{r_k}}, \quad (11) \]

where m_{r_k} = |T_{r_k}|. It measures the fraction of documents in partition T_{r_k} shared in common with cluster C_k. The recall indices reported in Table 3 show that clusters C_1, C_2, C_3, and C_4 share 100% of the documents in their majority partitions T_3, T_6, T_2, and T_4, respectively. Cluster C_5 shares 83% of the documents in T_1 (10 out of 12).

(iv) F-Measure Index. It is the harmonic mean of the precision and recall values for each cluster. The F-measure for cluster C_k is therefore given as

\[ F_k = \frac{2 \cdot \text{prec}_k \cdot \text{recall}_k}{\text{prec}_k + \text{recall}_k} = \frac{2\, n_{k r_k}}{n_k + m_{r_k}}. \quad (12) \]

The overall F-measure for the clustering C is the mean of the clusterwise F-measure values:

\[ F = \frac{1}{K} \sum_{k=1}^{K} F_k. \quad (13) \]

Table 3 shows that the F-measure of clusters C_1 and C_4 is equal to 1, while the other values are less than 1. The low values of the F-measure for clusters C_2, C_3, and C_5 depend on a low precision index for clusters C_2 and C_3, and on a low recall index for cluster C_5. Consequently, the overall F-measure is equal to 0.90.
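All four indices follow mechanically from the contingency matrix. The sketch below (plain Python) recomputes the values of Table 3 from the entries of Table 2:

```python
# Recomputing the external validation indices from the contingency matrix
# of Table 2 (rows: clusters C1..C5; columns: true partitions T1..T6,
# i.e., MS, CR, AN, SR, HP, PS).
N_MAT = [
    [0, 0, 8, 0, 0, 0],   # C1
    [1, 0, 0, 0, 9, 10],  # C2
    [1, 8, 0, 0, 0, 0],   # C3
    [0, 0, 0, 7, 0, 0],   # C4
    [10, 0, 0, 0, 0, 0],  # C5
]
n_docs = sum(map(sum, N_MAT))             # N = 54
m = [sum(col) for col in zip(*N_MAT)]     # m_r = [12, 8, 8, 7, 9, 10]

precision, recall, f_measure = [], [], []
for row in N_MAT:
    n_k = sum(row)                                    # cluster size n_k
    r_k = max(range(len(row)), key=row.__getitem__)   # majority partition
    n_krk = row[r_k]
    precision.append(n_krk / n_k)                     # eq. (10) = purity_k
    recall.append(n_krk / m[r_k])                     # eq. (11)
    f_measure.append(2 * n_krk / (n_k + m[r_k]))      # eq. (12)

purity = sum(max(row) for row in N_MAT) / n_docs      # eq. (9): 43/54, ~0.79
overall_f = sum(f_measure) / len(f_measure)           # eq. (13): ~0.90
```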


Table 3: Clustering validation (cluster-specific and overall indices).

Purity:    purity_1 = 1, purity_2 = 0.5, purity_3 = 0.89, purity_4 = 1, purity_5 = 1; overall purity = 0.79
Precision: prec_1 = 1, prec_2 = 0.5, prec_3 = 0.89, prec_4 = 1, prec_5 = 1
Recall:    recall_1 = 1, recall_2 = 1, recall_3 = 1, recall_4 = 1, recall_5 = 0.83
F-measure: F_1 = 1, F_2 = 0.66, F_3 = 0.94, F_4 = 1, F_5 = 0.91; overall F = 0.90

Table 4: Results of SOM-based classification. The first and second columns report the labels and the number of ECR texts of the actual classification. The third column reports the number of ECRs correctly classified through the label of the first associated BMU. The fourth column reports the number of documents associated with an empty first BMU but correctly classified by considering the label of the documents sharing the second associated BMU.

Labels   Number of ECRs   Classified through the 1st BMU   Classified through the 2nd BMU
MS       12               10                               2
CR        8                6                               2
AN        8                7                               1
SR        7                6                               1
HP        9                6                               3
PS       10               10                               -

Given the actual classification, the SOM can be further validated through a leave-one-out cross validation technique in order to check its classification ability. In particular, N − 1 ECR documents are used for training and the remaining one for testing (iterating until each ECR text in the dataset has been used for testing).

At each iteration, once the SOM has been trained on N − 1 ECRs and the testing sample is presented as input, a BMU is selected in the matching step of the SOM algorithm, and the label of the training documents associated with that BMU is considered. In the case of an empty BMU, that is, one not associated with any training document, the closest unit associated with at least one training document is considered instead; in the case of a BMU associated with training documents having more than one label, the label with the greater number of documents is considered.
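This matching-and-labeling rule can be sketched as follows. The snippet is a simplified stand-in (plain Python, Euclidean distance, and a hypothetical `classify` helper), not the SOM Toolbox code actually used in the study:

```python
import math
from collections import Counter

def classify(x, prototypes, bmu_labels):
    """Label a held-out document vector x against a trained map.

    prototypes : one prototype vector per map unit.
    bmu_labels : per-unit list of the labels of the training documents
                 mapped to that unit (empty list for an empty unit).
    Units are ranked by distance to x (1st BMU, 2nd BMU, ...); the first
    unit with at least one training document wins, and ties among its
    labels are resolved by majority vote.
    """
    ranked = sorted(range(len(prototypes)),
                    key=lambda i: math.dist(x, prototypes[i]))
    for i in ranked:
        if bmu_labels[i]:
            return Counter(bmu_labels[i]).most_common(1)[0][0]
    return None  # no labeled unit on the map

# Toy usage: the nearest unit is empty, so the 2nd BMU decides the label.
protos = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
labels = [["MS", "MS", "CR"], [], ["AN"]]
predicted = classify([0.9, 1.0], protos, labels)  # -> "MS"
```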

Table 4 shows the results of the leave-one-out cross validation. For each row, that is, for a given ECR label, the second column reports the total number of documents in the dataset, while the last two columns report the number of testing ECRs correctly classified by the SOM. In particular, the third column reports the number of testing ECRs correctly classified because they were connected to a first BMU associated with training documents with the same label. The last column refers to the number of ECRs associated with an empty first BMU that nevertheless resulted closest to a second BMU related to documents belonging to the same class as the testing sample. The cross validation study also demonstrates that the labels given by the SOM are coherent with the actual classification, and it confirms the ability of the SOM as a classification tool.

6. Conclusions

In this paper, a real case study regarding the engineering change process in the complex products industry was conducted. The study concerned the postchange stage of the engineering change process, in which past engineering change data are analyzed to discover information exploitable in new engineering changes. In particular, SOM was used for the clustering of natural language written texts produced during the engineering change process. The analyzed texts included the descriptions of the causes of changes contained in the ECR forms. Firstly, the SOM algorithm was used as a clustering tool to find relationships between the ECR texts and to cluster them accordingly. Subsequently, SOM was tested as a classification tool and the results were validated through a leave-one-out cross validation technique.

The results of the real case study showed that SOM text clustering can be an effective tool for improving the analysis of the engineering change process. In particular, some of the advantages highlighted in this study are as follows:

(1) Text mining methods allow analyzing unstructured data and deriving high-quality information. The main difficulty in ECR analysis consisted in analyzing natural language written texts.

(2) Clustering analysis of past ECRs stored in the company allows automatically gathering ECRs on the basis of the similarity between documents. When a new change is triggered, the company can quickly focus on


the cluster of interest. Clustering can support the company in knowing whether a similar change was already managed in the past, in analyzing the best solution adopted, and in learning to avoid the mistakes made in the past.

(3) The use of SOM for ECR text clustering allows automatically organizing large document collections. With respect to other clustering algorithms, the main advantage of SOM text clustering is that the similarity of the texts is preserved in the spatial organization of the neurons. The distance among prototypes in the SOM map can therefore be considered as an estimate of the similarity between documents belonging to clusters. In addition, the SOM can first be computed using a representative subset of old input data. New input can be mapped straight onto the most similar model without recomputing the whole mapping.

Nevertheless, the study showed some limitations of the application of SOM text clustering and classification. A first limitation is linked to natural language written texts: the terms contained in different texts may be similar even if an engineering change request concerns a different product, and this similarity of terms may influence the performance of SOM-based clustering. A second limitation is linked to the use of SOM as a classification method. Classification, indeed, requires the labeling of a training dataset. This activity requires a deep knowledge of the different kinds of ECRs managed during the engineering change process, and it may be difficult and time consuming.

In summary, the use of SOM text clustering in engineering change process analysis appears to be a promising direction for further research. A future direction of the work will consider the use of SOM text clustering on a larger dataset of ECRs, comparing SOM with other clustering algorithms such as K-means or hierarchical clustering methods. Another direction of future research concerns the analysis of SOM robustness to parameter selection (i.e., the size and structure of the map, and the parameters and kinds of learning and neighborhood functions).

Symbols

α(t): Scalar-valued learning-rate factor
N: Contingency matrix (K × R)
δ(C_k, C_l): Between-cluster distance
Δ(C_k): Within-cluster distance
Δm_i(t): Adjustment of the prototype vector m_i(t)
C: Collection of documents
C_k: Cluster of documents of index k
T: True partition of documents given by process operators
T_r: True partition of documents with label r
T_{r_k}: Majority partition containing the maximum number of documents from C_k
σ(t): Width of the kernel, which corresponds to the radius of N_c(t)
c(x_j): Index of the BMU associated with x_j
d_j: Document of index j
DBI(K): Davies-Bouldin Index for K clusters
df_p: Number of documents in the collection C which contain term p
h_{ci}(t): Neighborhood function
K: Total number of clusters
K*: Optimum number of clusters
M: Total number of terms in the document collection
m_r: Number of documents in partition T_r
N: Total number of processed documents
N_c(t): Neighborhood around node c
n_k: Number of documents in cluster C_k
n_{kr}: Number of documents common to cluster C_k and partition T_r
P: Total number of units of the SOM
t: Index of time
tf_{pj}: Term frequency of term p in the jth document
w_{pj}: Weight associated with term p in the jth document
m_c: Prototype vector associated with the BMU
m_i: Prototype vector of index i
r_c: Position vector of index c
r_i: Position vector of index i
x_j: Numerical feature vector associated with the jth document

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research has been funded by MIUR, ITIA CNR, and Cluster Fabbrica Intelligente (CFI) Sustainable Manufacturing project (CTN01 00163 148175) and Vis4Factory project (Sistemi Informativi Visuali per i processi di fabbrica nel settore dei trasporti [Visual Information Systems for factory processes in the transport sector], PON02 00634 3551288). The authors are thankful to the technical and managerial staff of MerMec SpA (Italy) for providing the industrial case study of this research.

References

[1] T. Jarratt, J. Clarkson, and C. E. Eckert, Design Process Improvement, Springer, New York, NY, USA, 2005.

[2] C. Wanstrom and P. Jonsson, "The impact of engineering changes on materials planning," Journal of Manufacturing Technology Management, vol. 17, no. 5, pp. 561–584, 2006.

[3] S. Leng, L. Wang, G. Chen, and D. Tang, "Engineering change information propagation in aviation industrial manufacturing execution processes," The International Journal of Advanced Manufacturing Technology, vol. 83, no. 1, pp. 575–585, 2016.

[4] W.-H. Wu, L.-C. Fang, W.-Y. Wang, M.-C. Yu, and H.-Y. Kao, "An advanced CMII-based engineering change management framework: the integration of PLM and ERP perspectives," International Journal of Production Research, vol. 52, no. 20, pp. 6092–6109, 2014.


[5] T. A. W. Jarratt, C. M. Eckert, N. H. M. Caldwell, and P. J. Clarkson, "Engineering change: an overview and perspective on the literature," Research in Engineering Design, vol. 22, no. 2, pp. 103–124, 2011.

[6] B. Hamraz, N. H. M. Caldwell, and P. J. Clarkson, "A holistic categorization framework for literature on engineering change management," Systems Engineering, vol. 16, no. 4, pp. 473–505, 2013.

[7] H. Ritter, T. Martinetz, and K. Schulten, Neural Computation and Self-Organizing Maps: An Introduction, Addison-Wesley, New York, NY, USA, 1992.

[8] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas, SOM Toolbox for Matlab 5, 2005, http://www.cis.hut.fi/somtoolbox.

[9] E. Fricke, B. Gebhard, H. Negele, and E. Igenbergs, "Coping with changes: causes, findings, and strategies," Systems Engineering, vol. 3, no. 4, pp. 169–179, 2000.

[10] J. Vesanto and E. Alhoniemi, "Clustering of the self-organizing map," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 586–600, 2000.

[11] A. Sharafi, P. Wolf, and H. Krcmar, "Knowledge discovery in databases on the example of engineering change management," in Proceedings of the Industrial Conference on Data Mining - Poster and Industry Proceedings, pp. 9–16, IBaI, July 2010.

[12] F. Elezi, A. Sharafi, A. Mirson, P. Wolf, H. Krcmar, and U. Lindemann, "A Knowledge Discovery in Databases (KDD) approach for extracting causes of iterations in Engineering Change Orders," in Proceedings of the ASME International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pp. 1401–1410, 2011.

[13] A. Sharafi, Knowledge Discovery in Databases: An Analysis of Change Management in Product Development, Springer, 2013.

[14] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, no. 1, pp. 59–69, 1982.

[15] T. Kohonen, "Essentials of the self-organizing map," Neural Networks, vol. 37, pp. 52–65, 2013.

[16] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.

[17] Y.-C. Liu, M. Liu, and X.-L. Wang, "Application of self-organizing maps in text clustering: a review," in Applications of Self-Organizing Maps, M. Johnsson, Ed., chapter 10, InTech, Rijeka, Croatia, 2012.

[18] A. Ultsch and H. P. Siemon, "Kohonen's self organizing feature maps for exploratory data analysis," in Proceedings of the International Neural Networks Conference (INNC '90), pp. 305–308, Kluwer, Paris, France, 1990.

[19] J. Vesanto, "SOM-based data visualization methods," Intelligent Data Analysis, vol. 3, no. 2, pp. 111–126, 1999.

[20] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, 1967.

[21] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224–227, 1979.

[22] N. Yorek, I. Ugulu, and H. Aydin, "Using self-organizing neural network map combined with Ward's clustering algorithm for visualization of students' cognitive structural models about aliveness concept," Computational Intelligence and Neuroscience, vol. 2016, Article ID 2476256, 14 pages, 2016.

[23] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "WEBSOM: self-organizing maps of document collections," in Proceedings of the Workshop on Self-Organizing Maps (WSOM '97), pp. 310–315, Espoo, Finland, June 1997.

[24] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM: self-organizing maps of document collections," Neurocomputing, vol. 21, no. 1–3, pp. 101–117, 1998.

[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.

[26] G. Salton and R. K. Waldstein, "Term relevance weights in on-line information retrieval," Information Processing & Management, vol. 14, no. 1, pp. 29–35, 1978.

[27] Y. C. Liu, X. Wang, and C. Wu, "ConSOM: a conceptional self-organizing map model for text clustering," Neurocomputing, vol. 71, no. 4–6, pp. 857–862, 2008.

[28] Y.-C. Liu, C. Wu, and M. Liu, "Research of fast SOM clustering for text information," Expert Systems with Applications, vol. 38, no. 8, pp. 9325–9333, 2011.

[29] T. Kohonen, S. Kaski, K. Lagus et al., "Self organization of a massive document collection," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 574–585, 2000.

[30] J. Zhu and S. Liu, "SOM network based clustering analysis of real estate enterprises," American Journal of Industrial and Business Management, vol. 4, no. 3, pp. 167–173, 2014.

[31] B. C. Till, J. Longo, A. R. Dobell, and P. F. Driessen, "Self-organizing maps for latent semantic analysis of free-form text in support of public policy analysis," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 1, pp. 71–86, 2014.

[32] G. Q. Huang, W. Y. Yee, and K. L. Mak, "Current practice of engineering change management in Hong Kong manufacturing industries," Journal of Materials Processing Technology, vol. 139, no. 1–3, pp. 481–487, 2003.

[33] S. Angers, "Changing the rules of the 'Change' game: employees and suppliers work together to transform the 737/757 production system," 2002, http://www.boeing.com/news/frontiers/archive/2002/may/i_ca2.html.

[34] E. Subrahmanian, C. Lee, and H. Granger, "Managing and supporting product life cycle through engineering change management for a complex product," Research in Engineering Design, vol. 26, no. 3, pp. 189–217, 2015.

[35] D. Zeimpekis and E. Gallopoulos, TMG: A Matlab Toolbox for Generating Term-Document Matrices from Text Collections, 2006, http://scgroup20.ceid.upatras.gr:8000/tmg.

[36] M. J. Zaki and W. Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014.



where K is the number of clusters, and Δ(C_k), Δ(C_l), and δ(C_k, C_l) are the within-cluster and between-cluster distances, respectively. The optimum number of clusters K* corresponds to the minimum value of DBI(K). An SOM neural network combined with other clustering algorithms was used in Yorek et al. [22] for the visualization of students' cognitive structural models.
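A compact implementation of the index is sketched below. Note that the within- and between-cluster distances Δ and δ admit several definitions; this sketch assumes the mean distance to the centroid and the centroid-to-centroid distance, one common choice that the text does not fix explicitly:

```python
import math

def dbi(clusters):
    """Davies-Bouldin Index DBI(K) for a list of K clusters of points."""
    def centroid(c):
        return [sum(coord) / len(c) for coord in zip(*c)]
    cents = [centroid(c) for c in clusters]
    # Within-cluster distance: mean distance of points to their centroid.
    scatter = [sum(math.dist(p, mu) for p in c) / len(c)
               for c, mu in zip(clusters, cents)]
    K = len(clusters)
    # For each cluster, take the worst (largest) similarity ratio.
    return sum(max((scatter[k] + scatter[l]) / math.dist(cents[k], cents[l])
                   for l in range(K) if l != k)
               for k in range(K)) / K

# Compact, well-separated clusters score lower than overlapping ones.
tight = dbi([[[0, 0], [0, 1]], [[10, 10], [10, 11]]])
loose = dbi([[[0, 0], [0, 4]], [[2, 0], [2, 4]]])
```

Scanning `dbi` over candidate values of K and picking the minimum mirrors the selection of K* described above.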

3. SOM-Based Text Clustering

Text clustering is an unsupervised process used to separate a document collection into clusters on the basis of the similarity relationships between documents in the collection [17]. Let C = {d_1, ..., d_j, ..., d_N} be a collection of N documents to be clustered. The purpose of text clustering is to divide C into clusters C_1, ..., C_k, ..., C_K, with C_1 ∪ ... ∪ C_k ∪ ... ∪ C_K = C.

SOM text clustering can be divided into two main phases [23, 24]. The first phase is document preprocessing, which consists in using the Vector Space Model (VSM) to generate output document vectors from the input text documents. The second one is document clustering, which applies the SOM on the generated document vectors to obtain the output clusters.

3.1. Document Preprocessing. An important preprocessing aspect for text clustering is to consider how the text content can be represented in the form of a mathematical expression for further analysis and processing.

By means of the VSM, each document d_j (j = 1, ..., N) can be represented as a vector in an M-dimensional space. In detail, each document d_j can be represented by a numerical feature vector x_j:

\[ \mathbf{x}_j = [w_{1j}, \ldots, w_{pj}, \ldots, w_{Mj}]. \quad (6) \]

Each element w_{pj} of the vector usually represents a word (or a group of words) of the document collection, that is, the size of the vector is defined by the number of words (or groups of words) of the complete document collection.

The simplest approach is to assign to each document the Term Frequency and Inverse Document Frequency (TF-IDF) weighting scheme [25, 26]. The TF-IDF weighting scheme assigns to each term p in the jth document a weight w_{pj} computed as

\[ w_{pj} = tf_{pj} \times \log \frac{N}{df_p}, \quad (7) \]

where tf_{pj} is the term frequency, that is, the number of times that term p appears in document j, and df_p is the number of documents in the collection which contain term p.

According to the TF-IDF weighting scheme, w_{pj} is:

(1) higher when the term p occurs many times within a small number of documents (thus lending high discriminating power to those documents);

(2) lower when the term p occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);

(3) lower when the term p occurs in virtually all documents.
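Equation (7) can be sketched in a few lines. The toy collection below is hypothetical (the actual ECR texts are in Italian and far larger) and illustrates, in particular, that a term occurring in every document receives weight zero, consistent with property (3):

```python
import math

# Toy document collection (hypothetical, for illustration only).
docs = [
    ["antenna", "broken", "antenna"],
    ["panel", "broken"],
    ["antenna", "panel", "broken"],
]
N = len(docs)
vocab = sorted({t for d in docs for t in d})
df = {t: sum(t in d for d in docs) for t in vocab}  # document frequency df_p

def vectorize(doc):
    """Feature vector x_j of eq. (6), weighted by TF-IDF as in eq. (7)."""
    return [doc.count(t) * math.log(N / df[t]) for t in vocab]

vectors = [vectorize(d) for d in docs]
```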

Before applying the TF-IDF weighting scheme, the size of the list of terms created from the documents can be reduced using stop word removal and stemming [23, 24].

Text documents, in fact, contain a great number of noninformative words, such as articles, prepositions, and conjunctions, called stop words. A stop-list is usually built with words that should be filtered out in the document representation process. Words to be included in the stop-list are language and task dependent; however, a set of general words can be considered stop words for almost all tasks, such as "and" and "or". Words that appear in very few documents are also filtered out.

Another common preprocessing phase is stemming, where the word stem is derived from the occurrence of a word by removing case and inflection information. For example, "computes", "computing", and "computer" are all mapped to the same stem "comput". Stemming does not significantly alter the information included in the document representation, but it does avoid feature expansion.
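A toy version of both steps is sketched below; the stop-list and the suffix rules are illustrative English stand-ins, not the Italian-language preprocessing actually performed by the TMG toolbox:

```python
STOP_WORDS = {"and", "or", "the", "a", "an", "of"}
SUFFIXES = ("ing", "es", "er", "s")  # naive rules, checked in this order

def stem(word):
    """Strip one known suffix, keeping at least a 3-letter stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    tokens = [w.lower() for w in text.split()]
    return [stem(w) for w in tokens if w not in STOP_WORDS]

print(preprocess("computes computing and the computer"))
# ['comput', 'comput', 'comput']
```

This reproduces the "comput" example from the text; a production system would use a proper stemmer for the target language instead of these ad hoc suffix rules.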

3.2. SOM Text Clustering. Once the feature vector x_j in (6) associated with each text d_j is obtained, the SOM algorithm described in Section 2 can be applied for text clustering. The text clustering method explained above is known as "SOM plus VSM"; other variants of it have been proposed by Liu et al. [27, 28]. An overview of the application of SOM in text clustering is provided by Liu et al. [17]. This kind of clustering method has been employed in domains such as patents [29], financial services [30], and public policy analysis [31].

4. The Engineering Change Process in Complex Products Industry

For complex products, such as trains, automobiles, or aircraft, engineering changes are unavoidable, and products or components have to be redesigned and retrofitted to accommodate the new changes to new installations and products. In these environments, an engineering change can involve the risk of due time delay. Huang et al. [32] carried out a survey about the effects of engineering changes on four manufacturing industries and found that the time invested in processing an engineering change varies from 2 to 36 person-days. In Angers [33], it is estimated that more than 35% of today's manufacturing resources are devoted just to managing changes to engineering drawings, manufacturing plans, and scheduling requirements. Engineering change processes in complex environments, such as the automobile, train, and aviation industries, were also studied by Leng et al. [3] and Subrahmanian et al. [34].

The phases of a real engineering change process in the complex products industry can be summarized as follows (Figure 2):

(1) A request for an engineering change is made and sent to an engineering manager. In this phase, standard ECR forms are used, outlining the reason for the change, the type of the change, which components or



Figure 2: The real engineering change process in complex products environments. Adapted from Jarratt et al. [5].

systems are likely to be affected, the person and the department making the request, and so forth.

(2) Potential solutions to the request for change are identified.

(3) Technical evaluation of the change is carried out. In this phase, the technical impact of implementing each solution is assessed. Various factors are considered, for example, the impact upon design and product requirements, the production schedule, and the resources to be devoted.

(4) Economic evaluation of the change is performed. The economic risk of implementing each solution is assessed. In this phase, costs related to extra production times, replacements of materials, penalties for missed due dates, and so forth are estimated.

(5) Once a particular solution is selected, it is approved or not approved. The change is reviewed and a cost-benefit analysis is carried out. When a solution is approved, the engineering change order is prepared and issued.

(6) The engineering change is implemented, and the documents, such as drawings, that are to be updated are identified.

(7) The as-built documents are updated. As-built documents are usually the original design documents, revised to reflect any changes made during the process, that is, design changes, material changes, and so forth.

Iterations of the process occur, for example, when a particular solution has a negative impact on product requirements or is too risky to be implemented, so the process returns to phase 2 and another solution is identified. Another iteration is possible when the costs of a solution are too high or more risk analysis is required, or when the proposed solution is completely refused.

As shown in Figure 2, no review process of similar changes faced in the past is carried out during the process or at its end. This aspect is emphasized by Jarratt et al. [5], who highlight that after a period of time the change should be reviewed to verify whether it achieved what was initially intended and what lessons can be learned for future change processes. Various factors can discourage examining the solutions adopted in the past for a particular change. First of all, there is a lack of suitable methods to analyze the documents collected during the process, that is, the ECRs. ECRs are often forms containing parts written in natural language. Analyzing these kinds of documents in the design phase of a product or a component, or when a new change request occurs, could be very time consuming without an appropriate solution.

In this context, the application of SOM text clustering can improve the process. When a new ECR occurs, in fact, the ECRs managed in the past that are similar to the current request could be analyzed in order to evaluate the best solution and to avoid repeating the mistakes made in the past. In order to explore the existence of similarities between the different ECR texts, the first step is to verify the potential clustering present in the analyzed dataset. The application of SOM text clustering to ECR documents is explored in the next section.

5. The Use of SOM Text Clustering for ECRs Analysis

In order to test SOM text clustering, we used a dataset of N = 54 ECRs representing engineering changes managed during the engineering change process of a railway company. The dataset included the natural language written descriptions of the causes of changes contained in the ECR forms. The documents were written in Italian, and the VSM in Section 3 was used to generate output vectors from the input text documents. The number of terms, that is, the dimension of each vector x_j associated with each document

6 Computational Intelligence and Neuroscience

05

11

17

Figure 3 U-matrix of the 54 ERCs texts preprocessed throughthe VSM Color scale is related to distances between map unitsRed colors represent large distances and blue colors represent smalldistances

in the dataset after the stop word removal and stemmingprocesses was equal to 119872 = 361
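The preprocessing pipeline just described (stop word removal, stemming, and VSM weighting) can be sketched in a few lines. This is a minimal illustrative sketch, not the TMG toolbox used in the study: the tiny stop word list and the crude suffix-stripping "stemmer" are stand-ins for real language resources, and the weight w_pj = tf_pj · log(N/df_p) is the standard tf-idf scheme consistent with the symbols tf_pj and df_p defined in this paper.

```python
import math
import re

STOP_WORDS = {"the", "a", "of", "is", "in", "and", "to"}  # tiny list for illustration

def preprocess(text):
    """Tokenize, remove stop words, and apply a crude suffix-stripping 'stemmer'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]  # placeholder for a real stemmer

def vsm_matrix(documents):
    """Return (terms, X): the vocabulary and tf-idf document vectors
    w_pj = tf_pj * log(N / df_p), one row per document."""
    docs = [preprocess(d) for d in documents]
    terms = sorted({t for doc in docs for t in doc})
    n = len(docs)
    df = {p: sum(1 for doc in docs if p in doc) for p in terms}  # document frequency df_p
    X = [[doc.count(p) * math.log(n / df[p]) for p in terms] for doc in docs]
    return terms, X
```

Applied to the 54 Italian ECR descriptions (with an Italian stop word list and stemmer), this kind of pipeline yields the 54 × 361 term-document representation used below.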

In our work we used MATLAB. Specifically, the Term to Matrix Generator (TMG) toolbox [35] was employed for document preprocessing, and the SOM Toolbox [8] for SOM text clustering.

The map size of the SOM is computed through the heuristic formula in Vesanto et al. [8]. In detail, the number of neurons is computed as 5√N, where N is the number of training samples. The map shape is a rectangular grid with hexagonal lattice. The neighborhood function is Gaussian, and the map is trained using the batch version of SOM. After the map is trained, each data vector is mapped to the most similar prototype vector in the map, that is, the BMU, which results from the matching step in the SOM algorithm. In our case, the network structure is a 2D lattice of 7 × 5 hexagonal neurons.

A first step in SOM-based cluster analysis is visual inspection through the U-matrix and the Component Planes.

The U-matrix obtained by applying SOM to the ECR dataset is shown in Figure 3. In order to represent additional information (i.e., distances), the SOM map size is augmented by inserting an additional neuron between each pair of neurons and reporting the distance between the two neurons. The U-matrix, with its colors linked to the distance between neurons on the map, shows that five clusters are present in the dataset. Each cluster has been highlighted by a circle.
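The idea behind the U-matrix can be sketched as follows: for each map unit, average the distance between its prototype and the prototypes of its immediate grid neighbors, so that large values mark cluster borders. This simplified variant (our own helper, not the toolbox routine) assigns one value per unit rather than inserting extra neurons between pairs as the visualization in Figure 3 does.

```python
import numpy as np

def u_matrix(M, pos, radius=1.01):
    """For each map unit, the mean distance between its prototype and the
    prototypes of adjacent grid units; high values mark cluster borders."""
    P = len(M)
    grid_d = np.sqrt(((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1))
    u = np.empty(P)
    for i in range(P):
        nb = (grid_d[i] > 0) & (grid_d[i] <= radius)  # adjacent units only
        u[i] = np.sqrt(((M[nb] - M[i]) ** 2).sum(-1)).mean()
    return u
```

On the hexagonal grid above, all immediate neighbors sit at grid distance 1, so the default radius captures exactly the adjacent units.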

Besides looking at the overall differences in the U-matrix, it is interesting as well to look at the differences between each component present in the input vectors, that is, the differences regarding each component associated with a single "term" in the input dataset. The total number of Component Planes obtained from the SOM corresponds to the total number of terms in our dataset, that is, M = 361. For illustrative purposes, Figure 4 shows two Component Planes chosen as an example. The first Component Plane

Table 1: Class label and number of ECR texts.

Class                     Label   Number of ECR texts
(1) Metal-Sheet           MS      12
(2) Carter                CR      8
(3) Antenna               AN      8
(4) Semi-Finished Round   SR      7
(5) Hydraulic Panel       HP      9
(6) Pneumatic System      PS      10

(Figure 4(a)) is associated with the term of index p = 48, that is, the term "Antenna"; the second one (Figure 4(b)) is related to the term of index p = 195, that is, the term "Metal-Sheet". The difference between the two terms in the dataset can be seen by considering, for example, the map unit in the top left corner of the two figures. This map unit has high values for the variable "Term p = 48" and low values for the variable "Term p = 195". From the observation of the Component Planes we can conclude that there is no correlation between the two terms; as a matter of fact, these two terms were never used together in the same documents.

As shown above, the U-matrix and the Component Planes allow obtaining a rough cluster structure of the ECR dataset. To get a finer clustering result, the prototype vectors from the map were clustered using the K-means algorithm. The best number of clusters in the SOM map grid can be determined by using the DBI values. Figure 5(a) shows the DBI values obtained by varying the number of clusters K in [2, 7]. The elbow point in the graph shows that the optimum number of clusters K*, corresponding to the minimum value of DBI(K), is equal to 5. The clustering result of SOM obtained by using K-means with K* = 5 clusters is shown in Figure 5(b). The BMUs belonging to each cluster are represented with a different color, and in each hexagon the number of documents associated with each BMU is provided.
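The prototype-clustering step can be sketched as follows: run K-means on the SOM prototype vectors for each candidate K and keep the K that minimizes the Davies-Bouldin Index. This is an illustrative NumPy re-implementation (with a simple deterministic farthest-point initialization of our own choosing), not the routine used in the paper.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain K-means with deterministic farthest-point initialization."""
    idx = [0]
    for _ in range(k - 1):  # each new centroid is far from those chosen so far
        d = np.min(((X[:, None] - X[idx][None]) ** 2).sum(-1), axis=1)
        idx.append(int(np.argmax(d)))
    C = X[idx].astype(float)
    for _ in range(n_iter):
        lab = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[lab == j].mean(axis=0) if np.any(lab == j) else C[j]
                      for j in range(k)])
    return lab, C

def davies_bouldin(X, lab, C):
    """DBI(K): mean over clusters of the worst (S_i + S_j) / d(C_i, C_j) ratio,
    where S_i is the mean distance of cluster i's members from its centroid."""
    k = len(C)
    S = np.array([np.linalg.norm(X[lab == i] - C[i], axis=1).mean() for i in range(k)])
    return np.mean([max((S[i] + S[j]) / np.linalg.norm(C[i] - C[j])
                        for j in range(k) if j != i) for i in range(k)])

def best_k(prototypes, k_range=range(2, 8)):
    """Choose K* in [2, 7] as the argmin of DBI, as in Figure 5(a)."""
    scores = {k: davies_bouldin(prototypes, *kmeans(prototypes, k)) for k in k_range}
    return min(scores, key=scores.get), scores
```

Applied to the 35 trained prototype vectors, this procedure reproduces the K* = 5 selection step of Figure 5(a).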

5.1. External Validation of SOM Text Clustering. In the reference case study, the true classification of each ECR in the dataset was provided by process operators. Therefore, this information was used in our study in order to perform an external validation of the SOM text clustering. In particular, each ECR text was analyzed and classified with reference to the main component involved in the engineering change (namely, "Metal-Sheet", "Carter", "Antenna", "Semi-Finished Round", "Hydraulic Panel", and "Pneumatic System"). Table 1 reports the number of ECRs related to each component, along with the labels used to classify the ECR documents (namely, "MS", "CR", "AN", "SR", "HP", and "PS", respectively). Although a classification is available in the specific case study of this paper, it is worth noting that often such information may be unavailable.

By superimposing the classification information, the map grid resulting from the training of the SOM is as in Figure 6, where each hexagon reports the classification label of the documents sharing a given BMU (within brackets, the number of associated documents). From Figure 6, it can be observed that the unsupervised clustering produced by the SOM algorithm is quite coherent with the actual classification



Figure 4: Component Planes of two variables chosen as an example. (a) Component Plane related to the variable of index p = 48, corresponding to the term "Antenna". (b) Component Plane of the variable of index p = 195, corresponding to the term "Metal-Sheet". (a) and (b) are linked by position: in each figure, the hexagon in a certain position corresponds to the same map unit. Each Component Plane shows the values of the variable in each map unit using color-coding. The letter "d" means that denormalized values are used, that is, the values in the original data context [8].


Figure 5: (a) Davies-Bouldin Index (DBI) values for number of clusters K in [2, 7]. (b) Clustering result of SOM using K-means with K* = 5 clusters. In each hexagon, the number of documents sharing the same BMU is provided.

given by process operators; in fact, ECR documents sharing the same classification label are included in BMUs belonging to the same cluster. It is worth noting that the actual label is assigned to each document after the SOM unsupervised training has been carried out. From Figure 6, it can also be

noted that ECRs classified either as "PS" or as "HP" are all included in a unique cluster. Furthermore, two documents out of twelve with label "MS" are assigned to clusters that include different labels, namely, "CR", "PS", and "HP". We investigated this misclassification, and we found that the causes



Figure 6: Labeled map grid of SOM. In each hexagon, the labels and the total number of documents sharing the same BMU are reported. Coloring refers to the K-means unsupervised clustering of the SOM.

Table 2: Contingency matrix N.

       MS (T1)  CR (T2)  AN (T3)  SR (T4)  HP (T5)  PS (T6)  n_k
C1        0        0        8        0        0        0       8
C2        1        0        0        0        9       10      20
C3        1        8        0        0        0        0       9
C4        0        0        0        7        0        0       7
C5       10        0        0        0        0        0      10
m_r      12        8        8        7        9       10      N = 54

of the changes described in these two "MS" documents are quite similar to those contained in documents labeled as "CR", "PS", and "HP".

Given the actual classification, the quality of the obtained clustering can be evaluated by computing four indices: purity, precision, recall, and F-measure [36].

Let T = {T_1, ..., T_r, ..., T_R} be the true partitioning given by the process operators, where the partition T_r consists of all the documents with label r. Let m_r = |T_r| denote the number of documents in true partition T_r. Also, let C = {C_1, ..., C_k, ..., C_K} denote the clustering obtained via the SOM text clustering algorithm, and let n_k = |C_k| denote the number of documents in cluster C_k. The K × R contingency matrix N induced by clustering C and the true partitioning T can be obtained by computing the elements N(k, r) = n_kr = |C_k ∩ T_r|, where n_kr denotes the number of documents that are common to cluster C_k and true partition T_r. The contingency matrix for the SOM text clustering of ECRs is reported in Table 2.
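Building N from the cluster assignments and the operator-provided labels is a direct count; a small illustrative helper (the function and argument names are ours, not from the paper):

```python
def contingency(cluster_of, label_of, cluster_ids, label_ids):
    """N[k][r] = |C_k ∩ T_r|: documents assigned to cluster k whose true label is r."""
    N = [[0] * len(label_ids) for _ in cluster_ids]
    for c, t in zip(cluster_of, label_of):
        N[cluster_ids.index(c)][label_ids.index(t)] += 1
    return N
```

Called with the five SOM clusters and the six operator labels, this yields the 5 × 6 matrix of Table 2.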

Starting from Table 2, the following indices are computed.

(i) Purity Index. The cluster-specific purity index of cluster C_k is defined as

purity_k = (1/n_k) max_{1≤r≤R} n_kr.  (8)

The overall purity index of the entire clustering C is computed as

purity = Σ_{k=1}^{K} (n_k/N) purity_k = (1/N) Σ_{k=1}^{K} max_{1≤r≤R} n_kr.  (9)

As shown in Table 3, clusters C1, C4, and C5 have purity index equal to 1, that is, they contain entities from only one partition. Clusters C2 and C3 gather entities from different partitions, that is, T1, T5, and T6 for cluster C2 and T1, T2 for cluster C3. The overall purity index is equal to 0.79.

(ii) Precision Index. Given a cluster C_k, let T_{r_k} denote the majority partition, that is, the partition containing the maximum number of documents from C_k, so that r_k = argmax_{r=1,...,R} n_kr. The precision index of a cluster C_k is given by

prec_k = (1/n_k) max_{1≤r≤R} n_kr = n_{k r_k}/n_k.  (10)

For the clustering in Table 2, the majority partitions are T_{r_1} = T3, T_{r_2} = T6, T_{r_3} = T2, T_{r_4} = T4, and T_{r_5} = T1. The precision indices show that all documents gathered in clusters C1, C4, and C5 belong to the corresponding majority partitions T3, T4, and T1. For cluster C2, 50% of the documents belong to T6, and, finally, 89% of the documents in cluster C3 belong to T2.

(iii) Recall Index. Given a cluster C_k, it is defined as

recall_k = n_{k r_k}/|T_{r_k}| = n_{k r_k}/m_{r_k},  (11)

where m_{r_k} = |T_{r_k}|. It measures the fraction of documents in partition T_{r_k} shared in common with cluster C_k. The recall indices reported in Table 3 show that clusters C1, C2, C3, and C4 share 100% of the documents in the majority partitions T3, T6, T2, and T4, respectively. Cluster C5 shares 83% of the documents in T1.

(iv) F-Measure Index. It is the harmonic mean of the precision and recall values for each cluster. The F-measure for cluster C_k is therefore given as

F_k = 2 · prec_k · recall_k / (prec_k + recall_k) = 2 n_{k r_k} / (n_k + m_{r_k}).  (12)

The overall F-measure for the clustering C is the mean of the clusterwise F-measure values:

F = (1/K) Σ_{k=1}^{K} F_k.  (13)

Table 3 shows that the F-measure of clusters C1 and C4 is equal to 1, while the other values are less than 1. The low values of the F-measure for clusters C2, C3, and C5 depend on a low precision index for clusters C2 and C3 and on a low recall index for cluster C5. Consequently, the overall F-measure is equal to 0.90.
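Equations (8)-(13) can be checked directly against Table 2; the short script below recovers, up to rounding, the cluster-specific and overall indices reported in Table 3.

```python
# Contingency matrix from Table 2: rows C1..C5, columns T1..T6 (MS, CR, AN, SR, HP, PS)
N = [[0, 0, 8, 0, 0, 0],
     [1, 0, 0, 0, 9, 10],
     [1, 8, 0, 0, 0, 0],
     [0, 0, 0, 7, 0, 0],
     [10, 0, 0, 0, 0, 0]]

n_docs = sum(map(sum, N))                             # N = 54
m = [sum(row[r] for row in N) for r in range(6)]      # partition sizes m_r

prec = [max(row) / sum(row) for row in N]             # eq. (10); equals purity_k, eq. (8)
overall_purity = sum(max(row) for row in N) / n_docs  # eq. (9)
recall = [max(row) / m[row.index(max(row))] for row in N]  # eq. (11)
f = [2 * p * r / (p + r) for p, r in zip(prec, recall)]    # eq. (12)
overall_f = sum(f) / len(f)                                # eq. (13)
```

This gives prec = [1, 0.5, 8/9, 1, 1], recall = [1, 1, 1, 1, 10/12], overall purity 43/54, and overall F ≈ 0.90, matching Table 3 after rounding.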


Table 3: Clustering validation (cluster-specific and overall indices).

Purity:     purity_1 = 1, purity_2 = 0.5, purity_3 = 0.89, purity_4 = 1, purity_5 = 1; overall purity = 0.79
Precision:  prec_1 = 1, prec_2 = 0.5, prec_3 = 0.89, prec_4 = 1, prec_5 = 1
Recall:     recall_1 = 1, recall_2 = 1, recall_3 = 1, recall_4 = 1, recall_5 = 0.83
F-measure:  F_1 = 1, F_2 = 0.66, F_3 = 0.94, F_4 = 1, F_5 = 0.91; overall F = 0.90

Table 4: Results of SOM-based classification. The first and second columns report the labels and the number of ECR texts of the actual classification. The third column reports the number of ECRs correctly classified through the label of the first associated BMU. The fourth column reports the number of documents associated with an empty first BMU but correctly classified by considering the label of the documents sharing the second associated BMU.

Labels   Number of ECRs   Classified through the 1st BMU   Classified through the 2nd BMU
MS       12               10                               2
CR       8                6                                2
AN       8                7                                1
SR       7                6                                1
HP       9                6                                3
PS       10               10                               —

Given the actual classification, the SOM can be further validated through a leave-one-out cross validation technique in order to check its classification ability. In particular, N − 1 ECR documents are used for training and the remaining one for testing (iterating until each ECR text in the dataset has been used for testing).

At each iteration, once the SOM has been trained on N − 1 ECRs and the testing sample is presented as input, a BMU is selected in the matching step of the SOM algorithm, and the label of the training documents associated with that BMU is considered. In the case of an empty BMU, that is, one not associated with any training document, the closest unit associated with at least one training document is considered instead, while in the case of a BMU associated with training documents having more than one label, the label with the greater number of documents is considered.

Table 4 shows the results of the leave-one-out cross validation. For each row, that is, for a given ECR label, the second column reports the total number of documents in the dataset, while the last two columns report the number of testing ECRs correctly classified by the SOM. In particular, the third column reports the number of testing ECRs correctly classified because they were connected to a first BMU associated with training documents with the same label. The last column refers to the number of ECRs associated with an empty first BMU that nevertheless resulted closest to a second BMU related to documents belonging to the same class as the testing sample. The cross validation study also demonstrates that the labels given by the SOM are coherent with the actual classification and confirms the ability of the SOM as a classification tool.
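The classification rule used in the cross validation (majority label of the BMU, falling back to the nearest non-empty unit) can be sketched as below; the function and variable names are ours, not from the paper.

```python
import numpy as np
from collections import Counter

def classify_by_bmu(x, prototypes, train_bmus, train_labels):
    """Label a new sample by the majority label of the training documents sharing
    its BMU; if that unit is empty, fall back to the nearest non-empty unit."""
    order = np.argsort(((prototypes - x) ** 2).sum(axis=1))  # units, closest first
    for unit in order:
        labels = [l for b, l in zip(train_bmus, train_labels) if b == unit]
        if labels:
            return Counter(labels).most_common(1)[0][0]
```

In the leave-one-out loop, `train_bmus` holds the BMU index of each of the N − 1 training documents on the map trained without the held-out ECR.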

6. Conclusions

In this paper, a real case study regarding the engineering change process in the complex products industry was conducted. The study concerned the postchange stage of the engineering change process, in which past engineering change data are analyzed to discover information exploitable in new engineering changes. In particular, SOM was used for clustering of natural language texts produced during the engineering change process. The analyzed texts included the descriptions of the causes of changes contained in the ECR forms. Firstly, the SOM algorithm was used as a clustering tool to find relationships between the ECR texts and to cluster them accordingly. Subsequently, SOM was tested as a classification tool, and the results were validated through a leave-one-out cross validation technique.

The results of the real case study showed that SOM text clustering can be an effective tool for improving the engineering change process analysis. In particular, some of the advantages highlighted in this study are as follows.

(1) Text mining methods allow analyzing unstructured data and deriving high-quality information. The main difficulty in ECR analysis consisted in analyzing natural language texts.

(2) Clustering analysis of past ECRs stored in the company allows automatically gathering ECRs on the basis of the similarity between documents. When a new change is triggered, the company can quickly focus on


the cluster of interest. Clustering can support the company in knowing whether a similar change was already managed in the past, in analyzing the best solution adopted, and in learning to avoid the mistakes made in the past.

(3) The use of SOM for ECR text clustering allows automatically organizing large document collections. With respect to other clustering algorithms, the main advantage of SOM text clustering is that the similarity of the texts is preserved in the spatial organization of the neurons. The distance among prototypes in the SOM map can therefore be considered as an estimate of the similarity between documents belonging to different clusters. In addition, the SOM can first be computed using a representative subset of old input data. New input can be mapped straight into the most similar model without recomputing the whole mapping.

Nevertheless, the study showed some limitations of the application of SOM text clustering and classification. A first limitation is linked to natural language texts: the terms contained in different texts may be similar even if an engineering change request concerns a different product, and this similarity of terms may influence the performance of SOM-based clustering. A second limitation is linked to the use of SOM as a classification method. Classification indeed requires the labeling of a training dataset. This activity requires a deep knowledge of the different kinds of ECRs managed during the engineering change process and may be difficult and time consuming.

In summary, the use of SOM text clustering in engineering change process analysis appears to be a promising direction for further research. A future direction of this work will consider the use of SOM text clustering on a larger dataset of ECRs, comparing SOM with other clustering algorithms such as K-means or hierarchical clustering methods. Another direction of future research concerns the analysis of SOM robustness to parameter selection (i.e., the size and structure of the map, parameters, and kinds of learning and neighborhood functions).

Symbols

α(t): Scalar-valued learning-rate factor
N: Contingency matrix
δ(C_k, C_l): Between-cluster distance
Δ(C_k): Within-cluster distance
Δm_i(t): Adjustment of the prototype vector m_i(t)
C: Collection of documents
C_k: Cluster of documents of index k
T: True partitioning of the documents given by process operators
T_r: True partition of documents with label r
T_{r_k}: Majority partition containing the maximum number of documents from C_k
σ(t): Width of the kernel, which corresponds to the radius of N_c(t)
c(x_j): Index of the BMU associated with x_j
d_j: Document of index j
DBI(K): Davies-Bouldin Index for K clusters
df_p: Number of documents in the collection C which contain term p
h_ci(t): Neighborhood function
K: Total number of clusters
K*: Optimum number of clusters
M: Total number of terms in the document collection
m_r: Number of documents in partition T_r
N: Total number of processed documents
N_c(t): Neighborhood around node c
n_k: Number of documents in cluster C_k
n_kr: Number of documents common to cluster C_k and partition T_r
P: Total number of units of the SOM
t: Index of time
tf_pj: Term frequency of term p in the jth document
w_pj: Weight associated with term p in the jth document
m_c: Prototype vector associated with the BMU
m_i: Prototype vector of index i
r_c: Position vector of index c
r_i: Position vector of index i
x_j: Numerical feature vector associated with the jth document

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research has been funded by MIUR, ITIA CNR, and Cluster Fabbrica Intelligente (CFI): Sustainable Manufacturing project (CTN01 00163 148175) and Vis4Factory project (Sistemi Informativi Visuali per i processi di fabbrica nel settore dei trasporti, PON02 00634 3551288). The authors are thankful to the technical and managerial staff of MerMec SpA (Italy) for providing the industrial case study of this research.

References

[1] T. Jarratt, J. Clarkson, and C. E. Eckert, Design Process Improvement, Springer, New York, NY, USA, 2005.

[2] C. Wanstrom and P. Jonsson, "The impact of engineering changes on materials planning," Journal of Manufacturing Technology Management, vol. 17, no. 5, pp. 561-584, 2006.

[3] S. Leng, L. Wang, G. Chen, and D. Tang, "Engineering change information propagation in aviation industrial manufacturing execution processes," The International Journal of Advanced Manufacturing Technology, vol. 83, no. 1, pp. 575-585, 2016.

[4] W.-H. Wu, L.-C. Fang, W.-Y. Wang, M.-C. Yu, and H.-Y. Kao, "An advanced CMII-based engineering change management framework: the integration of PLM and ERP perspectives," International Journal of Production Research, vol. 52, no. 20, pp. 6092-6109, 2014.

[5] T. A. W. Jarratt, C. M. Eckert, N. H. M. Caldwell, and P. J. Clarkson, "Engineering change: an overview and perspective on the literature," Research in Engineering Design, vol. 22, no. 2, pp. 103-124, 2011.

[6] B. Hamraz, N. H. M. Caldwell, and P. J. Clarkson, "A holistic categorization framework for literature on engineering change management," Systems Engineering, vol. 16, no. 4, pp. 473-505, 2013.

[7] H. Ritter, T. Martinetz, and K. Schulten, Neural Computation and Self-Organizing Maps: An Introduction, Addison-Wesley, New York, NY, USA, 1992.

[8] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas, SOM Toolbox for Matlab 5, 2005, http://www.cis.hut.fi/somtoolbox.

[9] E. Fricke, B. Gebhard, H. Negele, and E. Igenbergs, "Coping with changes: causes, findings, and strategies," Systems Engineering, vol. 3, no. 4, pp. 169-179, 2000.

[10] J. Vesanto and E. Alhoniemi, "Clustering of the self-organizing map," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 586-600, 2000.

[11] A. Sharafi, P. Wolf, and H. Krcmar, "Knowledge discovery in databases on the example of engineering change management," in Proceedings of the Industrial Conference on Data Mining - Poster and Industry Proceedings, pp. 9-16, IBaI, July 2010.

[12] F. Elezi, A. Sharafi, A. Mirson, P. Wolf, H. Krcmar, and U. Lindemann, "A Knowledge Discovery in Databases (KDD) approach for extracting causes of iterations in Engineering Change Orders," in Proceedings of the ASME International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pp. 1401-1410, 2011.

[13] A. Sharafi, Knowledge Discovery in Databases: An Analysis of Change Management in Product Development, Springer, 2013.

[14] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, no. 1, pp. 59-69, 1982.

[15] T. Kohonen, "Essentials of the self-organizing map," Neural Networks, vol. 37, pp. 52-65, 2013.

[16] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480, 1990.

[17] Y.-C. Liu, M. Liu, and X.-L. Wang, "Application of self-organizing maps in text clustering: a review," in Applications of Self-Organizing Maps, M. Johnsson, Ed., chapter 10, InTech, Rijeka, Croatia, 2012.

[18] A. Ultsch and H. P. Siemon, "Kohonen's self organizing feature maps for exploratory data analysis," in Proceedings of the International Neural Networks Conference (INNC '90), pp. 305-308, Kluwer, Paris, France, 1990.

[19] J. Vesanto, "SOM-based data visualization methods," Intelligent Data Analysis, vol. 3, no. 2, pp. 111-126, 1999.

[20] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297, 1967.

[21] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224-227, 1979.

[22] N. Yorek, I. Ugulu, and H. Aydin, "Using self-organizing neural network map combined with Ward's clustering algorithm for visualization of students' cognitive structural models about aliveness concept," Computational Intelligence and Neuroscience, vol. 2016, Article ID 2476256, 14 pages, 2016.

[23] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "WEBSOM - self-organizing maps of document collections," in Proceedings of the Workshop on Self-Organizing Maps (WSOM '97), pp. 310-315, Espoo, Finland, June 1997.

[24] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM - self-organizing maps of document collections," Neurocomputing, vol. 21, no. 1-3, pp. 101-117, 1998.

[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513-523, 1988.

[26] G. Salton and R. K. Waldstein, "Term relevance weights in on-line information retrieval," Information Processing & Management, vol. 14, no. 1, pp. 29-35, 1978.

[27] Y. C. Liu, X. Wang, and C. Wu, "ConSOM: a conceptional self-organizing map model for text clustering," Neurocomputing, vol. 71, no. 4-6, pp. 857-862, 2008.

[28] Y.-C. Liu, C. Wu, and M. Liu, "Research of fast SOM clustering for text information," Expert Systems with Applications, vol. 38, no. 8, pp. 9325-9333, 2011.

[29] T. Kohonen, S. Kaski, K. Lagus et al., "Self organization of a massive document collection," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 574-585, 2000.

[30] J. Zhu and S. Liu, "SOM network based clustering analysis of real estate enterprises," American Journal of Industrial and Business Management, vol. 4, no. 3, pp. 167-173, 2014.

[31] B. C. Till, J. Longo, A. R. Dobell, and P. F. Driessen, "Self-organizing maps for latent semantic analysis of free-form text in support of public policy analysis," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 1, pp. 71-86, 2014.

[32] G. Q. Huang, W. Y. Yee, and K. L. Mak, "Current practice of engineering change management in Hong Kong manufacturing industries," Journal of Materials Processing Technology, vol. 139, no. 1-3, pp. 481-487, 2003.

[33] S. Angers, "Changing the rules of the 'Change' game - employees and suppliers work together to transform the 737/757 production system," 2002, http://www.boeing.com/news/frontiers/archive/2002/may/i_ca2.html.

[34] E. Subrahmanian, C. Lee, and H. Granger, "Managing and supporting product life cycle through engineering change management for a complex product," Research in Engineering Design, vol. 26, no. 3, pp. 189-217, 2015.

[35] D. Zeimpekis and E. Gallopoulos, TMG: A Matlab Toolbox for Generating Term-Document Matrices from Text Collections, 2006, http://scgroup20.ceid.upatras.gr:8000/tmg/.

[36] M. J. Zaki and W. Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014.




Figure 2: The real engineering change process in complex products environments: the process starts from a change trigger and proceeds through (1) engineering change request, (2) identification of possible solution(s) to the change request, (3) technical evaluation of the change, (4) economic evaluation of the change, (5) selection and approval of a solution, (6) implementation of the solution, and (7) updating of as-built documents, with phases spanning before, during, and after approval. Adapted from Jarratt et al. [5].

systems are likely to be affected, the person and the department making the request, and so forth.

(2) Potential solutions to the request for change are identified.

(3) Technical evaluation of the change is carried out. In this phase, the technical impact of implementing each solution is assessed. Various factors are considered, for example, the impact upon design and product requirements, the production schedule, and the resources to be devoted.

(4) Economic evaluation of the change is performed. The economic risk of implementing each solution is assessed. In this phase, costs related to extra production times, replacements of materials, penalties for missed due dates, and so forth are estimated.

(5) Once a particular solution is selected, it is approved or not approved. The change is reviewed and a cost-benefit analysis is carried out. When a solution is approved, the engineering change order is prepared and issued.

(6) Implementation of the engineering change takes place, and the documents, such as drawings, that are to be updated are identified.

(7) Update of the as-built documents occurs. As-built documents are usually the original design documents, revised to reflect any changes made during the process, that is, design changes, material changes, and so forth.

Iterations of the process occur for example when a par-ticular solution has negative impact on product requirementsor is too risky to be implemented so the process returns tophase 2 and another solution is identified Another iterationis possible when the costs of a solution are too high or morerisk analysis is required or when the proposed solution iscompletely refused

As shown in Figure 2 no review process of similarchanges faced in the past is carried out during the processor at the endThis aspect is emphasized by Jarratt et al [5] byhighlighting that after a period of time the change should bereviewed to verify if it achieved what was initially intendedand what lessons can be learned for future change processVarious factors can discourage examining the solutionsadopted in the past to a particular change First of all thereis the lack of opportune methods to analyze the documentscollected during the process that is ECR ECRs are oftenforms containing partswritten in natural languageAnalyzingthese kinds of documents in the design phase of a product ora component or when a new change request occurs could bevery time consuming without an appropriate solution

In this context SOM text clustering application canimprove the process When a new ECR occurs in fact ECRsmanaged in the past and similar to the current request couldbe analyzed in order to evaluate the best solution and toavoid repeating the same mistakes made in the past In orderto explore the existence of similarity between the differentECRs texts the first step is to verify the potential clusteringpresent in the analyzed dataset The application of SOM textclustering to ECR documents is explored in the next section

5. The Use of SOM Text Clustering for ECRs Analysis

In order to test SOM text clustering, we used a dataset of N = 54 ECRs representing engineering changes managed during the engineering change process of a railway company. The dataset included the natural language descriptions of the causes of changes contained in the ECR forms. The documents were written in Italian, and the VSM described in Section 3 was used to generate output vectors from the input text documents. The number of terms, that is, the dimension of each vector x_j associated with each document

6 Computational Intelligence and Neuroscience

Figure 3: U-matrix of the 54 ECRs texts preprocessed through the VSM. The color scale is related to the distances between map units: red colors represent large distances and blue colors represent small distances.

in the dataset, after the stop word removal and stemming processes, was equal to M = 361.
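As an illustration of the preprocessing just described, the following Python sketch builds tf-idf weighted document vectors from a toy set of change descriptions. It is a minimal stand-in for the TMG toolbox used in the paper: the stop-word list, the truncation "stemmer", and the example documents are all hypothetical, and the weighting tf × log(N/df) is one common tf-idf variant, not necessarily the exact one used in the paper.

```python
import math

def build_vsm(docs, stop_words):
    """Toy vector space model: stop-word removal, a crude 'stemmer' that
    truncates words to 6 characters, and tf-idf weighting
    w_pj = tf_pj * log(N / df_p). Illustrative only."""
    stemmed = [[w[:6] for w in d.lower().split() if w not in stop_words]
               for d in docs]
    terms = sorted({t for d in stemmed for t in d})
    N = len(docs)
    # document frequency df_p of each term p
    df = {t: sum(1 for d in stemmed if t in d) for t in terms}
    X = [[d.count(t) * math.log(N / df[t]) for t in terms] for d in stemmed]
    return terms, X

# hypothetical change-cause snippets (the real ECRs are written in Italian)
docs = ["antenna bracket cracked", "metal sheet bent on the panel",
        "antenna cable broken"]
terms, X = build_vsm(docs, stop_words={"the", "on"})
```

A term that occurs in a document gets a positive weight unless it appears in every document (then log(N/df) = 0), which is the usual tf-idf behavior of discounting uninformative terms.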

In our work, we used the MATLAB software. Specifically, the Term to Matrix Generator (TMG) toolbox [35] for document preprocessing and the SOM toolbox [8] for SOM text clustering were employed.

The map size of the SOM is computed through the heuristic formula in Vesanto et al. [8]. In detail, the number of neurons is computed as 5√N, where N is the number of training samples. The map shape is a rectangular grid with hexagonal lattice. The neighborhood function is Gaussian, and the map is trained using the batch version of SOM. After the map is trained, each data vector is mapped to the most similar prototype vector in the map, that is, the BMU, which results from the matching step in the SOM algorithm. In our case, the network structure is a 2D lattice of 7 × 5 hexagonal units.
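The training procedure described above can be sketched as follows. This is a simplified batch SOM in Python, not the SOM Toolbox implementation: it uses a rectangular (not hexagonal) lattice, a Gaussian neighborhood, and an ad hoc linearly shrinking radius; the toy data and the 3 × 3 grid are illustrative.

```python
import math, random

def batch_som(X, rows, cols, iters=20, sigma0=2.0):
    """Minimal batch SOM on a rectangular grid (simplification of the
    hexagonal lattice used in the paper)."""
    random.seed(0)
    dim, P = len(X[0]), rows * cols
    protos = [[random.random() for _ in range(dim)] for _ in range(P)]
    pos = [(i // cols, i % cols) for i in range(P)]   # grid coordinates

    def bmu(x):  # best-matching unit = closest prototype (matching step)
        return min(range(P),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(x, protos[i])))

    for t in range(iters):
        sigma = sigma0 * (1 - t / iters) + 0.5        # shrinking radius
        bmus = [bmu(x) for x in X]
        for i in range(P):
            num, den = [0.0] * dim, 0.0
            for x, c in zip(X, bmus):
                d2 = (pos[i][0] - pos[c][0]) ** 2 + (pos[i][1] - pos[c][1]) ** 2
                h = math.exp(-d2 / (2 * sigma * sigma))   # Gaussian neighborhood
                den += h
                num = [nv + h * xv for nv, xv in zip(num, x)]
            if den > 0:
                protos[i] = [nv / den for nv in num]      # batch update
    return protos, [bmu(x) for x in X]

# heuristic map size of Vesanto et al.: about 5 * sqrt(N) units in total
munits = 5 * math.sqrt(54)   # ~36.7 for N = 54; the paper uses a 7 x 5 grid
protos, bmus = batch_som([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]], 3, 3)
```

After training, well-separated inputs map to different BMUs while the map topology keeps similar inputs close on the grid.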

A first step in cluster analysis based on SOM relies on visual inspection through the U-matrix and the Component Planes.

The U-matrix obtained by applying SOM to the ECR dataset is shown in Figure 3. In order to represent additional information (i.e., distances), the SOM map size is augmented by inserting an additional neuron between each pair of neurons and reporting the distance between the two. The U-matrix, with the different colors linked to the distance between neurons on the map, shows that five clusters are present in the dataset. Each cluster has been highlighted by a circle.
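The U-matrix construction can be sketched as follows. For simplicity, this version computes, for each unit of a rectangular grid, the average distance between its prototype and those of its 4-neighbors, rather than inserting extra units on a hexagonal lattice as in Figure 3.

```python
import math

def u_matrix(protos, rows, cols):
    """For each unit, the average Euclidean distance between its prototype
    and those of its grid neighbors (4-neighborhood). Large values mark
    cluster borders; small values mark cluster interiors."""
    def d(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    U = []
    for r in range(rows):
        U.append([])
        for c in range(cols):
            here = protos[r * cols + c]
            nbrs = [protos[nr * cols + nc]
                    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= nr < rows and 0 <= nc < cols]
            U[-1].append(sum(d(here, p) for p in nbrs) / len(nbrs))
    return U

# toy 2 x 2 grid whose prototypes sit on the corners of the unit square
U = u_matrix([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], 2, 2)
```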

Besides looking at the overall differences in the U-matrix, it is interesting as well to look at the differences between the components of the input vectors, that is, at the differences regarding each component associated with a single "term" in the input dataset. The total number of Component Planes obtained from the SOM corresponds to the total number of terms in our dataset, that is, M = 361. For illustrative purposes, Figure 4 shows two Component Planes chosen as an example. The first Component Plane

Table 1: Class label and number of ECR texts.

Class                      Label   Number of ECR texts
(1) Metal-Sheet            MS      12
(2) Carter                 CR      8
(3) Antenna                AN      8
(4) Semi-Finished Round    SR      7
(5) Hydraulic Panel        HP      9
(6) Pneumatic System       PS      10

(Figure 4(a)) is associated with the term of index p = 48, that is, the term "Antenna"; the second one (Figure 4(b)) is related to the term of index p = 195, that is, the term "Metal-Sheet". The difference between the two terms in the dataset can be seen by considering, for example, the map unit in the top left corner of the two figures. This map unit has high values for the variable "Term p = 48" and low values for the variable "Term p = 195". From the observation of the Component Planes, we can conclude that there is no correlation between the two terms; as a matter of fact, these two terms were never used together in the same documents.

As shown above, the U-matrix and the Component Planes allow obtaining a rough cluster structure of the ECR dataset. To get a finer clustering result, the prototype vectors from the map were clustered using the K-means algorithm. The best number of clusters in the SOM map grid can be determined by using the DBI values. Figure 5(a) shows the DBI values obtained by varying the number of clusters K in [2, 7]. The elbow point in the graph shows that the optimum number of clusters K*, corresponding to the minimum value of DBI(K), is equal to 5. The clustering result of SOM obtained by using K-means with K* = 5 clusters is shown in Figure 5(b). The BMUs belonging to each cluster are represented with a different color, and in each hexagon the number of documents associated with each BMU is provided.
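A minimal implementation of the Davies-Bouldin Index is sketched below. On hypothetical "prototype" vectors with three well-separated groups, it compares a partition that matches that structure with one that merges two groups, showing that the better partition yields the lower DBI value. The toy data are illustrative, and the K-means step used in the paper to produce the partitions is omitted here.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dbi(X, labels):
    """Davies-Bouldin Index: mean over clusters of the worst ratio
    (S_k + S_l) / d(c_k, c_l), where S_k is the within-cluster scatter
    (mean distance to the centroid) and d the centroid distance.
    Lower values indicate a better partition."""
    ks = sorted(set(labels))
    cent, S = {}, {}
    for k in ks:
        pts = [x for x, l in zip(X, labels) if l == k]
        cent[k] = [sum(col) / len(pts) for col in zip(*pts)]
        S[k] = sum(dist(x, cent[k]) for x in pts) / len(pts)
    R = [max((S[k] + S[l]) / dist(cent[k], cent[l]) for l in ks if l != k)
         for k in ks]
    return sum(R) / len(ks)

# toy stand-in for SOM prototype vectors: three well-separated groups
protos = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9], [10.0, 0.0], [10.1, 0.2]]
good = [0, 0, 1, 1, 2, 2]   # K = 3, matching the true structure
bad = [0, 0, 1, 1, 1, 1]    # K = 2, merging two distinct groups
```

Scanning K over a range and picking the minimum-DBI partition is the selection rule applied in Figure 5(a).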

5.1. External Validation of SOM Text Clustering. In the reference case study, the true classification of each ECR in the dataset was provided by process operators. Therefore, this information was used in our study in order to perform an external validation of the SOM text clustering. In particular, each ECR text was analyzed and classified with reference to the main component involved in the engineering change (namely, "Metal-Sheet", "Carter", "Antenna", "Semi-Finished Round", "Hydraulic Panel", and "Pneumatic System"). Table 1 reports the number of ECRs related to each component, along with the labels used to classify the ECR documents (namely, "MS", "CR", "AN", "SR", "HP", and "PS", respectively). Although a classification is available in the specific case study of this paper, it is worth noting that often such information may be unavailable.

By superimposing the classification information, the map grid resulting from the training of the SOM is as in Figure 6, where each hexagon reports the classification label of the documents sharing a given BMU (within brackets, the number of associated documents). From Figure 6, it can be observed that the unsupervised clustering produced by the SOM algorithm is quite coherent with the actual classification


Figure 4: Component Planes of two variables chosen as examples. (a) Component Plane related to the variable of index p = 48, corresponding to the term "Antenna"; (b) Component Plane of the variable of index p = 195, corresponding to the term "Metal-Sheet". (a) and (b) are linked by position: in each figure, the hexagon in a certain position corresponds to the same map unit. Each Component Plane shows the values of the variable in each map unit using color-coding. The letter "d" means that denormalized values are used, that is, the values in the original data context [8].

Figure 5: (a) Davies-Bouldin Index (DBI) values for the number of clusters K in [2, 7]. (b) Clustering result of SOM using K-means with K* = 5 clusters; in each hexagon, the number of documents sharing the same BMU is provided.

given by the process operators; in fact, ECR documents sharing the same classification label are included in BMUs belonging to the same cluster. It is worth noting that the actual label is assigned to each document after the SOM unsupervised training has been carried out. From Figure 6, it can also be

noted that the ECRs classified either as "PS" or as "HP" are all included in a single cluster. Furthermore, two documents out of twelve with label "MS" are assigned to clusters that include different labels, namely, "CR", "PS", and "HP". We investigated this misclassification, and we found that the causes


Figure 6: Labeled map grid of the SOM. Each hexagon reports the labels and the total number of documents sharing the same BMU. Coloring refers to the K-means unsupervised clustering of the SOM.

Table 2: Contingency matrix N.

      MS (T1)   CR (T2)   AN (T3)   SR (T4)   HP (T5)   PS (T6)   n_k
C1       0         0         8         0         0         0        8
C2       1         0         0         0         9        10       20
C3       1         8         0         0         0         0        9
C4       0         0         0         7         0         0        7
C5      10         0         0         0         0         0       10
m_r     12         8         8         7         9        10    N = 54

of the changes described in these two "MS" documents are quite similar to those contained in the documents labeled as "CR", "PS", and "HP".

Given the actual classification, the quality of the obtained clustering can be evaluated by computing four indices: purity, precision, recall, and F-measure [36].

Let T = {T_1, ..., T_r, ..., T_R} be the true partitioning given by the process operators, where the partition T_r consists of all the documents with label r. Let m_r = |T_r| denote the number of documents in true partition T_r. Also, let C = {C_1, ..., C_k, ..., C_K} denote the clustering obtained via the SOM text clustering algorithm, and let n_k = |C_k| denote the number of documents in cluster C_k. The K × R contingency matrix N induced by the clustering C and the true partitioning T is obtained by computing the elements N(k, r) = n_kr = |C_k ∩ T_r|, where n_kr denotes the number of documents that are common to cluster C_k and true partition T_r. The contingency matrix for the SOM text clustering of the ECRs is reported in Table 2.
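The contingency matrix N(k, r) = |C_k ∩ T_r| can be computed directly from the cluster assignments and the true labels; a minimal sketch, with hypothetical toy labels:

```python
def contingency(clusters, labels):
    """N(k, r) = number of documents assigned to cluster k whose true label
    is r. Rows follow the sorted cluster ids, columns the sorted label set."""
    ks = sorted(set(clusters))
    rs = sorted(set(labels))
    return [[sum(1 for c, l in zip(clusters, labels) if c == k and l == r)
             for r in rs] for k in ks]

# toy example: 4 documents, 2 clusters, true labels "A"/"B"
Nkr = contingency([0, 0, 1, 1], ["A", "A", "A", "B"])
```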

Starting from Table 2, the following indices are computed.

(i) Purity Index. The cluster-specific purity index of cluster C_k is defined as

purity_k = (1/n_k) max_{r=1,...,R} n_kr.  (8)

The overall purity index of the entire clustering C is computed as

purity = Σ_{k=1}^{K} (n_k/N) purity_k = (1/N) Σ_{k=1}^{K} max_{r=1,...,R} n_kr.  (9)

As shown in Table 3, clusters C1, C4, and C5 have a purity index equal to 1; that is, they contain entities from only one partition. Clusters C2 and C3 gather entities from different partitions, that is, T1, T5, and T6 for cluster C2 and T1, T2 for cluster C3. The overall purity index is equal to 0.79.

(ii) Precision Index. Given a cluster C_k, let T_{r_k} denote the majority partition, that is, the partition containing the maximum number of documents from C_k, with r_k = arg max_{r=1,...,R} n_kr. The precision index of cluster C_k is given by

prec_k = (1/n_k) max_{r=1,...,R} n_kr = n_{k,r_k}/n_k.  (10)

For the clustering in Table 2, the majority partitions are T3 for C1, T6 for C2, T2 for C3, T4 for C4, and T1 for C5. The precision indices show that all the documents gathered in clusters C1, C4, and C5 belong to their corresponding majority partitions T3, T4, and T1. For cluster C2, 50% of the documents belong to T6, and, finally, 89% of the documents in cluster C3 belong to T2.

(iii) Recall Index. Given a cluster C_k, it is defined as

recall_k = n_{k,r_k}/|T_{r_k}| = n_{k,r_k}/m_{r_k},  (11)

where m_{r_k} = |T_{r_k}|. It measures the fraction of documents in partition T_{r_k} shared in common with cluster C_k. The recall indices reported in Table 3 show that clusters C1, C2, C3, and C4 share 100% of the documents in their majority partitions T3, T6, T2, and T4, respectively. Cluster C5 shares 83% of the documents in T1.

(iv) F-Measure Index. It is the harmonic mean of the precision and recall values for each cluster. The F-measure for cluster C_k is therefore given as

F_k = (2 · prec_k · recall_k)/(prec_k + recall_k) = 2 n_{k,r_k}/(n_k + m_{r_k}).  (12)

The overall F-measure for the clustering C is the mean of the clusterwise F-measure values:

F = (1/K) Σ_{k=1}^{K} F_k.  (13)

Table 3 shows that the F-measure of clusters C1 and C4 is equal to 1, while the other values are less than 1. The low values of the F-measure for clusters C2, C3, and C5 depend on a low precision index for clusters C2 and C3 and on a low recall index for cluster C5. Consequently, the overall F-measure is equal to 0.90.
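The four indices can be computed directly from the contingency matrix of Table 2; the following sketch reproduces the values of Table 3 (up to rounding).

```python
Tbl = [  # contingency matrix of Table 2: rows C1..C5, columns MS CR AN SR HP PS
    [0, 0, 8, 0, 0, 0],
    [1, 0, 0, 0, 9, 10],
    [1, 8, 0, 0, 0, 0],
    [0, 0, 0, 7, 0, 0],
    [10, 0, 0, 0, 0, 0],
]
n = [sum(row) for row in Tbl]            # cluster sizes n_k
m = [sum(col) for col in zip(*Tbl)]      # partition sizes m_r
N_total = sum(n)                         # 54 documents

purity = [max(row) / nk for row, nk in zip(Tbl, n)]          # eq. (8), = prec_k
overall_purity = sum(max(row) for row in Tbl) / N_total      # eq. (9)
recall = [max(row) / m[row.index(max(row))] for row in Tbl]  # eq. (11)
F = [2 * p * r / (p + r) for p, r in zip(purity, recall)]    # eq. (12)
overall_F = sum(F) / len(F)                                  # eq. (13)
```

The computed values match Table 3: per-cluster purity (1, 0.5, 0.89, 1, 1), overall purity 0.79, recall for C5 of 0.83, and overall F-measure 0.90.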


Table 3: Clustering validation (cluster-specific and overall indices).

Purity:     purity_1 = 1, purity_2 = 0.5, purity_3 = 0.89, purity_4 = 1, purity_5 = 1; overall purity = 0.79
Precision:  prec_1 = 1, prec_2 = 0.5, prec_3 = 0.89, prec_4 = 1, prec_5 = 1
Recall:     recall_1 = 1, recall_2 = 1, recall_3 = 1, recall_4 = 1, recall_5 = 0.83
F-measure:  F_1 = 1, F_2 = 0.66, F_3 = 0.94, F_4 = 1, F_5 = 0.91; overall F = 0.90

Table 4: Results of SOM-based classification. The first and second columns report the labels and the number of ECR texts of the actual classification; the third column reports the number of ECRs correctly classified through the label of the first associated BMU; the fourth column reports the number of documents associated with an empty first BMU but correctly classified by considering the label of the documents sharing the second associated BMU.

Labels   Number of ECRs   Classified through the 1st BMU   Classified through the 2nd BMU
MS       12               10                               2
CR       8                6                                2
AN       8                7                                1
SR       7                6                                1
HP       9                6                                3
PS       10               10                               -

Given the actual classification, the SOM can be further validated through a leave-one-out cross validation technique in order to check its classification ability. In particular, N − 1 ECR documents are used for training and the remaining one for testing (iterating until each ECR text in the dataset has been used for testing).

At each iteration, once the SOM has been trained on N − 1 ECRs and the testing sample is presented as input, a BMU is selected in the matching step of the SOM algorithm, and the label of the training documents associated with that BMU is considered. In the case of an empty BMU, that is, one not associated with any training document, the closest unit associated with at least one training document is considered instead, while in the case of a BMU associated with training documents having more than one label, the label with the greater number of documents is considered.
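The BMU-based classification rule just described can be sketched as follows. The fallback here ranks units by prototype distance, so the "2nd BMU" is the next-closest non-empty unit; this is one plausible reading of the rule, and the prototypes and label lists below are hypothetical.

```python
from collections import Counter

def classify(x, protos, unit_docs):
    """unit_docs[i] = labels of the training documents mapped to unit i.
    Pick the BMU of x; if it is empty, fall back to the next-closest
    non-empty unit; return the majority label of that unit."""
    order = sorted(range(len(protos)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(x, protos[i])))
    for i in order:                 # 1st BMU, then 2nd-closest, and so on
        if unit_docs[i]:            # skip empty units
            return Counter(unit_docs[i]).most_common(1)[0][0]
    return None

# hypothetical prototypes and per-unit training labels
protos = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
unit_docs = [["MS", "MS", "CR"], [], ["AN"]]
```

For instance, a test vector whose first BMU is the empty middle unit falls back to the next-closest unit and receives its majority label, mirroring the "2nd BMU" column of Table 4.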

Table 4 shows the results of the leave-one-out cross validation. For each row, that is, for a given ECR label, the second column reports the total number of documents in the dataset, while the last two columns report the number of testing ECRs correctly classified by the SOM. In particular, the third column reports the number of testing ECRs correctly classified because they were connected to a first BMU associated with training documents with the same label. The last column refers to the number of ECRs associated with an empty first BMU that nevertheless resulted closest to a second BMU related to documents belonging to the same class as the testing sample. The cross validation study also demonstrates that the labels given by the SOM are coherent with the actual classification, and it confirms the ability of SOM as a classification tool.

6. Conclusions

In this paper, a real case study regarding the engineering change process in the complex products industry was conducted. The study concerned the postchange stage of the engineering change process, in which past engineering change data are analyzed to discover information exploitable in new engineering changes. In particular, SOM was used for clustering the natural language texts produced during the engineering change process. The analyzed texts included the descriptions of the causes of changes contained in the ECR forms. Firstly, the SOM algorithm was used as a clustering tool to find relationships between the ECR texts and to cluster them accordingly. Subsequently, SOM was tested as a classification tool, and the results were validated through a leave-one-out cross validation technique.

The results of the real case study showed that SOM text clustering can be an effective tool for improving the analysis of the engineering change process. In particular, some of the advantages highlighted in this study are as follows.

(1) Text mining methods allow analyzing unstructured data and deriving high-quality information. The main difficulty in ECR analysis consisted in analyzing natural language written texts.

(2) Clustering analysis of the past ECRs stored in the company allows automatically gathering ECRs on the basis of the similarity between documents. When a new change is triggered, the company can quickly focus on


the cluster of interest. Clustering can support the company in knowing whether a similar change was already managed in the past, in analyzing the best solution adopted, and in learning to avoid the mistakes made in the past.

(3) The use of SOM for ECR text clustering allows automatically organizing large document collections. With respect to other clustering algorithms, the main advantage of SOM text clustering is that the similarity of the texts is preserved in the spatial organization of the neurons. The distance among prototypes in the SOM map can therefore be considered an estimate of the similarity between documents belonging to different clusters. In addition, the SOM can first be computed using a representative subset of old input data; new input can then be mapped straight onto the most similar model without recomputing the whole mapping.

Nevertheless, the study showed some limitations of the application of SOM text clustering and classification. A first limitation is linked to natural language texts: the terms contained in different texts may be similar even if the engineering change requests concern different products, and this similarity of terms may influence the performance of SOM-based clustering. A second limitation is linked to the use of SOM as a classification method. Classification indeed requires the labeling of a training dataset. This activity requires a deep knowledge of the different kinds of ECRs managed during the engineering change process and may be difficult and time consuming.

In summary, the use of SOM text clustering in engineering change process analysis appears to be a promising direction for further research. A future direction of this work will consider the use of SOM text clustering on a larger dataset of ECRs, comparing SOM with other clustering algorithms such as K-means or hierarchical clustering methods. Another direction of future research concerns the analysis of the robustness of SOM to parameter selection (i.e., the size and structure of the map, and the parameters and kinds of learning and neighborhood functions).

Symbols

α(t): Scalar-valued learning-rate factor
N: Contingency matrix
δ(C_k, C_l): Between-cluster distance
Δ(C_k): Within-cluster distance
Δm_i(t): Adjustment of the prototype vector m_i(t)
C: Collection of documents
C_k: Cluster of documents of index k
T: True partitioning of the documents given by the process operators
T_r: True partition of the documents with label r
T_{r_k}: Majority partition containing the maximum number of documents from C_k
σ(t): Width of the kernel, corresponding to the radius of N_c(t)
c(x_j): Index of the BMU associated with x_j
d_j: Document of index j
DBI(K): Davies-Bouldin Index for K clusters
df_p: Number of documents in the collection C which contain term p
h_ci(t): Neighborhood function
K: Total number of clusters
K*: Optimum number of clusters
M: Total number of terms in the document collection
m_r: Number of documents in partition T_r
N: Total number of processed documents
N_c(t): Neighborhood around node c
n_k: Number of documents in cluster C_k
n_kr: Number of documents common to cluster C_k and partition T_r
P: Total number of units of the SOM
t: Index of time
tf_pj: Term frequency of term p in the jth document
w_pj: Weight associated with term p in the jth document
m_c: Prototype vector associated with the BMU
m_i: Prototype vector of index i
r_c: Position vector of index c
r_i: Position vector of index i
x_j: Numerical feature vector associated with the jth document

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research has been funded by MIUR, ITIA CNR, and the Cluster Fabbrica Intelligente (CFI) Sustainable Manufacturing project (CTN01 00163 148175) and the Vis4Factory project (Sistemi Informativi Visuali per i processi di fabbrica nel settore dei trasporti, PON02 00634 3551288). The authors are thankful to the technical and managerial staff of MerMec SpA (Italy) for providing the industrial case study of this research.

References

[1] T. Jarratt, J. Clarkson, and C. E. Eckert, Design Process Improvement, Springer, New York, NY, USA, 2005.

[2] C. Wanstrom and P. Jonsson, "The impact of engineering changes on materials planning," Journal of Manufacturing Technology Management, vol. 17, no. 5, pp. 561–584, 2006.

[3] S. Leng, L. Wang, G. Chen, and D. Tang, "Engineering change information propagation in aviation industrial manufacturing execution processes," The International Journal of Advanced Manufacturing Technology, vol. 83, no. 1, pp. 575–585, 2016.

[4] W.-H. Wu, L.-C. Fang, W.-Y. Wang, M.-C. Yu, and H.-Y. Kao, "An advanced CMII-based engineering change management framework: the integration of PLM and ERP perspectives," International Journal of Production Research, vol. 52, no. 20, pp. 6092–6109, 2014.


[5] T. A. W. Jarratt, C. M. Eckert, N. H. M. Caldwell, and P. J. Clarkson, "Engineering change: an overview and perspective on the literature," Research in Engineering Design, vol. 22, no. 2, pp. 103–124, 2011.

[6] B. Hamraz, N. H. M. Caldwell, and P. J. Clarkson, "A holistic categorization framework for literature on engineering change management," Systems Engineering, vol. 16, no. 4, pp. 473–505, 2013.

[7] H. Ritter, T. Martinetz, and K. Schulten, Neural Computation and Self-Organizing Maps: An Introduction, Addison-Wesley, New York, NY, USA, 1992.

[8] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas, SOM Toolbox for Matlab 5, 2005, http://www.cis.hut.fi/somtoolbox.

[9] E. Fricke, B. Gebhard, H. Negele, and E. Igenbergs, "Coping with changes: causes, findings and strategies," Systems Engineering, vol. 3, no. 4, pp. 169–179, 2000.

[10] J. Vesanto and E. Alhoniemi, "Clustering of the self-organizing map," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 586–600, 2000.

[11] A. Sharafi, P. Wolf, and H. Krcmar, "Knowledge discovery in databases on the example of engineering change management," in Proceedings of the Industrial Conference on Data Mining, Poster and Industry Proceedings, pp. 9–16, IBaI, July 2010.

[12] F. Elezi, A. Sharafi, A. Mirson, P. Wolf, H. Krcmar, and U. Lindemann, "A Knowledge Discovery in Databases (KDD) approach for extracting causes of iterations in Engineering Change Orders," in Proceedings of the ASME International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pp. 1401–1410, 2011.

[13] A. Sharafi, Knowledge Discovery in Databases: An Analysis of Change Management in Product Development, Springer, 2013.

[14] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, no. 1, pp. 59–69, 1982.

[15] T. Kohonen, "Essentials of the self-organizing map," Neural Networks, vol. 37, pp. 52–65, 2013.

[16] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.

[17] Y.-C. Liu, M. Liu, and X.-L. Wang, "Application of self-organizing maps in text clustering: a review," in Applications of Self-Organizing Maps, M. Johnsson, Ed., chapter 10, InTech, Rijeka, Croatia, 2012.

[18] A. Ultsch and H. P. Siemon, "Kohonen's self organizing feature maps for exploratory data analysis," in Proceedings of the International Neural Networks Conference (INNC '90), pp. 305–308, Kluwer, Paris, France, 1990.

[19] J. Vesanto, "SOM-based data visualization methods," Intelligent Data Analysis, vol. 3, no. 2, pp. 111–126, 1999.

[20] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, 1967.

[21] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224–227, 1979.

[22] N. Yorek, I. Ugulu, and H. Aydin, "Using self-organizing neural network map combined with Ward's clustering algorithm for visualization of students' cognitive structural models about aliveness concept," Computational Intelligence and Neuroscience, vol. 2016, Article ID 2476256, 14 pages, 2016.

[23] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "WEBSOM: self-organizing maps of document collections," in Proceedings of the Workshop on Self-Organizing Maps (WSOM '97), pp. 310–315, Espoo, Finland, June 1997.

[24] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM: self-organizing maps of document collections," Neurocomputing, vol. 21, no. 1-3, pp. 101–117, 1998.

[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.

[26] G. Salton and R. K. Waldstein, "Term relevance weights in on-line information retrieval," Information Processing & Management, vol. 14, no. 1, pp. 29–35, 1978.

[27] Y. C. Liu, X. Wang, and C. Wu, "ConSOM: a conceptional self-organizing map model for text clustering," Neurocomputing, vol. 71, no. 4-6, pp. 857–862, 2008.

[28] Y.-C. Liu, C. Wu, and M. Liu, "Research of fast SOM clustering for text information," Expert Systems with Applications, vol. 38, no. 8, pp. 9325–9333, 2011.

[29] T. Kohonen, S. Kaski, K. Lagus et al., "Self organization of a massive document collection," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 574–585, 2000.

[30] J. Zhu and S. Liu, "SOM network based clustering analysis of real estate enterprises," American Journal of Industrial and Business Management, vol. 4, no. 3, pp. 167–173, 2014.

[31] B. C. Till, J. Longo, A. R. Dobell, and P. F. Driessen, "Self-organizing maps for latent semantic analysis of free-form text in support of public policy analysis," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 1, pp. 71–86, 2014.

[32] G. Q. Huang, W. Y. Yee, and K. L. Mak, "Current practice of engineering change management in Hong Kong manufacturing industries," Journal of Materials Processing Technology, vol. 139, no. 1-3, pp. 481–487, 2003.

[33] S. Angers, "Changing the rules of the 'Change' game: employees and suppliers work together to transform the 737/757 production system," 2002, http://www.boeing.com/news/frontiers/archive/2002/may/i_ca2.html.

[34] E. Subrahmanian, C. Lee, and H. Granger, "Managing and supporting product life cycle through engineering change management for a complex product," Research in Engineering Design, vol. 26, no. 3, pp. 189–217, 2015.

[35] D. Zeimpekis and E. Gallopoulos, TMG: A Matlab Toolbox for Generating Term-Document Matrices from Text Collections, 2006, http://scgroup20.ceid.upatras.gr:8000/tmg/.

[36] M. J. Zaki and W. Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014.


Page 6: Research Article On the Use of Self-Organizing Map …downloads.hindawi.com/journals/cin/2016/5139574.pdfResearch Article On the Use of Self-Organizing Map for Text Clustering in Engineering

6 Computational Intelligence and Neuroscience

05

11

17

Figure 3 U-matrix of the 54 ERCs texts preprocessed throughthe VSM Color scale is related to distances between map unitsRed colors represent large distances and blue colors represent smalldistances

in the dataset after the stop word removal and stemmingprocesses was equal to 119872 = 361

In our work we used the MATLAB software SpecificallyTerm toMatrix Generator (TMG) toolbox [35] for documentpreprocessing and SOM toolbox for SOM text clustering [8]were employed

The map size of the SOM is computed through theheuristic formula in Vesanto et al [8] In detail the numberof neurons is computed as 5radic119873 where 119873 is the numberof training samples Map shape is a rectangular grid withhexagonal lattice The neighborhood function is Gaussianand the map is trained using the batch version of SOM Afterthe map is trained each data vector is mapped to the mostsimilar prototype vector in the map that is the BMU whichresults from the matching step in the SOM algorithm In ourcase the network structure is a 2D-lattice of 7 times 5 hexagonal

A first step in cluster analysis based on SOM is based onvisual inspection through U-matrix and Component Planes

The U-matrix obtained on the application of SOM toECR dataset is shown in Figure 3 In order to representadditional information (ie distances) the SOM map size isaugmented by inserting an additional neuron between eachpair of neurons and reporting the distance between the twoneurons The U-matrix and the different colors linked to thedistance between neurons on the map show that five clustersare present in the dataset Each cluster has been highlightedby a circle

Besides looking at the overall differences in the U-matrixit is interesting as well to look at the differences betweeneach component present in the input vectors meaning thatwe look at differences regarding each component associatedwith a single ldquotermrdquo in the input dataset The total numberof Component Planes obtained from the SOM correspondsto the total number of terms in our dataset that is 119872 =361 For illustrative purpose Figure 4 shows two ComponentPlanes chosen as an example The first Component Plane

Table 1 Class label and number of ECR texts

Class Label Number of ECR texts(1) Metal-Sheet MS 12(2) Carter CR 8(3) Antenna AN 8(4) Semi-Finished Round SR 7(5) Hydraulic Panel HP 9(6) Pneumatic System PS 10

(Figure 4(a)) is associated with the term of index 119901 = 48that is the term ldquoAntennardquo the second one (Figure 4(b)) isrelated to the term of index 119901 = 195 that is the term ldquoMetal-Sheetrdquo The difference between the two terms in the datasetcan be represented by considering for example the map unitin top left corner of the two figures This map unit has highvalues for variable ldquoTerm 119901 = 48rdquo and low values for variableldquoTerm119901 = 195rdquo From the observation of Component Planeswe can conclude that there is no correlation between the twoterms As a matter of fact these two terms were never usedtogether into the same documents

As shown above, the U-matrix and the Component Planes allow obtaining a rough cluster structure of the ECR dataset. To get a finer clustering result, the prototype vectors from the map were clustered using the K-means algorithm. The best number of clusters in the SOM map grid can be determined by using the DBI values. Figure 5(a) shows the DBI values obtained by varying the number of clusters K in [2, 7]. The elbow point in the graph shows that the optimum number of clusters K*, corresponding to the minimum value of DBI(K), is equal to 5. The clustering result of SOM obtained by using K-means with K* = 5 clusters is shown in Figure 5(b). The BMUs belonging to each cluster are represented with a different color, and in each hexagon the number of documents associated with each BMU is provided.
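The Davies-Bouldin Index used above to select K* can be computed as below. This is a minimal sketch in pure Python: the cluster assignments are assumed to come from some K-means run over the SOM prototype vectors (the paper uses the SOM Toolbox for this step), and lower DBI means more compact, better-separated clusters:

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin Index (DBI): lower is better.

    `points` are the vectors being clustered (here, SOM prototypes) and
    `labels` their cluster assignments, e.g., produced by K-means.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    clusters = sorted(set(labels))
    centroids, scatters = {}, {}
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
        # Within-cluster scatter: mean distance of members to the centroid.
        scatters[c] = sum(dist(p, centroids[c]) for p in members) / len(members)

    dbi = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        dbi += max((scatters[i] + scatters[j]) / dist(centroids[i], centroids[j])
                   for j in clusters if j != i)
    return dbi / len(clusters)

# Two tight, well-separated blobs give a much lower DBI than a bad split,
# which is why the minimum of DBI(K) identifies the best K.
pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
good = davies_bouldin(pts, [0, 0, 0, 1, 1, 1])
bad = davies_bouldin(pts, [0, 0, 1, 1, 0, 1])
print(good < bad)  # True
```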

5.1. External Validation of SOM Text Clustering. In the reference case study, the true classification of each ECR in the dataset was provided by process operators. Therefore, this information was used in our study in order to perform an external validation of the SOM text clustering. In particular, each ECR text was analyzed and classified with reference to the main component involved in the engineering change (namely, "Metal-Sheet", "Carter", "Antenna", "Semi-Finished Round", "Hydraulic Panel", and "Pneumatic System"). Table 1 reports the number of ECRs related to each component, along with the labels used in order to classify the ECR documents (namely, "MS", "CR", "AN", "SR", "HP", and "PS", respectively). Although a classification is available in the specific case study of this paper, it is worth noting that often such information may be unavailable.

By superimposing the classification information, the map grid resulting from the training of the SOM is as in Figure 6, where each hexagon reports the classification label of the documents sharing a given BMU (within brackets, the number of associated documents). From Figure 6 it can be observed that the unsupervised clustering produced by the SOM algorithm is quite coherent with the actual classification

Computational Intelligence and Neuroscience 7


Figure 4: Component Planes of two variables chosen as an example. (a) Component Plane related to the variable of index p = 48, corresponding to the term "Antenna". (b) Component Plane of the variable of index p = 195, corresponding to the term "Metal-Sheet". (a) and (b) are linked by position: in each figure, the hexagon in a certain position corresponds to the same map unit. Each Component Plane shows the values of the variable in each map unit using color-coding. The letter "d" means that denormalized values are used, that is, the values in the original data context [8].


Figure 5: (a) Davies-Bouldin Index (DBI) values for number of clusters K in [2, 7]. (b) Clustering result of SOM using K-means with K* = 5 clusters. In each hexagon, the number of documents sharing the same BMU is provided.

given by process operators; in fact, ECR documents sharing the same classification label are included in BMUs belonging to the same cluster. It is worth noting that the actual label is assigned to each document after the SOM unsupervised training has been carried out. From Figure 6 it can be also noted that ECRs classified either as "PS" or "HP" are all included in a unique cluster. Furthermore, two documents out of twelve with label "MS" are assigned to clusters that include different labels, namely, "CR", "PS", and "HP". We investigated this misclassification, and we found that the causes



Figure 6: Labeled map grid of SOM. In each hexagon, the labels and total number of documents sharing the same BMU. Coloring refers to the K-means unsupervised clustering of the SOM.

Table 2: Contingency matrix N.

        MS    CR    AN    SR    HP    PS
        T1    T2    T3    T4    T5    T6    n_k
C1      0     0     8     0     0     0     8
C2      1     0     0     0     9     10    20
C3      1     8     0     0     0     0     9
C4      0     0     0     7     0     0     7
C5      10    0     0     0     0     0     10
m_r     12    8     8     7     9     10    N = 54

of the changes described in these two "MS" documents are quite similar to those contained in documents labeled as "CR", "PS", and "HP".

Given the actual classification, the quality of the obtained clustering can be evaluated by computing four indices: purity, precision, recall, and F-measure [36].

Let T = {T_1, ..., T_r, ..., T_R} be the true partitioning given by the process operators, where partition T_r consists of all the documents with label r. Let m_r = |T_r| denote the number of documents in true partition T_r. Also, let C = {C_1, ..., C_k, ..., C_K} denote the clustering obtained via the SOM text clustering algorithm, and let n_k = |C_k| denote the number of documents in cluster C_k. The K × R contingency matrix N induced by clustering C and the true partitioning T can be obtained by computing the elements N(k, r) = n_kr = |C_k ∩ T_r|, where n_kr denotes the number of documents that are common to cluster C_k and true partition T_r. The contingency matrix for SOM text clustering of ECRs is reported in Table 2.
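Building the contingency matrix is a simple counting exercise. The sketch below assumes each document carries a (cluster, true label) pair; the function name is illustrative:

```python
from collections import Counter

def contingency(cluster_of, label_of, clusters, labels):
    """Build the K x R contingency matrix with N[k][r] = |C_k ∩ T_r|.

    `cluster_of[i]` and `label_of[i]` are the cluster assignment and the
    operator-given label of document i.
    """
    counts = Counter(zip(cluster_of, label_of))
    return [[counts[(k, r)] for r in labels] for k in clusters]

# Toy data reproducing two rows of Table 2: cluster C1 holds the 8 "AN"
# documents and cluster C4 the 7 "SR" documents.
cluster_of = ["C1"] * 8 + ["C4"] * 7
label_of = ["AN"] * 8 + ["SR"] * 7
N = contingency(cluster_of, label_of, ["C1", "C4"], ["AN", "SR"])
print(N)  # [[8, 0], [0, 7]]
```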

Starting from Table 2, the following indices are computed.

(i) Purity Index. The cluster-specific purity index of cluster $C_k$ is defined as
$$\mathrm{purity}_k = \frac{1}{n_k}\max_{r=1}^{R} n_{kr} \quad (8)$$
The overall purity index of the entire clustering $C$ is computed as
$$\mathrm{purity} = \sum_{k=1}^{K}\frac{n_k}{N}\,\mathrm{purity}_k = \frac{1}{N}\sum_{k=1}^{K}\max_{r=1}^{R} n_{kr} \quad (9)$$

As shown in Table 3, clusters C_1, C_4, and C_5 have purity index equal to 1, that is, they contain entities from only one partition. Clusters C_2 and C_3 gather entities from different partitions, that is, T_1, T_5, and T_6 for cluster C_2 and T_1, T_2 for cluster C_3. The overall purity index is equal to 0.79.

(ii) Precision Index. Given a cluster $C_k$, let $T_{r_k}$ denote the majority partition that contains the maximum number of documents from $C_k$, that is, $r_k = \arg\max_{r=1}^{R} n_{kr}$. The precision index of a cluster $C_k$ is given by
$$\mathrm{prec}_k = \frac{1}{n_k}\max_{r=1}^{R} n_{kr} = \frac{n_{k r_k}}{n_k} \quad (10)$$

For the clustering in Table 2, the majority partitions are T_3 (for C_1), T_6 (for C_2), T_2 (for C_3), T_4 (for C_4), and T_1 (for C_5). The precision indices show that all documents gathered in clusters C_1, C_4, and C_5 belong to the corresponding majority partitions T_3, T_4, and T_1. For cluster C_2, 50% of the documents belong to T_6, and, finally, 89% of the documents in cluster C_3 belong to T_2.

(iii) Recall Index. Given a cluster $C_k$, it is defined as
$$\mathrm{recall}_k = \frac{n_{k r_k}}{\left|T_{r_k}\right|} = \frac{n_{k r_k}}{m_{r_k}} \quad (11)$$

where m_{r_k} = |T_{r_k}|. It measures the fraction of documents in partition T_{r_k} shared in common with cluster C_k. The recall indices reported in Table 3 show that clusters C_1, C_2, C_3, and C_4 share 100% of the documents in the majority partitions T_3, T_6, T_2, and T_4, respectively. Cluster C_5 shares 83% of the documents in T_1.

(iv) F-Measure Index. It is the harmonic mean of the precision and recall values for each cluster. The F-measure for cluster $C_k$ is therefore given as
$$F_k = \frac{2\cdot\mathrm{prec}_k\cdot\mathrm{recall}_k}{\mathrm{prec}_k + \mathrm{recall}_k} = \frac{2\, n_{k r_k}}{n_k + m_{r_k}} \quad (12)$$

The overall F-measure for the clustering $C$ is the mean of the clusterwise F-measure values:
$$F = \frac{1}{K}\sum_{k=1}^{K} F_k \quad (13)$$

Table 3 shows that the F-measure of clusters C_1 and C_4 is equal to 1, while the other values are less than 1. The low values of the F-measure for clusters C_2, C_3, and C_5 depend on a low precision index for clusters C_2 and C_3 and on a low recall index for cluster C_5. Consequently, the overall F-measure is equal to 0.90.
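Indices (8)-(13) can be computed directly from the contingency matrix of Table 2; the short script below reproduces the values of Table 3 (within rounding: the overall purity is 43/54 ≈ 0.796, which Table 3 reports as 0.79). Variable names are illustrative:

```python
# Rows are clusters C1..C5; columns are partitions T1..T6 (MS CR AN SR HP PS).
N = [
    [0, 0, 8, 0, 0, 0],    # C1
    [1, 0, 0, 0, 9, 10],   # C2
    [1, 8, 0, 0, 0, 0],    # C3
    [0, 0, 0, 7, 0, 0],    # C4
    [10, 0, 0, 0, 0, 0],   # C5
]
m = [sum(col) for col in zip(*N)]   # partition sizes m_r
n = [sum(row) for row in N]         # cluster sizes n_k
total = sum(n)                      # total documents, N = 54

# Eq. (9): overall purity; Eq. (10): per-cluster precision.
purity = sum(max(row) for row in N) / total
prec = [max(row) / n_k for row, n_k in zip(N, n)]
# Eq. (11): recall uses the size of each cluster's majority partition.
recall = [max(row) / m[row.index(max(row))] for row in N]
# Eqs. (12)-(13): per-cluster and overall F-measure.
F = [2 * p * r / (p + r) for p, r in zip(prec, recall)]
F_overall = sum(F) / len(F)

print(round(purity, 2), round(F_overall, 2))  # 0.8 0.9
```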


Table 3: Clustering validation.

Purity:     purity_1 = 1, purity_2 = 0.5, purity_3 = 0.89, purity_4 = 1, purity_5 = 1; overall purity = 0.79
Precision:  prec_1 = 1, prec_2 = 0.5, prec_3 = 0.89, prec_4 = 1, prec_5 = 1
Recall:     recall_1 = 1, recall_2 = 1, recall_3 = 1, recall_4 = 1, recall_5 = 0.83
F-measure:  F_1 = 1, F_2 = 0.66, F_3 = 0.94, F_4 = 1, F_5 = 0.91; overall F = 0.90

Table 4: Results of SOM-based classification. The first and second columns report the labels and number of ECR texts of the actual classification. The third column reports the number of ECRs correctly classified through the label of the first associated BMU. The fourth column reports the number of documents associated with an empty first BMU but correctly classified by considering the label of the documents sharing the second associated BMU.

Labels   Number of ECRs   Classified through the 1st BMU   Classified through the 2nd BMU
MS       12               10                               2
CR       8                6                                2
AN       8                7                                1
SR       7                6                                1
HP       9                6                                3
PS       10               10                               —

Given the actual classification, the SOM can be further validated through a leave-one-out cross validation technique in order to check its classification ability. In particular, N − 1 ECR documents are used for training and the remaining one for testing (iterating until each ECR text in the data has been used for testing).

At each iteration, once the SOM has been trained on N − 1 ECRs and the testing sample is presented as input, a BMU is selected in the matching step of the SOM algorithm, and the label of the training documents associated with that BMU is considered. In the case of an empty BMU, that is, one not associated with any training documents, the closest unit associated with at least one training document is considered instead; in the case of a BMU associated with training documents carrying more than one label, the label with the greater number of documents is considered.
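The BMU-label decision rule just described (majority label, with fallback past empty units) can be sketched as follows. This is a hypothetical re-implementation, not the paper's Matlab code; `prototypes` and `bmu_docs` are assumed inputs mapping each map unit to its weight vector and to the labels of the training documents mapped there:

```python
import math

def classify(test_vec, prototypes, bmu_docs):
    """Assign a label via the BMU; fall back to the nearest non-empty unit.

    `prototypes`: dict unit -> weight vector.
    `bmu_docs`: dict unit -> list of training-document labels mapped to
    that unit (empty list for an empty BMU).
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Units ordered by distance to the test vector; the first is the BMU.
    order = sorted(prototypes, key=lambda u: dist(test_vec, prototypes[u]))
    for unit in order:
        if bmu_docs[unit]:                              # skip empty BMUs
            labels = bmu_docs[unit]
            return max(set(labels), key=labels.count)   # majority label
    raise ValueError("no unit has training documents")

protos = {0: [0.0, 0.0], 1: [1.0, 1.0], 2: [4.0, 4.0]}
docs = {0: [], 1: ["MS", "MS", "CR"], 2: ["AN"]}
print(classify([0.1, 0.0], protos, docs))  # unit 0 is empty -> "MS" from unit 1
```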

Table 4 shows the results of the leave-one-out cross validation. For each row, that is, for a given ECR label, the second column reports the total number of documents in the dataset, while the last two columns report the number of testing ECRs correctly classified by the SOM. In particular, the third column reports the number of testing ECRs correctly classified because they were connected to a first BMU associated with training documents with the same label. The last column refers to the number of ECRs associated with an empty first BMU that nevertheless resulted closest to a second BMU related to documents belonging to the same class as the testing sample. The cross validation study also demonstrates that the labels given by the SOM are coherent with the actual classification and confirms the ability of the SOM as a classification tool.

6. Conclusions

In this paper, a real case study regarding the engineering change process in the complex products industry was conducted. The study concerned the postchange stage of the engineering change process, in which past engineering change data are analyzed to discover information exploitable in new engineering changes. In particular, SOM was used for clustering of natural language written texts produced during the engineering change process. The analyzed texts included the descriptions of the causes of changes contained in the ECR forms. Firstly, the SOM algorithm was used as a clustering tool to find relationships between the ECR texts and to cluster them accordingly. Subsequently, SOM was tested as a classification tool, and the results were validated through a leave-one-out cross validation technique.

The results of the real case study showed that SOM text clustering can be an effective tool in the improvement of engineering change process analysis. In particular, some of the advantages highlighted in this study are as follows.

(1) Text mining methods allow analyzing unstructured data and deriving high-quality information. The main difficulty in ECR analysis consisted in analyzing natural language written texts.

(2) Clustering analysis of past ECRs stored in the company allows automatically gathering ECRs on the basis of the similarity between documents. When a new change triggers, the company can quickly focus on the cluster of interest. Clustering can support the company in knowing whether a similar change was already managed in the past, in analyzing the best solution adopted, and in learning to avoid the mistakes made in the past.

(3) The use of SOM for ECR text clustering allows automatically organizing large document collections. With respect to other clustering algorithms, the main advantage of SOM text clustering is that the similarity of the texts is preserved in the spatial organization of the neurons. The distance among prototypes in the SOM map can therefore be considered as an estimate of the similarity between documents belonging to different clusters. In addition, the SOM can first be computed using a representative subset of old input data. New input can be mapped straight into the most similar model without recomputing the whole mapping.

Nevertheless, the study showed some limitations of the application of SOM text clustering and classification. A first limitation is linked to natural language written texts. The terms contained in different texts may be similar even if an engineering change request concerns a different product. The similarity of terms may influence the performance of SOM-based clustering. A second limitation is linked to the use of SOM as a classification method. Classification, indeed, requires the labeling of a training dataset. This activity requires a deep knowledge of the different kinds of ECRs managed during the engineering change process and may be difficult and time consuming.

In summary, the use of SOM text clustering in engineering change process analysis appears to be a promising direction for further research. A future direction of the work will consider the use of SOM text clustering on a larger dataset of ECRs, comparing SOM with other clustering algorithms such as K-means or hierarchical clustering methods. Another direction of future research concerns the analysis of SOM robustness to parameter selection (i.e., the size and structure of the map, parameters, and kinds of learning and neighborhood functions).

Symbols

α(t): Scalar-valued learning-rate factor
N: Contingency matrix
δ(C_k, C_l): Between-cluster distance
Δ(C_k): Within-cluster distance
Δm_i(t): Adjustment of the prototype vector m_i(t)
C: Collection of documents
C_k: Cluster of documents of index k
T: True partitioning of documents given by process operators
T_r: True partition of documents with label r
T_{r_k}: Majority partition containing the maximum number of documents from C_k
σ(t): Width of the kernel, which corresponds to the radius of N_c(t)
c(x_j): Index of the BMU associated with x_j
d_j: Document of index j
DBI(K): Davies-Bouldin Index for K clusters
df_p: Number of documents in the collection C which contain term p
h_ci(t): Neighborhood function
K: Total number of clusters
K*: Optimum number of clusters
M: Total number of terms in the document collection
m_r: Number of documents in partition T_r
N: Total number of processed documents
N_c(t): Neighborhood around node c
n_k: Number of documents in cluster C_k
n_kr: Number of documents common to cluster C_k and partition T_r
P: Total number of units of the SOM
t: Index of time
tf_pj: Term frequency of term p in the jth document
w_pj: Weight associated with term p in the jth document
m_c: Prototype vector associated with the BMU
m_i: Prototype vector of index i
r_c: Position vector of index c
r_i: Position vector of index i
x_j: Numerical feature vector associated with the jth document

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research has been funded by MIUR, ITIA CNR, and Cluster Fabbrica Intelligente (CFI), Sustainable Manufacturing project (CTN01 00163 148175) and Vis4Factory project (Sistemi Informativi Visuali per i processi di fabbrica nel settore dei trasporti, PON02 00634 3551288). The authors are thankful to the technical and managerial staff of MerMec SpA (Italy) for providing the industrial case study of this research.

References

[1] T. Jarratt, J. Clarkson, and C. E. Eckert, Design Process Improvement, Springer, New York, NY, USA, 2005.

[2] C. Wanstrom and P. Jonsson, "The impact of engineering changes on materials planning," Journal of Manufacturing Technology Management, vol. 17, no. 5, pp. 561–584, 2006.

[3] S. Leng, L. Wang, G. Chen, and D. Tang, "Engineering change information propagation in aviation industrial manufacturing execution processes," The International Journal of Advanced Manufacturing Technology, vol. 83, no. 1, pp. 575–585, 2016.

[4] W.-H. Wu, L.-C. Fang, W.-Y. Wang, M.-C. Yu, and H.-Y. Kao, "An advanced CMII-based engineering change management framework: the integration of PLM and ERP perspectives," International Journal of Production Research, vol. 52, no. 20, pp. 6092–6109, 2014.

[5] T. A. W. Jarratt, C. M. Eckert, N. H. M. Caldwell, and P. J. Clarkson, "Engineering change: an overview and perspective on the literature," Research in Engineering Design, vol. 22, no. 2, pp. 103–124, 2011.

[6] B. Hamraz, N. H. M. Caldwell, and P. J. Clarkson, "A holistic categorization framework for literature on engineering change management," Systems Engineering, vol. 16, no. 4, pp. 473–505, 2013.

[7] H. Ritter, T. Martinetz, and K. Schulten, Neural Computation and Self-Organizing Maps: An Introduction, Addison-Wesley, New York, NY, USA, 1992.

[8] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas, SOM Toolbox for Matlab 5, 2005, http://www.cis.hut.fi/somtoolbox.

[9] E. Fricke, B. Gebhard, H. Negele, and E. Igenbergs, "Coping with changes: causes, findings and strategies," Systems Engineering, vol. 3, no. 4, pp. 169–179, 2000.

[10] J. Vesanto and E. Alhoniemi, "Clustering of the self-organizing map," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 586–600, 2000.

[11] A. Sharafi, P. Wolf, and H. Krcmar, "Knowledge discovery in databases on the example of engineering change management," in Proceedings of the Industrial Conference on Data Mining, Poster and Industry Proceedings, pp. 9–16, IBaI, July 2010.

[12] F. Elezi, A. Sharafi, A. Mirson, P. Wolf, H. Krcmar, and U. Lindemann, "A Knowledge Discovery in Databases (KDD) approach for extracting causes of iterations in Engineering Change Orders," in Proceedings of the ASME International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pp. 1401–1410, 2011.

[13] A. Sharafi, Knowledge Discovery in Databases: An Analysis of Change Management in Product Development, Springer, 2013.

[14] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, no. 1, pp. 59–69, 1982.

[15] T. Kohonen, "Essentials of the self-organizing map," Neural Networks, vol. 37, pp. 52–65, 2013.

[16] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.

[17] Y.-C. Liu, M. Liu, and X.-L. Wang, "Application of self-organizing maps in text clustering: a review," in Applications of Self-Organizing Maps, M. Johnsson, Ed., chapter 10, InTech, Rijeka, Croatia, 2012.

[18] A. Ultsch and H. P. Siemon, "Kohonen's self organizing feature maps for exploratory data analysis," in Proceedings of the International Neural Networks Conference (INNC '90), pp. 305–308, Kluwer, Paris, France, 1990.

[19] J. Vesanto, "SOM-based data visualization methods," Intelligent Data Analysis, vol. 3, no. 2, pp. 111–126, 1999.

[20] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, 1967.

[21] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224–227, 1979.

[22] N. Yorek, I. Ugulu, and H. Aydin, "Using self-organizing neural network map combined with Ward's clustering algorithm for visualization of students' cognitive structural models about aliveness concept," Computational Intelligence and Neuroscience, vol. 2016, Article ID 2476256, 14 pages, 2016.

[23] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "WEBSOM: self-organizing maps of document collections," in Proceedings of the Workshop on Self-Organizing Maps (WSOM '97), pp. 310–315, Espoo, Finland, June 1997.

[24] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM: self-organizing maps of document collections," Neurocomputing, vol. 21, no. 1–3, pp. 101–117, 1998.

[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.

[26] G. Salton and R. K. Waldstein, "Term relevance weights in on-line information retrieval," Information Processing & Management, vol. 14, no. 1, pp. 29–35, 1978.

[27] Y. C. Liu, X. Wang, and C. Wu, "ConSOM: a conceptional self-organizing map model for text clustering," Neurocomputing, vol. 71, no. 4–6, pp. 857–862, 2008.

[28] Y.-C. Liu, C. Wu, and M. Liu, "Research of fast SOM clustering for text information," Expert Systems with Applications, vol. 38, no. 8, pp. 9325–9333, 2011.

[29] T. Kohonen, S. Kaski, K. Lagus et al., "Self organization of a massive document collection," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 574–585, 2000.

[30] J. Zhu and S. Liu, "SOM network based clustering analysis of real estate enterprises," American Journal of Industrial and Business Management, vol. 4, no. 3, pp. 167–173, 2014.

[31] B. C. Till, J. Longo, A. R. Dobell, and P. F. Driessen, "Self-organizing maps for latent semantic analysis of free-form text in support of public policy analysis," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 1, pp. 71–86, 2014.

[32] G. Q. Huang, W. Y. Yee, and K. L. Mak, "Current practice of engineering change management in Hong Kong manufacturing industries," Journal of Materials Processing Technology, vol. 139, no. 1–3, pp. 481–487, 2003.

[33] S. Angers, "Changing the rules of the 'Change' game: employees and suppliers work together to transform the 737/757 production system," 2002, http://www.boeing.com/news/frontiers/archive/2002/may/i_ca2.html.

[34] E. Subrahmanian, C. Lee, and H. Granger, "Managing and supporting product life cycle through engineering change management for a complex product," Research in Engineering Design, vol. 26, no. 3, pp. 189–217, 2015.

[35] D. Zeimpekis and E. Gallopoulos, TMG: A Matlab Toolbox for Generating Term-Document Matrices from Text Collections, 2006, http://scgroup20.ceid.upatras.gr:8000/tmg.

[36] M. J. Zaki and W. Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014.



6 Conclusions

In this paper a real case study regarding the engineeringchange process in complex products industry was conductedThe study concerned the postchange stage of the engineeringchange process in which past engineering changes dataare analyzed to discover information exploitable in newengineering changes In particular SOM was used for clus-tering of natural language written texts produced during theengineering change process The analyzed texts included thedescriptions of the causes of changes contained in the ECRforms Firstly SOM algorithm was used as clustering tool tofind relationships between the ECR texts and to cluster themaccordingly Subsequently SOM was tested as classificationtool and the results were validated through a leave-one-outcross validation technique

The results of the real case study showed that the use of theSOM text clustering can be an effective tool in improvementof the engineering change process analysis In particularsome of the advantages highlighted in this study are asfollows

(1) Text mining methods allow analyzing unstructureddata and deriving high-quality informationThemaindifficulty in ECR analysis consisted in analyzingnatural language written texts

(2) Clustering analysis of past ECRs stored in the com-pany allows automatically gathering ECRs on thebasis of similarity between documents When a newchange triggers the company can quickly focus on

10 Computational Intelligence and Neuroscience

the cluster of interest Clustering can support thecompany to know if a similar change was alreadymanaged in the past to analyze the best solutionadopted and to learn avoiding the same mistakesmade in the past

(3) Use of SOM for ECRs text clustering allows automat-ically organizing large documents collection Withrespect to other clustering algorithms the mainadvantage of SOM text clustering is that the similarityof the texts is preserved in the spatial organization ofthe neurons The distance among prototypes in theSOMmap can therefore be considered as an estimateof the similarity between documents belonging toclusters In addition SOM can first be computedusing a representative subset of old input data Newinput can be mapped straight into the most similarmodel without recomputing the whole mapping

Nevertheless the study showed some limitations of theapplication of SOM text clustering and classification A firstlimitation is linked to natural language written texts Theterms contained in different texts may be similar even if anengineering change request concerns a different productThesimilarity of terms may influence the performance of SOM-based clustering A second limitation is linked to the use ofSOMas classificationmethod Classification indeed requiresthe labeling of a training datasetThis activity requires a deepknowledge of the different kinds of ECRs managed duringthe engineering change process andmay be difficult and timeconsuming

As a summary research on use of SOM text clusteringin engineering change process analysis appears to be apromising direction for further research A future directionof the work will consider the use of SOM text clustering ona larger dataset of ECRs comparing SOM with other clus-tering algorithms such as119870-means or hierarchical clusteringmethods Another direction of future research concerns theanalysis of SOM robustness to parameters selection (iethe size and structure of the map parameters and kinds oflearning and neighborhood functions)

Symbols

120572(119905) Scalar-valued learning-rate factorN Contingency matrix120575(C119896C119897) Between-cluster distanceΔ(C119896) Within-cluster distanceΔm119894(119905) Adjustment of the prototype vectorm119894(119905)C Collection of documentsC119896 Cluster of documents of index 119896T True partition of documents given by

process operatorsT119903 True partition of documents with label 119903T119903119896 Majority partition containing the

maximum number of documents fromC119896120590(119905) Width of the kernel which corresponds tothe radius of 119873119888(119905)119888(x119895) Index of the BMU associated with x119895119889119895 Document of index 119895

DBI(119870) Davies Bouldin Index for 119870 clusters119889119891119901 Number of documents in the collectionCwhich contains term 119901ℎ119888119894(119905) Neighborhood function119870 Total number of clusters119870lowast Optimum number of clusters119872 Total number of terms in the documentscollection119898119903 Number of documents in partitionT119903119873 Total number of processed documents119873119888(119905) Neighborhood around node 119888119899119896 Number of documents in clusterC119896119899119896119903 Number of documents common to clusterC119896 and partitionT119903119875 Total number of units of SOM119905 Index of time119905119891119901119895 Term frequency of term 119901 in the 119895thdocument119908119901119895 Weight associated with term 119901 in the 119895thdocument

m119888 Prototype vector associated with the BMUm119894 Prototype vector of index 119894r119888 Position vector of index 119888r119894 Position vector of index 119894x119895 Numerical feature vector associated with119895th document

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research has been funded by MIUR ITIA CNR andCluster Fabbrica Intelligente (CFI) Sustainable Manufactur-ing project (CTN01 00163 148175) and Vis4Factory project(Sistemi Informativi Visuali per i processi di fabbrica nelsettore dei trasporti PON02 00634 3551288)The authors arethankful to the technical and managerial staff of MerMecSpA (Italy) for providing the industrial case study of thisresearch

References

[1] T Jarratt J Clarkson and C E Eckert Design Process Improve-ment Springer New York NY USA 2005

[2] C Wanstrom and P Jonsson ldquoThe impact of engineeringchanges on materials planningrdquo Journal of Manufacturing Tech-nology Management vol 17 no 5 pp 561ndash584 2006

[3] S Leng L Wang G Chen and D Tang ldquoEngineering changeinformation propagation in aviation industrial manufacturingexecution processesrdquo The International Journal of AdvancedManufacturing Technology vol 83 no 1 pp 575ndash585 2016

[4] W-H Wu L-C Fang W-Y Wang M-C Yu and H-Y KaoldquoAn advanced CMII-based engineering change managementframework the integration of PLM and ERP perspectivesrdquoInternational Journal of Production Research vol 52 no 20 pp6092ndash6109 2014

Computational Intelligence and Neuroscience 11

[5] T A W Jarratt C M Eckert N H M Caldwell and P JClarkson ldquoEngineering change an overview and perspective onthe literaturerdquo Research in Engineering Design vol 22 no 2 pp103ndash124 2011

[6] B Hamraz N H M Caldwell and P J Clarkson ldquoA holisticcategorization framework for literature on engineering changemanagementrdquo Systems Engineering vol 16 no 4 pp 473ndash5052013

[7] H Ritter T Martinetz and K Schulten Neural Computationand Self-Organizing Maps An Introduction Addison-WesleyNew York NY USA 1992

[8] J Vesanto J Himberg E Alhoniemi and J Parhankangas SOMtoolbox for Matlab 5 2005 httpwwwcishutfisomtoolbox

[9] E Fricke BGebhardHNegele andE Igenbergs ldquoCopingwithchanges causes findings and strategiesrdquo Systems Engineeringvol 3 no 4 pp 169ndash179 2000

[10] J Vesanto and E Alhoniemi ldquoClustering of the self-organizingmaprdquo IEEE Transactions on Neural Networks vol 11 no 3 pp586ndash600 2000

[11] A Sharafi P Wolf and H Krcmar ldquoKnowledge discovery indatabases on the example of engineering change managementrdquoin Proceedings of the Industrial Conference on Data Mining-Poster and Industry Proceedings pp 9ndash16 IBaI July 2010

[12] F Elezi A Sharafi A Mirson P Wolf H Krcmar and ULindemann ldquoA Knowledge Discovery in Databases (KDD)approach for extracting causes of iterations in EngineeringChange Ordersrdquo in Proceedings of the ASME InternationalDesign Engineering Technical Conferences and Computers andInformation in Engineering Conference pp 1401ndash1410 2011

[13] A Sharafi Knowledge Discovery in Databases an Analysis ofChange Management in Product Development Springer 2013

[14] T Kohonen ldquoSelf-organized formation of topologically correctfeature mapsrdquo Biological Cybernetics vol 43 no 1 pp 59ndash691982

[15] T Kohonen ldquoEssentials of the self-organizing maprdquo NeuralNetworks vol 37 pp 52ndash65 2013

[16] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[17] Y-C Liu M Liu and X-L Wang ldquoApplication of self-organizing maps in text clustering a reviewrdquo in Applicationsof Self-Organizing Maps M Johnsson Ed chapter 10 InTechRijeka Croatia 2012

[18] A Ultsch and H P Siemon ldquoKohonenrsquos self organizing featuremaps for exploratory data analysisrdquo in Proceedings of theProceedings of International Neural Networks Conference (INNCrsquo90) pp 305ndash308 Kluwer Paris France 1990

[19] J Vesanto ldquoSOM-based data visualization methodsrdquo IntelligentData Analysis vol 3 no 2 pp 111ndash126 1999

[20] J MacQueen ldquoSome methods for classification and analysis ofmultivariate observationsrdquo in Proceedings of the 5th BerkeleySymposium onMathematical Statistics and Probability vol 1 pp281ndash297 1967

[21] D L Davies and DW Bouldin ldquoA cluster separation measurerdquoIEEE Transactions on Pattern Analysis andMachine Intelligencevol 1 no 2 pp 224ndash227 1979

[22] N Yorek I Ugulu and H Aydin ldquoUsing self-organizingneural network map combined with Wardrsquos clustering algo-rithm for visualization of studentsrsquo cognitive structural modelsabout aliveness conceptrdquoComputational Intelligence andNeuro-science vol 2016 Article ID 2476256 14 pages 2016

[23] T Honkela S Kaski K Lagus and T Kohonen ldquoWebsommdashself-organizing maps of document collectionsrdquo in Proceedingsof theWork shop on Self-OrganizingMaps (WSOM rsquo97) pp 310ndash315 Espoo Finland June 1997

[24] S Kaski T Honkela K Lagus and T Kohonen ldquoWebsommdashself-organizing maps of document collectionsrdquo Neurocomput-ing vol 21 no 1ndash3 pp 101ndash117 1998



Computational Intelligence and Neuroscience

Figure 6: Labeled map grid of the SOM. Each hexagon reports the labels and the total number of documents sharing the same BMU (e.g., AN(5), SR(6), PS(6), HP(4)). Coloring refers to K-means unsupervised clustering of the SOM.

Table 2: Contingency matrix N.

        MS    CR    AN    SR    HP    PS
        T_1   T_2   T_3   T_4   T_5   T_6    n_k
C_1      0     0     8     0     0     0      8
C_2      1     0     0     0     9    10     20
C_3      1     8     0     0     0     0      9
C_4      0     0     0     7     0     0      7
C_5     10     0     0     0     0     0     10
m_r     12     8     8     7     9    10   N = 54

of changes described in these two "MS" documents are quite similar to those contained in the documents labeled as "CR", "PS", and "HP".

Given the actual classification, the quality of the obtained clustering can be evaluated by computing four indices: purity, precision, recall, and F-measure [36].

Let T = {T_1, ..., T_r, ..., T_R} be the true partitioning given by the process operators, where the partition T_r consists of all the documents with label r. Let m_r = |T_r| denote the number of documents in true partition T_r. Also, let C = {C_1, ..., C_k, ..., C_K} denote the clustering obtained via the SOM text clustering algorithm, and let n_k = |C_k| denote the number of documents in cluster C_k. The K x R contingency matrix N induced by clustering C and the true partitioning T is obtained by computing the elements N(k, r) = n_{kr} = |C_k ∩ T_r|, where n_{kr} denotes the number of documents that are common to cluster C_k and true partition T_r. The contingency matrix for SOM text clustering of ECRs is reported in Table 2.
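To make the construction concrete, the contingency matrix can be assembled from two parallel lists of cluster and partition indices. The following minimal Python sketch uses toy data; the function and variable names are illustrative, not the case-study dataset:

```python
def contingency_matrix(cluster_of, label_of, K, R):
    """Build the K x R matrix with entry (k, r) = |C_k ∩ T_r| from
    parallel 0-based lists of cluster indices and partition indices."""
    N = [[0] * R for _ in range(K)]
    for k, r in zip(cluster_of, label_of):
        N[k][r] += 1  # one more document common to cluster k and partition r
    return N

# Toy data: 6 documents, 2 clusters, 2 true partitions.
clusters = [0, 0, 0, 1, 1, 1]   # cluster index of each document
labels = [0, 0, 1, 1, 1, 1]     # true-partition index of each document
print(contingency_matrix(clusters, labels, K=2, R=2))  # [[2, 1], [0, 3]]
```

Each row then sums to a cluster size n_k and each column to a partition size m_r, exactly as in the margins of Table 2.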

Starting from Table 2, the following indices are computed.

(i) Purity Index. The cluster-specific purity index of cluster C_k is defined as

\[ \mathrm{purity}_k = \frac{1}{n_k} \max_{r=1}^{R} n_{kr} \quad (8) \]

The overall purity index of the entire clustering C is computed as

\[ \mathrm{purity} = \sum_{k=1}^{K} \frac{n_k}{N}\, \mathrm{purity}_k = \frac{1}{N} \sum_{k=1}^{K} \max_{r=1}^{R} n_{kr} \quad (9) \]

As shown in Table 3, clusters C1, C4, and C5 have a purity index equal to 1; that is, they contain entities from only one partition. Clusters C2 and C3 gather entities from different partitions, that is, T_1, T_5, and T_6 for cluster C2 and T_1, T_2 for cluster C3. The overall purity index is equal to 0.79.

(ii) Precision Index. Given a cluster C_k, let T_{r_k} denote the majority partition that contains the maximum number of documents from C_k, that is, r_k = arg max_{r=1,...,R} n_{kr}. The precision index of a cluster C_k is given by

\[ \mathrm{prec}_k = \frac{1}{n_k} \max_{r=1}^{R} n_{kr} = \frac{n_{k r_k}}{n_k} \quad (10) \]

For the clustering in Table 2, the majority partitions are T_{r_1} = T_3, T_{r_2} = T_6, T_{r_3} = T_2, T_{r_4} = T_4, and T_{r_5} = T_1. The precision indices show that all documents gathered in clusters C1, C4, and C5 belong to the corresponding majority partitions T_3, T_4, and T_1. For cluster C2, 50% of the documents belong to T_6; finally, 89% of the documents in cluster C3 belong to T_2.

(iii) Recall Index. Given a cluster C_k, it is defined as

\[ \mathrm{recall}_k = \frac{n_{k r_k}}{|T_{r_k}|} = \frac{n_{k r_k}}{m_{r_k}} \quad (11) \]

where m_{r_k} = |T_{r_k}|. It measures the fraction of documents in partition T_{r_k} shared in common with cluster C_k. The recall indices reported in Table 3 show that clusters C1, C2, C3, and C4 share 100% of the documents in the majority partitions T_3, T_6, T_2, and T_4, respectively. Cluster C5 shares 83% of the documents in T_1.

(iv) F-Measure Index. It is the harmonic mean of the precision and recall values for each cluster. The F-measure for cluster C_k is therefore given as

\[ F_k = \frac{2 \cdot \mathrm{prec}_k \cdot \mathrm{recall}_k}{\mathrm{prec}_k + \mathrm{recall}_k} = \frac{2\, n_{k r_k}}{n_k + m_{r_k}} \quad (12) \]

The overall F-measure for the clustering C is the mean of the cluster-wise F-measure values:

\[ F = \frac{1}{K} \sum_{k=1}^{K} F_k \quad (13) \]

Table 3 shows that the F-measure of clusters C1 and C4 is equal to 1, while the other values are less than 1. The low values of the F-measure for clusters C2, C3, and C5 depend on a low precision index for clusters C2 and C3 and on a low recall index for cluster C5. Consequently, the overall F-measure is equal to 0.90.
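The four indices can be recomputed directly from the contingency matrix of Table 2. The short Python sketch below (not from the paper) reproduces the cluster-specific and overall values reported in Table 3:

```python
# Cluster-validity indices of Eqs. (8)-(13) recomputed from Table 2.
# Rows of N: clusters C1..C5; columns: true partitions T1 (MS) .. T6 (PS).
N = [
    [0, 0, 8, 0, 0, 0],    # C1
    [1, 0, 0, 0, 9, 10],   # C2
    [1, 8, 0, 0, 0, 0],    # C3
    [0, 0, 0, 7, 0, 0],    # C4
    [10, 0, 0, 0, 0, 0],   # C5
]

n = [sum(row) for row in N]            # cluster sizes n_k
m = [sum(col) for col in zip(*N)]      # partition sizes m_r
total = sum(n)                         # 54 documents in all

purity_k, recall_k, F_k = [], [], []
for k, row in enumerate(N):
    n_max = max(row)                   # n_{k r_k}: docs in the majority partition
    r_k = row.index(n_max)             # index r_k of the majority partition
    purity_k.append(n_max / n[k])      # Eq. (8); equals prec_k of Eq. (10)
    recall_k.append(n_max / m[r_k])    # Eq. (11)
    F_k.append(2 * n_max / (n[k] + m[r_k]))  # Eq. (12)

purity = sum(max(row) for row in N) / total  # Eq. (9), about 0.79
F = sum(F_k) / len(F_k)                      # Eq. (13), about 0.90

print([round(p, 2) for p in purity_k], round(purity, 2))
print([round(f, 2) for f in F_k], round(F, 2))
```

Running it confirms the Table 3 entries, for example purity_3 = 8/9 ≈ 0.89 and F_2 = 2·10/(20 + 10) ≈ 0.66.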


Table 3: Clustering validation (cluster-specific and overall indices).

Purity:     purity_1 = 1, purity_2 = 0.5, purity_3 = 0.89, purity_4 = 1, purity_5 = 1; overall purity = 0.79
Precision:  prec_1 = 1, prec_2 = 0.5, prec_3 = 0.89, prec_4 = 1, prec_5 = 1
Recall:     recall_1 = 1, recall_2 = 1, recall_3 = 1, recall_4 = 1, recall_5 = 0.83
F-measure:  F_1 = 1, F_2 = 0.66, F_3 = 0.94, F_4 = 1, F_5 = 0.91; overall F = 0.90

Table 4: Results of SOM-based classification. The first and second columns report the labels and the number of ECR texts of the actual classification. The third column reports the number of ECRs correctly classified through the label of the first associated BMU. The fourth column reports the number of documents associated with an empty first BMU but correctly classified by considering the label of the documents sharing the second associated BMU.

Label   Number of ECRs   Classified via 1st BMU   Classified via 2nd BMU
MS            12                   10                       2
CR             8                    6                       2
AN             8                    7                       1
SR             7                    6                       1
HP             9                    6                       3
PS            10                   10                       -

Given the actual classification, the SOM can be further validated through a leave-one-out cross-validation technique in order to check its classification ability. In particular, N − 1 ECR documents are used for training and the remaining one for testing (iterating until each ECR text in the dataset has been used for testing).

At each iteration, once the SOM has been trained on N − 1 ECRs, the testing sample is presented as input and a BMU is selected in the matching step of the SOM algorithm. The label of the training documents associated with that BMU is considered. In the case of an empty BMU, that is, one not associated with any training document, the closest unit associated with at least one training document is considered instead; in the case of a BMU associated with training documents carrying more than one label, the label with the greater number of documents is considered.

Table 4 shows the results of the leave-one-out cross-validation. For each row, that is, for a given ECR label, the second column reports the total number of documents in the dataset, while the last two columns report the number of testing ECRs correctly classified by the SOM. In particular, the third column reports the number of testing ECRs correctly classified because they were connected to a first BMU associated with training documents with the same label. The last column refers to the number of ECRs associated with an empty first BMU that nevertheless resulted closest to a second BMU related to documents belonging to the same class as the testing sample. The cross-validation study also demonstrates that the labels given by the SOM are coherent with the actual classification and confirms the ability of SOM as a classification tool.
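The labeling rule applied at each iteration can be sketched as follows. The prototype positions, document-to-BMU assignments, and labels below are toy placeholders, not the trained map from the case study:

```python
import math
from collections import Counter

def classify_by_bmu(x, prototypes, doc_bmu, doc_labels):
    """Return the majority label of the training documents mapped to the
    BMU of x; if that unit is empty, fall back to the next-closest unit
    holding at least one training document."""
    order = sorted(range(len(prototypes)),
                   key=lambda i: math.dist(prototypes[i], x))
    for unit in order:                      # units from closest to farthest
        labels = [doc_labels[j] for j, b in enumerate(doc_bmu) if b == unit]
        if labels:                          # skip empty units
            return Counter(labels).most_common(1)[0][0]
    return None

# Toy map: 4 units in 2-D; 5 training documents mapped to units 0 and 3.
prototypes = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
doc_bmu = [0, 0, 3, 3, 3]                  # BMU index of each training document
doc_labels = ["MS", "MS", "PS", "PS", "HP"]

print(classify_by_bmu((0.1, 0.1), prototypes, doc_bmu, doc_labels))  # MS
print(classify_by_bmu((0.9, 0.2), prototypes, doc_bmu, doc_labels))  # unit 2 empty, falls back -> PS
```

The second call illustrates the fallback: the closest unit holds no training documents, so the label comes from the next-closest populated unit, mirroring the "empty BMU" rule described above.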

6. Conclusions

In this paper, a real case study regarding the engineering change process in the complex products industry was conducted. The study concerned the post-change stage of the engineering change process, in which data on past engineering changes are analyzed to discover information exploitable in new engineering changes. In particular, SOM was used for clustering natural language texts produced during the engineering change process. The analyzed texts included the descriptions of the causes of changes contained in the ECR forms. Firstly, the SOM algorithm was used as a clustering tool to find relationships between the ECR texts and to cluster them accordingly. Subsequently, SOM was tested as a classification tool and the results were validated through a leave-one-out cross-validation technique.

The results of the real case study showed that SOM text clustering can be an effective tool for improving engineering change process analysis. In particular, some of the advantages highlighted in this study are as follows.

(1) Text mining methods allow analyzing unstructured data and deriving high-quality information. The main difficulty in ECR analysis consisted in analyzing natural language texts.

(2) Clustering analysis of past ECRs stored in the company allows automatically gathering ECRs on the basis of similarity between documents. When a new change is triggered, the company can quickly focus on the cluster of interest. Clustering can support the company in knowing whether a similar change was already managed in the past, in analyzing the best solution adopted, and in learning to avoid the mistakes made in the past.

(3) The use of SOM for ECR text clustering allows automatically organizing large document collections. With respect to other clustering algorithms, the main advantage of SOM text clustering is that the similarity of the texts is preserved in the spatial organization of the neurons. The distance among prototypes in the SOM map can therefore be considered an estimate of the similarity between documents belonging to different clusters. In addition, the SOM can first be computed using a representative subset of old input data; new input can then be mapped straight onto the most similar model without recomputing the whole mapping.

Nevertheless, the study showed some limitations of the application of SOM text clustering and classification. A first limitation is linked to natural language texts: the terms contained in different texts may be similar even if the engineering change requests concern different products, and this similarity of terms may influence the performance of SOM-based clustering. A second limitation is linked to the use of SOM as a classification method. Classification indeed requires the labeling of a training dataset; this activity requires deep knowledge of the different kinds of ECRs managed during the engineering change process and may be difficult and time-consuming.

In summary, the use of SOM text clustering in engineering change process analysis appears to be a promising direction for further research. A future development of this work will consider the use of SOM text clustering on a larger dataset of ECRs, comparing SOM with other clustering algorithms such as K-means or hierarchical clustering methods. Another direction of future research concerns the analysis of SOM robustness to parameter selection (i.e., the size and structure of the map and the kinds of learning and neighborhood functions).

Symbols

α(t): Scalar-valued learning-rate factor
N (matrix): Contingency matrix
δ(C_k, C_l): Between-cluster distance
Δ(C_k): Within-cluster distance
Δm_i(t): Adjustment of the prototype vector m_i(t)
C: Collection of documents
C_k: Cluster of documents of index k
T: True partition of documents given by process operators
T_r: True partition of documents with label r
T_{r_k}: Majority partition containing the maximum number of documents from C_k
σ(t): Width of the kernel, which corresponds to the radius of N_c(t)
c(x_j): Index of the BMU associated with x_j
d_j: Document of index j
DBI(K): Davies-Bouldin Index for K clusters
df_p: Number of documents in the collection C which contain term p
h_ci(t): Neighborhood function
K: Total number of clusters
K*: Optimum number of clusters
M: Total number of terms in the document collection
m_r: Number of documents in partition T_r
N: Total number of processed documents
N_c(t): Neighborhood around node c
n_k: Number of documents in cluster C_k
n_kr: Number of documents common to cluster C_k and partition T_r
P: Total number of units of the SOM
t: Index of time
tf_pj: Term frequency of term p in the jth document
w_pj: Weight associated with term p in the jth document
m_c: Prototype vector associated with the BMU
m_i: Prototype vector of index i
r_c: Position vector of index c
r_i: Position vector of index i
x_j: Numerical feature vector associated with the jth document

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research has been funded by MIUR, ITIA CNR, and Cluster Fabbrica Intelligente (CFI): Sustainable Manufacturing project (CTN01 00163 148175) and Vis4Factory project (Sistemi Informativi Visuali per i processi di fabbrica nel settore dei trasporti, PON02 00634 3551288). The authors are thankful to the technical and managerial staff of MerMec SpA (Italy) for providing the industrial case study of this research.

References

[1] T. Jarratt, J. Clarkson, and C. E. Eckert, Design Process Improvement, Springer, New York, NY, USA, 2005.

[2] C. Wanstrom and P. Jonsson, "The impact of engineering changes on materials planning," Journal of Manufacturing Technology Management, vol. 17, no. 5, pp. 561-584, 2006.

[3] S. Leng, L. Wang, G. Chen, and D. Tang, "Engineering change information propagation in aviation industrial manufacturing execution processes," The International Journal of Advanced Manufacturing Technology, vol. 83, no. 1, pp. 575-585, 2016.

[4] W.-H. Wu, L.-C. Fang, W.-Y. Wang, M.-C. Yu, and H.-Y. Kao, "An advanced CMII-based engineering change management framework: the integration of PLM and ERP perspectives," International Journal of Production Research, vol. 52, no. 20, pp. 6092-6109, 2014.

[5] T. A. W. Jarratt, C. M. Eckert, N. H. M. Caldwell, and P. J. Clarkson, "Engineering change: an overview and perspective on the literature," Research in Engineering Design, vol. 22, no. 2, pp. 103-124, 2011.

[6] B. Hamraz, N. H. M. Caldwell, and P. J. Clarkson, "A holistic categorization framework for literature on engineering change management," Systems Engineering, vol. 16, no. 4, pp. 473-505, 2013.

[7] H. Ritter, T. Martinetz, and K. Schulten, Neural Computation and Self-Organizing Maps: An Introduction, Addison-Wesley, New York, NY, USA, 1992.

[8] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas, SOM Toolbox for Matlab 5, 2005, http://www.cis.hut.fi/somtoolbox.

[9] E. Fricke, B. Gebhard, H. Negele, and E. Igenbergs, "Coping with changes: causes, findings and strategies," Systems Engineering, vol. 3, no. 4, pp. 169-179, 2000.

[10] J. Vesanto and E. Alhoniemi, "Clustering of the self-organizing map," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 586-600, 2000.

[11] A. Sharafi, P. Wolf, and H. Krcmar, "Knowledge discovery in databases on the example of engineering change management," in Proceedings of the Industrial Conference on Data Mining, Poster and Industry Proceedings, pp. 9-16, IBaI, July 2010.

[12] F. Elezi, A. Sharafi, A. Mirson, P. Wolf, H. Krcmar, and U. Lindemann, "A Knowledge Discovery in Databases (KDD) approach for extracting causes of iterations in Engineering Change Orders," in Proceedings of the ASME International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pp. 1401-1410, 2011.

[13] A. Sharafi, Knowledge Discovery in Databases: An Analysis of Change Management in Product Development, Springer, 2013.

[14] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, no. 1, pp. 59-69, 1982.

[15] T. Kohonen, "Essentials of the self-organizing map," Neural Networks, vol. 37, pp. 52-65, 2013.

[16] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480, 1990.

[17] Y.-C. Liu, M. Liu, and X.-L. Wang, "Application of self-organizing maps in text clustering: a review," in Applications of Self-Organizing Maps, M. Johnsson, Ed., chapter 10, InTech, Rijeka, Croatia, 2012.

[18] A. Ultsch and H. P. Siemon, "Kohonen's self organizing feature maps for exploratory data analysis," in Proceedings of the International Neural Networks Conference (INNC '90), pp. 305-308, Kluwer, Paris, France, 1990.

[19] J. Vesanto, "SOM-based data visualization methods," Intelligent Data Analysis, vol. 3, no. 2, pp. 111-126, 1999.

[20] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297, 1967.

[21] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224-227, 1979.

[22] N. Yorek, I. Ugulu, and H. Aydin, "Using self-organizing neural network map combined with Ward's clustering algorithm for visualization of students' cognitive structural models about aliveness concept," Computational Intelligence and Neuroscience, vol. 2016, Article ID 2476256, 14 pages, 2016.

[23] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "WEBSOM: self-organizing maps of document collections," in Proceedings of the Workshop on Self-Organizing Maps (WSOM '97), pp. 310-315, Espoo, Finland, June 1997.

[24] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM: self-organizing maps of document collections," Neurocomputing, vol. 21, no. 1-3, pp. 101-117, 1998.

[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513-523, 1988.

[26] G. Salton and R. K. Waldstein, "Term relevance weights in on-line information retrieval," Information Processing & Management, vol. 14, no. 1, pp. 29-35, 1978.

[27] Y. C. Liu, X. Wang, and C. Wu, "ConSOM: a conceptional self-organizing map model for text clustering," Neurocomputing, vol. 71, no. 4-6, pp. 857-862, 2008.

[28] Y.-C. Liu, C. Wu, and M. Liu, "Research of fast SOM clustering for text information," Expert Systems with Applications, vol. 38, no. 8, pp. 9325-9333, 2011.

[29] T. Kohonen, S. Kaski, K. Lagus et al., "Self organization of a massive document collection," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 574-585, 2000.

[30] J. Zhu and S. Liu, "SOM network based clustering analysis of real estate enterprises," American Journal of Industrial and Business Management, vol. 4, no. 3, pp. 167-173, 2014.

[31] B. C. Till, J. Longo, A. R. Dobell, and P. F. Driessen, "Self-organizing maps for latent semantic analysis of free-form text in support of public policy analysis," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 1, pp. 71-86, 2014.

[32] G. Q. Huang, W. Y. Yee, and K. L. Mak, "Current practice of engineering change management in Hong Kong manufacturing industries," Journal of Materials Processing Technology, vol. 139, no. 1-3, pp. 481-487, 2003.

[33] S. Angers, "Changing the rules of the 'Change' game: employees and suppliers work together to transform the 737/757 production system," 2002, http://www.boeing.com/news/frontiers/archive/2002/may/i_ca2.html.

[34] E. Subrahmanian, C. Lee, and H. Granger, "Managing and supporting product life cycle through engineering change management for a complex product," Research in Engineering Design, vol. 26, no. 3, pp. 189-217, 2015.

[35] D. Zeimpekis and E. Gallopoulos, TMG: A Matlab Toolbox for Generating Term-Document Matrices from Text Collections, 2006, http://scgroup20.ceid.upatras.gr:8000/tmg.

[36] M. J. Zaki and W. Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014.

[25] G Salton and C Buckley ldquoTerm-weighting approaches in auto-matic text retrievalrdquo Information ProcessingampManagement vol24 no 5 pp 513ndash523 1988

[26] G Salton and R K Waldstein ldquoTerm relevance weights in on-line information retrievalrdquo Information Processing amp Manage-ment vol 14 no 1 pp 29ndash35 1978

[27] Y C Liu X Wang and C Wu ldquoConSOM a conceptional self-organizingmapmodel for text clusteringrdquoNeurocomputing vol71 no 4ndash6 pp 857ndash862 2008

[28] Y-C Liu C Wu and M Liu ldquoResearch of fast SOM clusteringfor text informationrdquo Expert Systems with Applications vol 38no 8 pp 9325ndash9333 2011

[29] T Kohonen S Kaski K Lagus et al ldquoSelf organization of amassive document collectionrdquo IEEE Transactions on NeuralNetworks vol 11 no 3 pp 574ndash585 2000

[30] J Zhu and S Liu ldquoSOMnetwork based clustering analysis of realestate enterprisesrdquo American Journal of Industrial and BusinessManagement vol 4 no 3 pp 167ndash173 2014

[31] B C Till J Longo A R Dobell and P F Driessen ldquoSelf-organizing maps for latent semantic analysis of free-form textin support of public policy analysisrdquo Wiley InterdisciplinaryReviews Data Mining and Knowledge Discovery vol 4 no 1pp 71ndash86 2014

[32] G Q Huang W Y Yee and K L Mak ldquoCurrent practice ofengineering changemanagement in Hong Kongmanufacturingindustriesrdquo Journal of Materials Processing Technology vol 139no 1ndash3 pp 481ndash487 2003

[33] S Angers ldquoChanging the rules of the lsquoChangersquo gamemdashem-ployees and suppliers work together to transform the 737757production systemrdquo 2002 httpwwwboeingcomnewsfron-tiersarchive2002mayi ca2html

[34] E Subrahmanian C Lee and H Granger ldquoManaging andsupporting product life cycle through engineering changemanagement for a complex productrdquo Research in EngineeringDesign vol 26 no 3 pp 189ndash217 2015

[35] D Zeimpekis and E Gallopoulos TMG A Matlab toolboxfor generating term-document matrices from text collections2006 httpscgroup20ceidupatrasgr8000tmg

[36] M J Zaki and W Meira Jr Data Mining and AnalysisFundamental Concepts and Algorithms Cambridge UniversityPress 2014



Table 3: Clustering validation (cluster-specific and overall indices).

Purity:     purity_1 = 1, purity_2 = 0.5, purity_3 = 0.89, purity_4 = 1, purity_5 = 1; overall purity = 0.79
Precision:  prec_1 = 1, prec_2 = 0.5, prec_3 = 0.89, prec_4 = 1, prec_5 = 1
Recall:     recall_1 = 1, recall_2 = 1, recall_3 = 1, recall_4 = 1, recall_5 = 0.83
F-measure:  F_1 = 1, F_2 = 0.66, F_3 = 0.94, F_4 = 1, F_5 = 0.91; overall F = 0.90

Table 4: Results of the SOM-based classification. The first two columns report the labels and the number of ECR texts in the actual classification. The third column reports the number of ECRs correctly classified through the label of the first associated BMU; the fourth column reports the number of documents associated with an empty first BMU but correctly classified by considering the labels of the documents sharing the second associated BMU.

Label   Number of ECRs   Classified via 1st BMU   Classified via 2nd BMU
MS      12               10                       2
CR       8                6                       2
AN       8                7                       1
SR       7                6                       1
HP       9                6                       3
PS      10               10                       -
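The indices in Table 3 follow the standard external-validation definitions: for each cluster, the majority label is found in the contingency matrix; precision divides its count by the cluster size, recall divides it by the size of the corresponding true partition, and the F-measure combines the two. A minimal sketch of these computations (the contingency matrix below is a toy example, not the one from the case study):

```python
def cluster_validation(contingency):
    """External clustering validation from a contingency matrix.
    contingency[k][r] = number of documents shared by cluster C_k
    and true partition T_r (the matrix N in the Symbols list)."""
    n_docs = sum(sum(row) for row in contingency)
    # m_r: size of each true partition T_r (column sums)
    m = [sum(row[r] for row in contingency) for r in range(len(contingency[0]))]
    per_cluster = []
    for row in contingency:
        n_k = sum(row)                                      # size of cluster C_k
        r_star = max(range(len(row)), key=row.__getitem__)  # majority label
        prec = row[r_star] / n_k                            # cluster precision/purity
        rec = row[r_star] / m[r_star]                       # cluster recall
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_cluster.append((prec, rec, f))
    purity = sum(max(row) for row in contingency) / n_docs  # overall purity
    return per_cluster, purity

# Toy example: two clusters, two true labels
scores, purity = cluster_validation([[3, 1], [0, 2]])
print(purity)  # 5/6 ≈ 0.833
```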

Given the actual classification, the SOM can be further validated through a leave-one-out cross-validation technique in order to check its classification ability. In particular, N − 1 ECR documents are used for training and the remaining one for testing, iterating until each ECR text in the dataset has been used for testing.

At each iteration, once the SOM has been trained on N − 1 ECRs, the testing sample is presented as input and a BMU is selected in the matching step of the SOM algorithm. The label of the training documents associated with that BMU is then considered. In the case of an empty BMU, that is, one not associated with any training document, the closest unit associated with at least one training document is considered instead; in the case of a BMU associated with training documents carrying more than one label, the label with the greater number of documents is chosen.

Table 4 shows the results of the leave-one-out cross-validation. For each row, that is, for a given ECR label, the second column reports the total number of documents in the dataset, while the last two columns report the number of testing ECRs correctly classified by the SOM. In particular, the third column reports the number of testing ECRs correctly classified because they were connected to a first BMU associated with training documents carrying the same label. The last column refers to the number of ECRs associated with an empty first BMU that nevertheless were closest to a second BMU related to documents belonging to the same class as the testing sample. The cross-validation study also demonstrates that the labels given by the SOM are coherent with the actual classification and confirms the ability of the SOM as a classification tool.
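The matching-and-fallback rule described above can be sketched as follows. The trained prototype vectors and the unit-to-label map are assumed as given inputs; the names and data structures are illustrative, not taken from the paper:

```python
import numpy as np

def classify_by_bmu(x, prototypes, unit_labels):
    """Label a test vector following the matching step described above:
    visit units in order of increasing distance from x; the first unit
    with at least one training document (the BMU, or the closest
    nonempty unit if the BMU is empty) supplies the label, with ties
    resolved by majority vote among that unit's training documents."""
    order = np.argsort(np.linalg.norm(prototypes - x, axis=1))
    for unit in order:
        labels = unit_labels.get(int(unit), [])
        if labels:
            return max(set(labels), key=labels.count)
    return None  # no unit has any training documents

# Illustrative usage: 3 units in a 2-dimensional feature space
protos = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
labels = {1: ["MS", "MS", "CR"], 2: ["PS"]}
print(classify_by_bmu(np.array([0.1, 0.0]), protos, labels))  # "MS": unit 0 is empty
```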

6. Conclusions

In this paper, a real case study regarding the engineering change process in the complex products industry was conducted. The study concerned the post-change stage of the engineering change process, in which data from past engineering changes are analyzed to discover information exploitable in new engineering changes. In particular, SOM was used for clustering the natural language texts produced during the engineering change process. The analyzed texts included the descriptions of the causes of changes contained in the ECR forms. Firstly, the SOM algorithm was used as a clustering tool, to find relationships between the ECR texts and to cluster them accordingly. Subsequently, SOM was tested as a classification tool, and the results were validated through a leave-one-out cross-validation technique.

The results of the real case study showed that SOM text clustering can be an effective tool for improving engineering change process analysis. In particular, some of the advantages highlighted in this study are as follows:

(1) Text mining methods allow analyzing unstructured data and deriving high-quality information. The main difficulty in ECR analysis consisted in analyzing natural language texts.

(2) Clustering analysis of the past ECRs stored in the company allows automatically gathering ECRs on the basis of the similarity between documents. When a new change is triggered, the company can quickly focus on the cluster of interest. Clustering can help the company determine whether a similar change was already managed in the past, analyze the best solution adopted, and learn to avoid the mistakes made in the past.

(3) Using SOM for ECR text clustering allows automatically organizing large document collections. With respect to other clustering algorithms, the main advantage of SOM text clustering is that the similarity of the texts is preserved in the spatial organization of the neurons. The distance among prototypes in the SOM map can therefore be considered as an estimate of the similarity between documents belonging to different clusters. In addition, the SOM can first be computed using a representative subset of old input data; new input can then be mapped straight onto the most similar model without recomputing the whole mapping.

Nevertheless, the study showed some limitations of the application of SOM text clustering and classification. A first limitation is linked to natural language texts: the terms contained in different texts may be similar even if an engineering change request concerns a different product, and this similarity of terms may influence the performance of SOM-based clustering. A second limitation is linked to the use of SOM as a classification method. Classification indeed requires the labeling of a training dataset; this activity requires a deep knowledge of the different kinds of ECRs managed during the engineering change process and may be difficult and time consuming.

In summary, the use of SOM text clustering in engineering change process analysis appears to be a promising direction for further research. A future direction of the work will consider the use of SOM text clustering on a larger dataset of ECRs, comparing SOM with other clustering algorithms such as K-means or hierarchical clustering methods. Another direction of future research concerns the analysis of the robustness of SOM to parameter selection (i.e., the size and structure of the map and the kinds of learning and neighborhood functions).

Symbols

α(t): Scalar-valued learning-rate factor
N: Contingency matrix
δ(C_k, C_l): Between-cluster distance
Δ(C_k): Within-cluster distance
Δm_i(t): Adjustment of the prototype vector m_i(t)
C: Collection of documents
C_k: Cluster of documents of index k
T: True partition of documents given by process operators
T_r: True partition of documents with label r
T_rk: Majority partition containing the maximum number of documents from C_k
σ(t): Width of the kernel, corresponding to the radius of N_c(t)
c(x_j): Index of the BMU associated with x_j
d_j: Document of index j
DBI(K): Davies–Bouldin Index for K clusters
df_p: Number of documents in the collection C which contain term p
h_ci(t): Neighborhood function
K: Total number of clusters
K*: Optimum number of clusters
M: Total number of terms in the document collection
m_r: Number of documents in partition T_r
N: Total number of processed documents
N_c(t): Neighborhood around node c
n_k: Number of documents in cluster C_k
n_kr: Number of documents common to cluster C_k and partition T_r
P: Total number of units of the SOM
t: Index of time
tf_pj: Term frequency of term p in the jth document
w_pj: Weight associated with term p in the jth document
m_c: Prototype vector associated with the BMU
m_i: Prototype vector of index i
r_c: Position vector of index c
r_i: Position vector of index i
x_j: Numerical feature vector associated with the jth document
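The weighting symbols above (tf_pj, df_p, w_pj) correspond to the classic term-weighting scheme of Salton and Buckley [25]. A minimal sketch, assuming the standard form w_pj = tf_pj · log(N / df_p) (the exact variant used in the paper may differ):

```python
import math

def tfidf_weights(docs):
    """Compute w_pj = tf_pj * log(N / df_p) for each document j, where
    tf_pj is the raw frequency of term p in document j and df_p is the
    number of documents in the collection containing term p."""
    n = len(docs)                           # N: total number of documents
    df = {}
    for doc in docs:
        for term in set(doc):               # count each document at most once
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1  # raw term frequency
        weights.append({p: f * math.log(n / df[p]) for p, f in tf.items()})
    return weights

# Terms occurring in every document get weight 0; rarer terms weigh more.
w = tfidf_weights([["change", "request"], ["change", "design"]])
print(w[0])  # {'change': 0.0, 'request': 0.6931471805599453}
```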

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research has been funded by MIUR, ITIA CNR, and the Cluster Fabbrica Intelligente (CFI) Sustainable Manufacturing project (CTN01 00163 148175) and the Vis4Factory project (Sistemi Informativi Visuali per i processi di fabbrica nel settore dei trasporti, PON02 00634 3551288). The authors are thankful to the technical and managerial staff of MerMec SpA (Italy) for providing the industrial case study of this research.

References

[1] T. Jarratt, J. Clarkson, and C. E. Eckert, Design Process Improvement, Springer, New York, NY, USA, 2005.

[2] C. Wanstrom and P. Jonsson, "The impact of engineering changes on materials planning," Journal of Manufacturing Technology Management, vol. 17, no. 5, pp. 561-584, 2006.

[3] S. Leng, L. Wang, G. Chen, and D. Tang, "Engineering change information propagation in aviation industrial manufacturing execution processes," The International Journal of Advanced Manufacturing Technology, vol. 83, no. 1, pp. 575-585, 2016.

[4] W.-H. Wu, L.-C. Fang, W.-Y. Wang, M.-C. Yu, and H.-Y. Kao, "An advanced CMII-based engineering change management framework: the integration of PLM and ERP perspectives," International Journal of Production Research, vol. 52, no. 20, pp. 6092-6109, 2014.

[5] T. A. W. Jarratt, C. M. Eckert, N. H. M. Caldwell, and P. J. Clarkson, "Engineering change: an overview and perspective on the literature," Research in Engineering Design, vol. 22, no. 2, pp. 103-124, 2011.

[6] B. Hamraz, N. H. M. Caldwell, and P. J. Clarkson, "A holistic categorization framework for literature on engineering change management," Systems Engineering, vol. 16, no. 4, pp. 473-505, 2013.

[7] H. Ritter, T. Martinetz, and K. Schulten, Neural Computation and Self-Organizing Maps: An Introduction, Addison-Wesley, New York, NY, USA, 1992.

[8] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas, SOM Toolbox for Matlab 5, 2005, http://www.cis.hut.fi/somtoolbox.

[9] E. Fricke, B. Gebhard, H. Negele, and E. Igenbergs, "Coping with changes: causes, findings, and strategies," Systems Engineering, vol. 3, no. 4, pp. 169-179, 2000.

[10] J. Vesanto and E. Alhoniemi, "Clustering of the self-organizing map," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 586-600, 2000.

[11] A. Sharafi, P. Wolf, and H. Krcmar, "Knowledge discovery in databases on the example of engineering change management," in Proceedings of the Industrial Conference on Data Mining - Poster and Industry Proceedings, pp. 9-16, IBaI, July 2010.

[12] F. Elezi, A. Sharafi, A. Mirson, P. Wolf, H. Krcmar, and U. Lindemann, "A Knowledge Discovery in Databases (KDD) approach for extracting causes of iterations in Engineering Change Orders," in Proceedings of the ASME International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pp. 1401-1410, 2011.

[13] A. Sharafi, Knowledge Discovery in Databases: An Analysis of Change Management in Product Development, Springer, 2013.

[14] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, no. 1, pp. 59-69, 1982.

[15] T. Kohonen, "Essentials of the self-organizing map," Neural Networks, vol. 37, pp. 52-65, 2013.

[16] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480, 1990.

[17] Y.-C. Liu, M. Liu, and X.-L. Wang, "Application of self-organizing maps in text clustering: a review," in Applications of Self-Organizing Maps, M. Johnsson, Ed., chapter 10, InTech, Rijeka, Croatia, 2012.

[18] A. Ultsch and H. P. Siemon, "Kohonen's self organizing feature maps for exploratory data analysis," in Proceedings of the International Neural Networks Conference (INNC '90), pp. 305-308, Kluwer, Paris, France, 1990.

[19] J. Vesanto, "SOM-based data visualization methods," Intelligent Data Analysis, vol. 3, no. 2, pp. 111-126, 1999.

[20] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297, 1967.

[21] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224-227, 1979.

[22] N. Yorek, I. Ugulu, and H. Aydin, "Using self-organizing neural network map combined with Ward's clustering algorithm for visualization of students' cognitive structural models about aliveness concept," Computational Intelligence and Neuroscience, vol. 2016, Article ID 2476256, 14 pages, 2016.

[23] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "WEBSOM: self-organizing maps of document collections," in Proceedings of the Workshop on Self-Organizing Maps (WSOM '97), pp. 310-315, Espoo, Finland, June 1997.

[24] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM: self-organizing maps of document collections," Neurocomputing, vol. 21, no. 1-3, pp. 101-117, 1998.

[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513-523, 1988.

[26] G. Salton and R. K. Waldstein, "Term relevance weights in on-line information retrieval," Information Processing & Management, vol. 14, no. 1, pp. 29-35, 1978.

[27] Y. C. Liu, X. Wang, and C. Wu, "ConSOM: a conceptional self-organizing map model for text clustering," Neurocomputing, vol. 71, no. 4-6, pp. 857-862, 2008.

[28] Y.-C. Liu, C. Wu, and M. Liu, "Research of fast SOM clustering for text information," Expert Systems with Applications, vol. 38, no. 8, pp. 9325-9333, 2011.

[29] T. Kohonen, S. Kaski, K. Lagus et al., "Self organization of a massive document collection," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 574-585, 2000.

[30] J. Zhu and S. Liu, "SOM network based clustering analysis of real estate enterprises," American Journal of Industrial and Business Management, vol. 4, no. 3, pp. 167-173, 2014.

[31] B. C. Till, J. Longo, A. R. Dobell, and P. F. Driessen, "Self-organizing maps for latent semantic analysis of free-form text in support of public policy analysis," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 1, pp. 71-86, 2014.

[32] G. Q. Huang, W. Y. Yee, and K. L. Mak, "Current practice of engineering change management in Hong Kong manufacturing industries," Journal of Materials Processing Technology, vol. 139, no. 1-3, pp. 481-487, 2003.

[33] S. Angers, "Changing the rules of the 'Change' game: employees and suppliers work together to transform the 737/757 production system," 2002, http://www.boeing.com/news/frontiers/archive/2002/may/i_ca2.html.

[34] E. Subrahmanian, C. Lee, and H. Granger, "Managing and supporting product life cycle through engineering change management for a complex product," Research in Engineering Design, vol. 26, no. 3, pp. 189-217, 2015.

[35] D. Zeimpekis and E. Gallopoulos, TMG: A Matlab Toolbox for Generating Term-Document Matrices from Text Collections, 2006, http://scgroup20.ceid.upatras.gr:8000/tmg.

[36] M. J. Zaki and W. Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014.



[34] E Subrahmanian C Lee and H Granger ldquoManaging andsupporting product life cycle through engineering changemanagement for a complex productrdquo Research in EngineeringDesign vol 26 no 3 pp 189ndash217 2015

[35] D Zeimpekis and E Gallopoulos TMG A Matlab toolboxfor generating term-document matrices from text collections2006 httpscgroup20ceidupatrasgr8000tmg

[36] M J Zaki and W Meira Jr Data Mining and AnalysisFundamental Concepts and Algorithms Cambridge UniversityPress 2014


Computational Intelligence and Neuroscience 11

[5] T. A. W. Jarratt, C. M. Eckert, N. H. M. Caldwell, and P. J. Clarkson, "Engineering change: an overview and perspective on the literature," Research in Engineering Design, vol. 22, no. 2, pp. 103–124, 2011.

[6] B. Hamraz, N. H. M. Caldwell, and P. J. Clarkson, "A holistic categorization framework for literature on engineering change management," Systems Engineering, vol. 16, no. 4, pp. 473–505, 2013.

[7] H. Ritter, T. Martinetz, and K. Schulten, Neural Computation and Self-Organizing Maps: An Introduction, Addison-Wesley, New York, NY, USA, 1992.

[8] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas, SOM Toolbox for Matlab 5, 2005, http://www.cis.hut.fi/somtoolbox.

[9] E. Fricke, B. Gebhard, H. Negele, and E. Igenbergs, "Coping with changes: causes, findings and strategies," Systems Engineering, vol. 3, no. 4, pp. 169–179, 2000.

[10] J. Vesanto and E. Alhoniemi, "Clustering of the self-organizing map," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 586–600, 2000.

[11] A. Sharafi, P. Wolf, and H. Krcmar, "Knowledge discovery in databases on the example of engineering change management," in Proceedings of the Industrial Conference on Data Mining—Poster and Industry Proceedings, pp. 9–16, IBaI, July 2010.

[12] F. Elezi, A. Sharafi, A. Mirson, P. Wolf, H. Krcmar, and U. Lindemann, "A Knowledge Discovery in Databases (KDD) approach for extracting causes of iterations in Engineering Change Orders," in Proceedings of the ASME International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pp. 1401–1410, 2011.

[13] A. Sharafi, Knowledge Discovery in Databases: An Analysis of Change Management in Product Development, Springer, 2013.

[14] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, no. 1, pp. 59–69, 1982.

[15] T. Kohonen, "Essentials of the self-organizing map," Neural Networks, vol. 37, pp. 52–65, 2013.

[16] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.

[17] Y.-C. Liu, M. Liu, and X.-L. Wang, "Application of self-organizing maps in text clustering: a review," in Applications of Self-Organizing Maps, M. Johnsson, Ed., chapter 10, InTech, Rijeka, Croatia, 2012.

[18] A. Ultsch and H. P. Siemon, "Kohonen's self organizing feature maps for exploratory data analysis," in Proceedings of the International Neural Networks Conference (INNC '90), pp. 305–308, Kluwer, Paris, France, 1990.

[19] J. Vesanto, "SOM-based data visualization methods," Intelligent Data Analysis, vol. 3, no. 2, pp. 111–126, 1999.

[20] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, 1967.

[21] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224–227, 1979.

[22] N. Yorek, I. Ugulu, and H. Aydin, "Using self-organizing neural network map combined with Ward's clustering algorithm for visualization of students' cognitive structural models about aliveness concept," Computational Intelligence and Neuroscience, vol. 2016, Article ID 2476256, 14 pages, 2016.

[23] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "WEBSOM—self-organizing maps of document collections," in Proceedings of the Workshop on Self-Organizing Maps (WSOM '97), pp. 310–315, Espoo, Finland, June 1997.

[24] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM—self-organizing maps of document collections," Neurocomputing, vol. 21, no. 1–3, pp. 101–117, 1998.

[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.

[26] G. Salton and R. K. Waldstein, "Term relevance weights in on-line information retrieval," Information Processing & Management, vol. 14, no. 1, pp. 29–35, 1978.

[27] Y. C. Liu, X. Wang, and C. Wu, "ConSOM: a conceptional self-organizing map model for text clustering," Neurocomputing, vol. 71, no. 4–6, pp. 857–862, 2008.

[28] Y.-C. Liu, C. Wu, and M. Liu, "Research of fast SOM clustering for text information," Expert Systems with Applications, vol. 38, no. 8, pp. 9325–9333, 2011.

[29] T. Kohonen, S. Kaski, K. Lagus et al., "Self organization of a massive document collection," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 574–585, 2000.

[30] J. Zhu and S. Liu, "SOM network based clustering analysis of real estate enterprises," American Journal of Industrial and Business Management, vol. 4, no. 3, pp. 167–173, 2014.

[31] B. C. Till, J. Longo, A. R. Dobell, and P. F. Driessen, "Self-organizing maps for latent semantic analysis of free-form text in support of public policy analysis," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 1, pp. 71–86, 2014.

[32] G. Q. Huang, W. Y. Yee, and K. L. Mak, "Current practice of engineering change management in Hong Kong manufacturing industries," Journal of Materials Processing Technology, vol. 139, no. 1–3, pp. 481–487, 2003.

[33] S. Angers, "Changing the rules of the 'Change' game—employees and suppliers work together to transform the 737/757 production system," 2002, http://www.boeing.com/news/frontiers/archive/2002/may/i_ca2.html.

[34] E. Subrahmanian, C. Lee, and H. Granger, "Managing and supporting product life cycle through engineering change management for a complex product," Research in Engineering Design, vol. 26, no. 3, pp. 189–217, 2015.

[35] D. Zeimpekis and E. Gallopoulos, TMG: A Matlab Toolbox for Generating Term-Document Matrices from Text Collections, 2006, http://scgroup20.ceid.upatras.gr:8000/tmg.

[36] M. J. Zaki and W. Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014.
