Hindawi Publishing Corporation
Computational Intelligence and Neuroscience
Volume 2013, Article ID 230946, 6 pages
http://dx.doi.org/10.1155/2013/230946

Research Article
Research on Hotspot Discovery in Internet Public Opinions Based on Improved $K$-Means

Gensheng Wang

Electronic Business Department, Jiangxi University of Finance and Economics, Nanchang 330013, China

Correspondence should be addressed to Gensheng Wang; wanggenshengjc@163.com

Received 31 March 2013; Revised 5 June 2013; Accepted 21 July 2013

Academic Editor: Saeid Sanei

Copyright © 2013 Gensheng Wang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to discover hotspots in internet public opinions effectively is an active research field, and it plays a key role in helping governments and corporations find useful information in the mass data on the Internet. An improved $K$-means algorithm for hotspot discovery in internet public opinions is presented, based on an analysis of the defects and the calculation principle of the original $K$-means algorithm. First, new methods are designed to preprocess website texts, to select and express the characteristics of website texts, and to define the similarity between two website texts. Second, the clustering principle and the method of selecting the initial classification centers are analyzed and improved in order to overcome the limitations of the original $K$-means algorithm. Finally, the experimental results verify that the improved algorithm improves the clustering stability and classification accuracy of hotspot discovery in internet public opinions when used in practice.

1. Introduction

The rapid development of the Internet has a profound impact on countries, society, and individuals. How to effectively master mass data and extract the hotspot information in it has become an urgent problem in the management of internet public opinions. Solving this problem has broad application prospects. First, for individuals, it is an important means of obtaining hotspot information about current society promptly and conveniently. Second, for enterprises, it helps them master the most cutting-edge information and hot technologies in their fields and thereby increase their competitiveness. Third, and especially, for the country, it provides important clues that let the relevant government departments promptly learn the direction of public opinion in current society. This is conducive to analyzing and guiding public opinion and to actively steering the healthy development of internet public opinions. It also helps governments grasp the problems people care about most in each period, together with the viewpoints and attitudes toward these problems, so that they can make scientific and correct decisions, keep society stable, and truly achieve the aim that the Internet serves society and the people.

In the past, public opinion workers relied on manual sorting of webpage contents to discover the hotspot information of society. This was not only inefficient but also easily influenced by subjective judgment, so the results could deviate from the truth. At present, search engines meet, to some extent, people's demand for rapidly acquiring the information they need among massive and disordered information. However, their use of simple keyword matching produces a great deal of redundant and irrelevant content in search results; the redundant information overwhelms the information needed, leads to incomplete topic analysis by the relevant personnel, and makes a comprehensive grasp difficult. The premise for discovering hotspot information with search engines is that analysts know in advance that such topics exist. Such a method therefore lags behind events and is poorly suited to discovering new problems: it easily misses the best time to solve them, letting the problems spread and become difficult to control. Therefore, if the real-time hotspot information in a period is to be obtained and the internet hotspot topics in current society are to be discovered periodically, automatic solutions are becoming a valuable research direction.

2. Literature Review

At present, research at home and abroad on hotspot discovery in internet public opinions mainly focuses on two aspects: internet information processing and data mining. (1) In internet information processing, the main research topics include word segmentation technology and the measurement of the multidimensional vector space of article themes [1]. (2) In internet data mining, the topics involved are information acquisition of public opinions, automatic classification, automatic clustering, and so forth, and methods of this kind have achieved certain results. For instance, Hamerly and Elkan, after analyzing the shortcomings of the original $K$-means and their causes, put forward a new model to mine and analyze internet public opinion information and illustrated the application of text mining to the analysis of internet public opinions [2]. Kristina analyzed the basic situation of internet public opinions and designed a theme-based model for analyzing them [3]. Andreas combined the advantages of partitional clustering and agglomerative clustering, put forward an incremental hierarchical clustering algorithm, and applied it to hot topic discovery in internet public opinion [4]. Wagstaff and Rogers combined natural language processing with information retrieval technology and put forward a very effective single-granularity topic identification method based on event features [5]. Ya designed a hotspot event discovery system that is geared to internet news coverage and can automatically find the hotspot events on the internet within any period [6]. Bradley and Managasarian, in response to the demands of internet public opinion analysis, built a clustering-based system for discovering and analyzing hotspot problems in internet public opinion [7]. For mass internet public opinion information, how to improve the effectiveness and efficiency of analysis and processing, as well as the accuracy and efficiency of hotspot analysis, remains a focus of current research.

Current domestic and overseas studies on clustering methods for internet public opinions fall mainly into the following categories: partitional clustering, hierarchical clustering, density-based clustering, artificial neural network clustering, internet-based clustering, and so forth. Clustering is widely applied in this field. Depending on the objects, application fields, and aims of clustering, there are specific requirements on the quality, efficiency, and degree of result visualization of the clustering method. Hence, a proper clustering algorithm should be selected according to the specific conditions. For text clustering, $K$-means clustering is widely applied in the detection of internet hotspot topics because of features such as incremental and batch processing, speed, and efficiency, as well as its suitability for dynamically processing the mass data of internet media information. However, the clustering quality of the $K$-means algorithm relies too much on the initial number of clusters and the initial clustering centers, a weakness that must be overcome in actual applications.

The $K$-means algorithm is one of the best information clustering methods in data mining and can extract and find new knowledge. However, the $K$-means algorithm has been found to have great limitations when processing data that contain isolated points [6–8]. This paper presents some improvements to overcome these limitations and takes advantage of the powerful classification ability of the algorithm to discover hotspots in internet public opinions.

3. Text Preprocessing

Hotspot discovery depends on website text clustering, which can be described as follows. Given a text set $D = \{d_1, d_2, \ldots, d_n\}$, obtain a set of clusters $C = \{C_1, C_2, \ldots, C_K\}$ with $\bigcup_{i=1}^{K} C_i = D$, such that for every $d_i \in D$ there exists $C_j \in C$ with $d_i \in C_j$, and such that the objective function $Q(C)$ reaches its minimum or maximum value. Here $n$ is the total number of texts, $K$ is the final number of clusters, and $C_j \cap C_i = \emptyset$ for $j \neq i$.

3.1. Characteristic Selection and Expression of Website Text. The vector space model (VSM) is commonly adopted to express each text. In this model, each text $d$ is considered a vector in a vector space. In this paper, $\mathrm{tfidf}$ is used as the measure for the characteristic vector; it gives the weight of each word $t$ and is calculated as in (1):

$$\mathrm{tfidf}(d, t) = \mathrm{tf}(d, t) \cdot \log_2 \frac{N}{\mathrm{df}(t)}. \tag{1}$$

In (1), $\mathrm{tf}(d, t)$ is the frequency of word $t$ in text $d$, $\mathrm{df}(t)$ is the number of texts in the text set $D$ that contain word $t$, and $N$ is the total number of texts. After characteristic selection, each text $d \in D$ takes the form of a vector in which the value of each dimension is the corresponding $\mathrm{tfidf}(d, t)$ weight, so the text can be expressed as

$$d = \{(t_i, \mathrm{tfidf}(d, t_i)) \mid 1 \le i \le m\}, \tag{2}$$

where $t_i$ is a lexical entry and $m$ is the dimension of the characteristic vector. However, even after characteristic selection, $m$ is still very large, from thousands to tens of thousands of dimensions, while each text vector has very few nonzero word frequencies. The text VSM is therefore high-dimensional and sparse.
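
As an illustration of (1) and (2), the following minimal sketch computes tf-idf weights for a toy corpus and stores each text as a sparse mapping from lexical entry to weight. The function name `tfidf_vectors` and the sample tokens are hypothetical and are not taken from the paper; raw counts are assumed for the word frequency.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse dict {term: weight} per text."""
    n_docs = len(docs)  # N in equation (1)
    # df(t): number of texts in the set that contain term t
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # tf(d, t): raw count of t in d (assumption)
        # Only terms that occur in the text are stored, so the vector stays sparse.
        vectors.append({t: tf[t] * math.log2(n_docs / df[t]) for t in tf})
    return vectors

# Toy corpus; the tokens are illustrative only.
docs = [
    ["island", "sovereignty", "island"],
    ["island", "military"],
    ["syria", "opposition"],
    ["syria", "military"],
]
vectors = tfidf_vectors(docs)
```

Storing only the nonzero entries is one way to cope with the high dimension noted above, since each text uses only a small fraction of the $m$ lexical entries.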

3.2. Definition of Similarity. In this paper, the cosine measure is used for the similarity between website texts, and the similarity of two texts $d_1$ and $d_2$ is defined as

$$\mathrm{Sim}(d_1, d_2) = \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}, \tag{3}$$

where $\|d\|$ denotes the Euclidean norm of $d$. To reduce the impact of differing text lengths on the similarity calculation, each text vector is normalized to unit length; see (4):

$$d = \frac{d}{\|d\|} = \frac{\left(\mathrm{tfidf}(d, t_1), \mathrm{tfidf}(d, t_2), \ldots, \mathrm{tfidf}(d, t_m)\right)}{\sqrt{\mathrm{tfidf}(d, t_1)^2 + \mathrm{tfidf}(d, t_2)^2 + \cdots + \mathrm{tfidf}(d, t_m)^2}}. \tag{4}$$

Figure 1: The procedures of the $K$-means algorithm.

Thus $\|d\| = 1$, and the cosine similarity reduces to the dot product of the two text vectors, that is, $\mathrm{Sim}(d_1, d_2) = d_1 \cdot d_2$.
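
The equivalence between the cosine measure and the dot product of unit-length vectors can be checked numerically. The sketch below assumes sparse vectors stored as dictionaries; the helper names `normalize`, `dot`, and `cosine` and the sample weights are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two sparse vectors; only shared terms contribute."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)

def normalize(vec):
    """Scale a sparse vector {term: weight} to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return dict(vec)  # leave an all-zero vector unchanged
    return {t: w / norm for t, w in vec.items()}

def cosine(u, v):
    """Cosine similarity for vectors of arbitrary length."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

# Illustrative weights only.
d1 = {"island": 3.0, "military": 4.0}
d2 = {"island": 4.0, "japan": 3.0}

sim_direct = cosine(d1, d2)                    # 12 / (5 * 5) = 0.48
sim_unit = dot(normalize(d1), normalize(d2))   # same value after normalization
```

Normalizing once per text means that every later similarity evaluation is a single sparse dot product, which matters when the same texts are compared many times during clustering.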

4. Derivation of Hotspot Discovery Algorithm

4.1. $K$-Means Algorithm Principle. The steps of the $K$-means clustering algorithm are as follows [8] (see Figure 1).

(1) Select $K$ objects as the initial cluster seeds.

(2) Reassign each object to the most similar cluster according to the values of the cluster seeds.

(3) Update the cluster seeds; that is, recompute the mean value of the objects in each cluster and take these mean points as the new cluster seeds.

(4) Repeat (2) and (3) until no cluster changes.
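
The four steps can be sketched as follows. This is a minimal illustration, not the implementation of [8]: it assumes dense points and squared Euclidean distance, whereas the paper measures similarity with the cosine of unit-length text vectors (for unit vectors the two orderings agree, because the squared distance equals $2 - 2\cos$). The function names and the sample points are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, seeds, max_iter=100):
    """Plain K-means; `seeds` are the K initial cluster seeds of step (1)."""
    centers = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Step (2): assign each object to the nearest (most similar) seed.
        new_assignment = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Step (4): stop when no object changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step (3): recompute each seed as the mean of its cluster.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old seed if a cluster became empty
                centers[k] = mean_point(members)
    return centers, assignment

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, labels = kmeans(points, seeds=[points[0], points[3]])
```

The seeds are passed in explicitly so that the effect of different initial choices, which the following sections address, can be tried directly.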

4.2. Limitation of $K$-Means Algorithm. When the $K$-means algorithm is used to cluster data, the stability of the clustering results is not always good enough. Sometimes the clustering effect is very good (when the data distribution is convex or spherical), while at other times the clustering results show obvious deviations and errors, and the cause lies in the data themselves. Clustered data unavoidably contain isolated points, that is, a few data points that lie far from the high-density zones. $K$-means uses the cluster mean point (the geometric center of all data in the category) as the new clustering seed for the next round of clustering. When isolated points are present, the new seed may deviate from the true high-density zone and in turn cause the clustering results to deviate [9]. Therefore, the $K$-means algorithm has a great limitation when processing data with isolated points.
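
A small numerical example makes the effect of an isolated point concrete. The coordinates below are invented for illustration and do not come from the paper's data.

```python
def mean_point(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

# Four points in a dense zone plus one isolated point (illustrative values).
dense = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
isolated = (21.5, 21.5)

center_without = mean_point(dense)               # (1.5, 1.5), inside the dense zone
center_with = mean_point(dense + [isolated])     # (5.5, 5.5), outside the dense zone
```

With the isolated point included, the mean lies outside the square spanned by the four dense points, so a seed placed there no longer represents the dense zone.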

4.3. Improving the $K$-Means Algorithm Principle. The original $K$-means algorithm selects $K$ points as initial cluster centers and then begins the iterative operation. Different selections of the initial points can lead to different clustering results. To reduce the dependence of the clustering result on the initial values and to improve clustering stability, better initial cluster centers can be obtained with a search algorithm for the cluster centers [9, 10].

In the search process, random sampling is used so that the sampled data stay as undistorted as possible and reflect the original data distribution, as shown in Figure 2, where (a) is the original data distribution and (b) is the sampled data distribution.

When the sampled data and the original data are each clustered by the $K$-means algorithm, little change is found in the final cluster centers. The sampling method is therefore suitable for selecting the initial cluster centers. To minimize the effect of sampling on this selection, the sample set extracted each time should fit into memory, and the union of the sample sets extracted over $J$ rounds should be as close as possible to the original data set. Each extracted sample is clustered by the $K$-means algorithm and yields one group of cluster centers, so $J$ rounds of sampling produce $J$ groups of cluster centers in all. The clustering criterion function values of the $J$ groups are then compared, and the group with the minimum criterion value $J_c$ is taken as the optimal set of initial cluster centers.

To guard against the criterion function segmenting large clusters into small ones, the algorithm takes the initial number of clusters as $K'$ with $K' > K$. The value of $K'$ is chosen as a compromise between quality requirements and running time. A larger $K'$ expands the solution search scope and reduces the chance that no initial value lies near a certain extremum. Using the initial cluster centers found by the search, the original data are clustered by another run of the $K$-means algorithm, which outputs $K'$ cluster centers; the number of clusters is then reduced to the specified value $K$.
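
The search for initial centers by repeated sampling can be sketched as below. Several details are assumptions made for illustration and are not specified in the text: the criterion $J_c$ is taken to be the sum of squared Euclidean distances to the nearest center, it is evaluated on the full data set, the $J$ samples are disjoint slices of a shuffled copy of the data, and all function names are hypothetical. The reduction from $K'$ clusters to $K$ is not shown.

```python
import random

def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centers):
    return min(range(len(centers)), key=lambda k: sqdist(p, centers[k]))

def kmeans(points, seeds, max_iter=100):
    centers = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, a in zip(points, labels) if a == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers

def criterion(points, centers):
    """Sum of squared distances to the nearest center (assumed form of J_c)."""
    return sum(sqdist(p, centers[nearest(p, centers)]) for p in points)

def search_initial_centers(points, k_prime, n_samples, rng):
    shuffled = points[:]
    rng.shuffle(shuffled)
    # J disjoint samples whose union is the original data set.
    samples = [shuffled[j::n_samples] for j in range(n_samples)]
    best = None
    for sample in samples:
        centers = kmeans(sample, rng.sample(sample, k_prime))
        score = criterion(points, centers)  # compared on the full data (assumption)
        if best is None or score < best[0]:
            best = (score, centers)
    return best

rng = random.Random(0)
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(30)]
score, initial_centers = search_initial_centers(points, k_prime=4, n_samples=3, rng=rng)
```

Here `k_prime=4` exceeds the three groups used to generate the toy data, mirroring the choice $K' > K$ described above.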

4.4. Improving the Selection of Initial Classification Centers. The new method for selecting the initial cluster centers is based on the assumption that the distribution of the website text set is known. In this paper, good initial cluster centers should satisfy the following rules.

(1) The selected initial centers belong to different clusters; that is, no two initial centers may come from the same cluster.

(2) Each selected initial center should represent its cluster; that is, it should be as close as possible to the true cluster center.

Selecting $K$ texts as initial cluster centers while ensuring that the $K$ texts belong to $K$ different clusters is a strict constraint that is difficult to satisfy through random sampling alone. To minimize the effect of sampling on the initial cluster centers, $m$ samples are taken, each of size $n/m$, where $n$ is the number of texts in the text set. The value of $m$ is chosen so that each sample fits into main memory and, as far as possible, the union of the $m$ samples is equivalent to the original text set. Each sample is clustered by the $K$-means algorithm to produce a group of text clusters with $K$ cluster centers, so the $m$ sampling operations produce $m \times K$ cluster centers in all. An agglomerative hierarchical clustering algorithm (the single-link algorithm) is then used to cluster these centers into $K$ clusters, whose average values are the final $K$ initial cluster centers.

Unlike the partitioning strategy of the $K$-means algorithm, agglomerative hierarchical clustering does not require selecting initial cluster centers. It first regards each text as a cluster whose center is the text itself, and each clustering step merges the two most similar clusters into one, until all texts are merged into a single cluster or only $K$ clusters remain. As clustering proceeds, similar texts are gradually merged into the same cluster, and the hierarchy of clusters is generated automatically.

Figure 2: Comparison of data distribution before and after sampling. (a) Original data distribution. (b) Sampled data distribution.

Combining the agglomerative hierarchical clustering algorithm with the $K$-means algorithm, a hierarchical clustering algorithm based on $K$-means is proposed to select the initial cluster centers; that is, the cluster centers produced by the $K$-means method restrict the agglomerative space of the agglomerative hierarchical clustering algorithm. The selection method is described as follows.

(1) The text set is sampled $m$ times, dividing it into $m$ sample sets $S_1, S_2, \ldots, S_m$.

(2) The $K$-means algorithm is run on each sample set to produce $m$ groups of $K$ cluster centers.

(3) The $m \times K$ cluster centers are clustered again by the agglomerative hierarchical clustering algorithm (the single-link algorithm is used here) until only $K$ clusters remain, and the average value of each cluster is taken as an initial cluster center for the next run of the $K$-means algorithm.
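
Steps (1) to (3) can be sketched as follows. This is an illustrative reading of the procedure rather than the author's implementation (for which the paper refers to [8]). It assumes dense points, squared Euclidean distance in place of the cosine measure, disjoint samples taken from a shuffled copy of the data, and random seeds for the per-sample $K$-means runs; all function names and the toy data are hypothetical.

```python
import random

def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, seeds, max_iter=100):
    centers = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda k: sqdist(p, centers[k]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, a in zip(points, labels) if a == k]
            if members:
                centers[k] = mean_point(members)
    return centers

def single_link_merge(items, k):
    """Agglomerative clustering: merge the two closest clusters until k remain.
    Single link: cluster distance is the smallest pairwise member distance."""
    clusters = [[x] for x in items]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(sqdist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

def select_initial_centers(points, k, m, rng):
    shuffled = points[:]
    rng.shuffle(shuffled)
    samples = [shuffled[j::m] for j in range(m)]           # step (1): m sample sets
    candidates = []
    for sample in samples:                                  # step (2): K-means per sample
        candidates.extend(kmeans(sample, rng.sample(sample, k)))
    groups = single_link_merge(candidates, k)               # step (3): merge m*K centers
    return [mean_point(g) for g in groups]                  # average of each group

rng = random.Random(1)
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (6, 6), (0, 6)] for _ in range(30)]
initial = select_initial_centers(points, k=3, m=3, rng=rng)
final_centers = kmeans(points, initial)   # the next run of K-means on the full set
```

The quadratic pairwise search in `single_link_merge` is acceptable here because it only ever operates on the $m \times K$ candidate centers, not on the full text set.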

From the algorithm above, it can be seen that each sample is smaller than the original text set, so the search for the initial cluster centers requires less computation and fewer iterations and runs faster. At the same time, it ensures that the final cluster centers belong to different clusters and are adequately representative.

For the specific algorithm flow used in this paper, refer to [8].

5. Experimental Verification

5.1. Data Acquisition and Preprocessing. Acquisition and preprocessing of the verification data in this paper mainly include the following steps. (1) Public opinion data acquisition: web search technology is adopted to traverse the entire Web space within a designated scope and collect all kinds of public opinion information; an indexer builds indexes of the acquired information, which are saved in the index database. The objects of data acquisition are mainly the major web portals, BBS, blogs, and so forth. (2) Word segmentation of website text: the acquired public opinion information is unstructured data and must be preprocessed. Research on Chinese word segmentation is mature, and this paper adopts the Chinese Lexical Analysis System of the Institute of Computing Technology (ICTCLAS). (3) Text feature extraction: the aim of feature selection is to filter out words that carry little information and have little influence on the discovery of public opinion hotspots, thereby reducing the dimension of the website feature vector, improving processing efficiency, and reducing computational complexity. The dimension reduction adopted in this paper builds an evaluation function of the webpage theme through statistical methods, evaluates each feature, and chooses the words that meet a preset threshold as the feature items of the webpage. (4) Feature representation: this paper adopts the vector space model (VSM) to represent public opinion information; the specific forms are omitted here.

5.2. Experimental Results. Because news receives great attention among Internet information and is easy to collect, this paper takes internet news as verification data. First, 8919 pieces of news are randomly chosen from the politics news published from December 1, 2012, to December 15, 2012, as the test samples, obtained by the feature words of the webpage cluster. Because the webpages come from real websites, the webpage data have a certain complexity and randomness. Word segmentation yields 68213 words in total, and 52173 words remain after stop-word processing. For the subsequent calculation, the top 10% of words, that is, 6512 words, are taken as the feature vector of the webpage text. The test results are shown in Tables 1, 2, and 3. Table 1 gives the vocabulary and word frequencies with large information gain values, Table 2 gives the statistics of news hotspot themes, and Table 3 compares the clustering performance of the algorithm in this paper with that of ordinary $K$-means [8].

Table 1: Partial statistics of feature vector word frequency.

| Feature word | Word frequency |
|---|---|
| Diaoyu Islands | 12156 |
| American | 8973 |
| China | 9987 |
| Syria | 4612 |
| Russia | 3416 |
| Military | 1256 |
| Japan | 3421 |
| Shinzo Abe | 1281 |
| Obama | 2521 |
| Hugo Chavez | 1452 |

Table 2: Partial statistics of news hotspot themes.

| News theme | Number of pages | Feature words |
|---|---|---|
| Diaoyu Islands | 1524 | Sovereignty, Shinzo Abe, island, Purchase, Escort, Military, Fighter, American, China, Japan |
| Syria Crisis | 642 | The opposition, Muslim, Shiite, Sunnite, BaShaEr, Antiterrorism, Iran, Russia, American, the Arab league |

Table 3: Algorithm performance ($F_1$) comparison of different algorithms.

| $F_1$ interval | $F_1$ typical value | Experimental frequency of $F_1$ of the original $K$-means algorithm falling into this interval | Experimental frequency of $F_1$ of the improved algorithm falling into this interval |
|---|---|---|---|
| [0.15, 0.25] | 0.20 | 1 | 0 |
| [0.25, 0.35] | 0.30 | 2 | 0 |
| [0.35, 0.45] | 0.40 | 2 | 0 |
| [0.45, 0.55] | 0.50 | 4 | 0 |
| [0.55, 0.65] | 0.60 | 5 | 0 |
| [0.65, 0.75] | 0.70 | 7 | 9 |
| [0.75, 0.85] | 0.80 | 2 | 11 |
| [0.85, 0.95] | 0.90 | 1 | 8 |
| [0.95, 1.00] | 1.0 | 0 | 0 |

In Table 3, $F_1$ denotes the $F$-measure value; the $F_1$ distribution is widely used to illustrate the performance of different algorithms [3–6, 11]. The data introduced previously are used, and the specific calculation items can be found in [8]. Table 3 shows that the clustering results obtained by the ordinary $K$-means algorithm have poor stability and scattered $F$-measure values, whereas the improved clustering algorithm gives more stable clustering results, more concentrated $F$-measure values, and a higher average $F$-measure value.

The experiment shows that the improved clustering algorithm greatly improves accuracy and stability. With the ordinary $K$-means algorithm, the $F$-value of the clustering results scatters from 0.60 to 0.75; with the improved algorithm, it stays stably between 0.75 and 0.85.
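
The concentration of the two $F_1$ distributions in Table 3 can be summarized with a frequency-weighted mean and standard deviation. The sketch below is a reader's calculation, not one reported in the paper; it assumes that the typical value of each interval stands in for every experiment that falls into it, and the helper names are hypothetical.

```python
# Typical F1 value of each interval and the two frequency columns of Table 3.
typical = [0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00]
original_freq = [1, 2, 2, 4, 5, 7, 2, 1, 0]
improved_freq = [0, 0, 0, 0, 0, 9, 11, 8, 0]

def weighted_mean(values, freqs):
    return sum(v * f for v, f in zip(values, freqs)) / sum(freqs)

def weighted_std(values, freqs):
    mu = weighted_mean(values, freqs)
    var = sum(f * (v - mu) ** 2 for v, f in zip(values, freqs)) / sum(freqs)
    return var ** 0.5

mean_original = weighted_mean(typical, original_freq)   # 14.0 / 24, about 0.583
mean_improved = weighted_mean(typical, improved_freq)   # 22.3 / 28, about 0.796
std_original = weighted_std(typical, original_freq)
std_improved = weighted_std(typical, improved_freq)
```

Under this assumption the improved algorithm has both the higher mean and the smaller spread, which agrees with the qualitative reading of Table 3 given above.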

6. Conclusion

The Internet is becoming the main channel through which people obtain and release information, and the guiding role of internet public opinion information keeps growing. How to gather public opinions and discover hotspots on the basis of acquired Internet public opinion information, and how to track and analyze the hotspots so as to guarantee information security, have aroused wide attention in the industry. Against this background, this paper analyzes the advantages and disadvantages of various clustering algorithms, chooses $K$-means clustering as the website text clustering model, and puts forward a new algorithm for discovering internet public opinion hotspots by improving its sensitivity to the initial number of clusters and the initial clustering centers. The test illustrates the applicability and reliability of the method. Future study will focus on clustering the features of internet information text, with the final aim of realizing a clustering algorithm applicable to all languages.

References

[1] H. Liu and J. H. Xu, “Research of internet public opinion hotspot detection,” Bulletin of Science and Technology, vol. 27, no. 3, pp. 421–425, 2011.

[2] G. Hamerly and C. Elkan, “A new algorithm based on K-means and its application in internet public opinion hotspot detection,” Pattern Recognition, vol. 32, no. 6, pp. 521–534, 2012.

[3] L. M. Kristina, “Document clustering in reduced dimension vector space,” Journal of Computer Application, vol. 27, no. 10, pp. 37–49, 2011.

[4] H. J. Andreas, “Research on text document clustering,” Computer Simulation, vol. 24, no. 7, pp. 84–99, 2010.

[5] C. D. Wagstaff and S. S. Rogers, “Constrained K-means clustering with background knowledge,” Journal of Computer Engineering and Application, vol. 21, no. 5, pp. 467–479, 2011.

[6] B. T. Ya, “Research on public opinion hotspot detection based on SVM,” Science and Technology Management Research, vol. 25, no. 2, pp. 64–69, 2009.

[7] P. S. Bradley and L. S. Managasarian, “K-plane clustering,” Journal of Global Optimization, vol. 16, pp. 23–32, 2010.

[8] Y. Tang and Q. S. Rong, “An implementation of clustering algorithm based on K-means,” Journal of Hubei Institute For Nationalities, vol. 22, no. 1, pp. 69–71, 2011.

[9] Z. H. Yang and Y. T. Yang, “Document clustering method based on hybrid of SOM and K-means,” Computer Application, vol. 27, no. 5, pp. 73–75, 2012.

[10] Y. F. Zhang and J. L. Mao, “An improved K-means algorithm,” Computer Application, vol. 23, no. 8, pp. 31–33, 2009.

[11] N. Li and D. D. Wu, “Using text mining and sentiment analysis for online forums hotspot detection and forecast,” Decision Support Systems, vol. 48, no. 2, pp. 354–368, 2010.

2 Literature Review

At present the study on hotspot discovery of internet publicopinions at home and abroad mainly focuses on such twoaspects as internet information processing and data min-ing (1) In the aspect of internet information processingthe main research contents of scholars at home and abroadinclude word segmentation technology measuring of mul-tidimensional vector space on article theme [1] (2) In theaspect of internet data mining contents involved are infor-mation acquisition of public opinions automatic classifica-tion automatic clustering and so forth and this kind ofmethods has obtained certain achievements For instanceHamerly and Elkan on the basis of analyzing the shortages oforiginal 119870-means and its reasons put forward a new modelto mine and analyze internet public opinions informationand illustrated the application of text mining in the analysisof internet public opinions [2] Kristina analyzed the basicsituation of internet public opinions and designed an analyz-ing model of internet public opinions based on themes [3]Andreas combined the advantages of comprehensive parti-tional clustering and agglomerate clustering and put forwardan incremental hierarchical clustering algorithm and appliedit to hot topic discovery in internet public opinion [4] Wag-staff and Rogers combined natural language processing withinformation retrieval technology and put forward a veryeffective single-granularity topic identification method as tothe event features [5] Ya designed a hotspot events discoverysystemwhich is geared to the needs to internet news coverageand able to automatically find the hotspot events on the inter-net within any period [6] Bradley andManagasarian accord-ing to the demands on the analysis of internet public opin-ions built the discovery and analysis system of internet pub-lic opinion hotspots problems based on clustering [7] As formass internet public opinion information how to improvethe effectiveness and efficiency of analysis and processing aswell as the 
accuracy and efficiency of the analysis of inter-net public opinion hotspots remains a hotspot for currentresearch

Currently domestic and overseas studies on the clusteringmethods of internet public opinions are mainly divided intothe following categories partitional clustering hierarchicalclustering clustering based on density artificial neural net-work clustering clustering based on internet and so forthin which clustering is widely applied According to differentobjects application fields and aims of clustering there arespecific requirements on the quality efficiency and resultvisualization degree of clustering for clustering methodsHence proper clustering algorithm shall be selected asrequired by specific conditions among which as to text clus-tering119870-means clustering due to its features like incrementbatch processing speediness and efficiency as well as itsadvantage in applicable to dynamically process mass data ofinternetmedia information is widely applied in the detectionof internet hotspot topics However the clustering quality in119870-means algorithm relies too much on the initial number ofclusters and initial clustering centers which shall be con-quered in actual application

119870-means algorithmis one of the best information cluster-ing methods in data mining which can extract and find newknowledge But it is found that 119870-means algorithm used inprocessing the data of isolated points has great limitations[6ndash8] The paper tries to present some improvements toover- come these limitations and takes advantage of powerfulclassification ability of the algorithm to discovery hotspot ininternet public opinions

3 Text Preprocessing

Hotspot discovery depends on website text clustering whichcan be described as a given text set119863 = 119889

1 1198892 119889

119899 even-

tually get a clusterrsquos set 119862 = 1198621 1198622 119862

119899 cup119870119894=1119862119894= 119863

derive for all 119889119894(119889119894isin 119863) exist119862

119895(119862119895isin 119862) and 119889

119894isin 119862119895 and also

make the objective function 119876(119862) reach the minimum ormaximum value of which 119899 is total text number 119870 is finalclustering number and 119862

119895cap 119862119894= 120601 119895 = 119894

31 Characteristic Selection and Expression of Website TextVector space model (VSM) is commonly adopted to expresseach text In this model each text 119889 is considered as a vectorin a vector space 119905119891119894119889119891 is used as a measure of characteristicvector in this paper and this measure gives the weight of eachword 119905 See (1) for the calculation of the weight

119905119891119894119889119891 (119889 119905) = 119905119891 (119889119905) lowast log2

119873

119889119891 (119905) (1)

In (1) 119905119891(119889 119905) is the word frequency of word 119905 in the text119889 119889119891(119905) is all the text numbers of word 119905 contained in thetext set119863 and119873 is total text number After the characteristicselection text 119889 isin 119863 is the form of the vector and the value ofeach dimension is the corresponding 119905119891119894119889119891(119889 119905)weight valueso the text can be expressed as follows

$$d = \{(t_i, \mathrm{tfidf}(d, t_i)) \mid 1 \le i \le m\} \tag{2}$$

where $t_i$ is a lexical entry and $m$ is the dimension of the characteristic vector. Even after characteristic selection, $m$ remains very large, from thousands of dimensions at least to tens of thousands at most, while each text vector has very few nonzero word frequencies. The VSM representation of text is therefore high-dimensional and sparse.

3.2 Definition of Similarity. In this paper, the cosine measure is used to quantify the similarity between website texts, and the similarity of two texts $d_1$ and $d_2$ is defined as follows:

$$\mathrm{Sim}(d_1, d_2) = \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert} \tag{3}$$

To reduce the impact of differing text lengths on the similarity calculation, each text vector is normalized to unit length; see (4):

$$d = \frac{d}{\lVert d \rVert} = \frac{\left(\mathrm{tfidf}(d, t_1), \mathrm{tfidf}(d, t_2), \ldots, \mathrm{tfidf}(d, t_m)\right)}{\sqrt{\mathrm{tfidf}(d, t_1)^2 + \mathrm{tfidf}(d, t_2)^2 + \cdots + \mathrm{tfidf}(d, t_m)^2}} \tag{4}$$




For unit-length text vectors, the cosine similarity reduces to the dot product:

$$\mathrm{Sim}(d_1, d_2) = d_1 \cdot d_2$$
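The relationship between the cosine measure, unit-length normalization, and the dot product can be illustrated with a short sketch. It assumes text vectors are stored as sparse `{term: weight}` dictionaries; the helper names `normalize`, `dot`, and `cosine` and the sample weights are illustrative, not taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def normalize(vec):
    """Scale a sparse vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return dict(vec)
    return {t: w / norm for t, w in vec.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom else 0.0


# Hypothetical weights, for illustration only.
u = {"syria": 3.0, "crisis": 4.0}
v = {"syria": 4.0, "aid": 3.0}

print(cosine(u, v))
print(dot(normalize(u), normalize(v)))
```

Both calls give the same value up to floating-point rounding. Normalizing each vector once up front means that every later similarity evaluation during clustering needs only a dot product.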























$S_1, S_2, \ldots, S_m$















Syria Crisis 642



F1 distribution of the original and improved algorithms

F1 interval     F1 typical value   F1 of original   F1 of improved
[0.15, 0.25]    0.20               1                0
[0.25, 0.35]    0.30               2                0
[0.35, 0.45]    0.40               2                0
[0.45, 0.55]    0.50               4                0
[0.55, 0.65]    0.60               5                0
[0.65, 0.75]    0.70               7                9
[0.75, 0.85]    0.80               2                11
[0.85, 0.95]    0.90               1                8
[0.95, 1.00]    1.0                0                0



6 Conclusion


References



















