
MASSIVE-WEB CONTENT ANALYTIC BASED WEB CONTENT MINING BY HYBRID SCHEMES COMBINING ONTOLOGY AND FREQUENT ITEM CLUSTERING TECHNIQUES

Nagappan, V.K. 1, Dr. P. Elango 2

1 Research Scholar, Research and Development Center, Bharathiar University, Coimbatore.

2 Assistant Professor, Dept. of Information Technology, Perunthalaivar Kamarajar Institute of Engineering and Technology (PKIET) (Govt. of Puducherry Institution), Nedungadu, Karaikal, UT of Puducherry.

Corresponding authors: [email protected], [email protected]

Abstract: The field of massive-web content analytics has developed rapidly in recent years. Within massive-web content analytics, web content mining faces challenging tasks: irregular classification, response analysis and web content summarization, all aimed at deriving higher-quality web content from websites. Conventional methods are not sufficient for massive-web content analytics. A keyword-based web content mining approach allows the analyst to represent the composite structure of websites, to implement knowledge about the hierarchical structure of categories, and to use web content describing the relationships between websites and individual input data. We use manually created data corpora, namely a Sports data-corpus and a book library data-corpus, for massive-web content analytics. Finally, the keyword-based web content mining algorithm is realized with two techniques, Hybrid Schemes combining Ontology with web content mining techniques and Hybrid Schemes combining Ontology with the Frequent Item Clustering method, to improve the efficiency and purity of massive-web content analytics.

Keywords: web-content analytics, hierarchical structural categories, clustering efficiency, purity

I. Introduction

Web-content mining is a process for achieving high-quality clustering of web-content information. Partition-based algorithms such as K-means, EM and sGEM, and rule-mining-based algorithms such as Apriori, FP-Growth and FP-Bonsai, are useful methods for web-content mining. All of these partition-based and rule-mining-based algorithms are used to group relevant data into the same cluster, but they have some drawbacks with respect to the objectives of web-content mining. To address these drawbacks, the proposed research discusses techniques for grouping data using “web content analytic” on a manually created Sports data-corpus and a book library data-corpus.

International Journal of Pure and Applied Mathematics, Volume 118 No. 20 2018, 2631-2649, ISSN: 1314-3395 (on-line version), url: http://www.ijpam.eu, Special Issue

Figure 1. Web content mining by hybrid schemes combining ontology and frequent item clustering

Categorization decides to which of a set of predefined categories a text belongs. An unknown text set is included in the predefined categories according to a known categorization where suitable. Finally, these categories are sorted according to a well-known category-based web content structure. The organization of the category-based web content structure is discussed under the process of frequent-itemset-based clustering. This organization reduces the average distance between the web content results. Frequent itemset clustering studies sets of frequent items, which carry more conceptually and contextually relevant meaning than individual web content results. The performance results are discussed on the manually created Sports data-corpus and book library data-corpus with respect to K-means and K-means + PSO, and these existing systems are compared with the proposed system. The paper is organized as follows. Section 1 introduces the framework of the paper. Section 2 gives the literature survey with a detailed explanation of existing work. Section 3 gives an overview of the proposed web content mining by hybrid schemes combining ontology and frequent item clustering. Section 4 examines the hybrid scheme for the text mining model. Section 5 contains an example illustration. Section 6 presents the experimental and performance evaluation of the proposed system. Finally, Section 7 summarizes the conclusions of the paper.


II. Literature Review

This literature review covers massive-web content analysis. In general, it includes common ways of representing web content mining in ontology-based clustering systems, which move toward web content analysis. This research applies the ontology framework, whose different categorization methods for web content mining overcome the challenges present in massive websites. The review gives a brief introduction to the methods useful for massive-web content analysis; their outcomes motivate partition and association through hybrid schemes combining ontology and frequent item clustering techniques, which define the scope of the research. In addition, preprocessing is done using the Bag of Words process, stop word elimination and the Porter stemmer, and feature weighting and selection are used to reduce the dimensionality in web content mining.

Christopher Fox (1989) considered the removal of stop words, which are of little use. Stop word removal is a preprocessing step for text mining that eliminates words such as “on”, “at”, “the” and “in”. Those words are not essential to the mining process, so they are removed from the input data [1].

Paice Chris (1994) presented various stemming algorithms that remove suffixes effectively [2]. The work also points out problems in stemming for word standardization, such as etymologically related words and the irregularities involved in adding and removing suffixes. Bradley et al. (1998) discussed extracting information from a Web corpus, since a lot of Web data is found in HTML pages. For HTML, two processes are useful for extracting information from web pages: locating the minimal object-rich subtree, and refining the set of objects to eliminate irrelevant objects.

Berry et al. (2004) discussed text mining as the process of selecting information from different resources in order to find the relevant information in a data corpus [4]. Of this previously unknown, discovered information, 80% is relevant information retrieved on the basis of feature selection.

Zhilong et al. (2011) discussed effective dimensionality reduction. Traditional classification methods are not sufficient to meet the complexity [5], and existing text categorization suffers from irrelevant features and low accuracy. To address these constraints, they use effective dimensionality reduction to achieve performance enhancement, complexity reduction and storage space reduction for text categorization.

Gui-xian et al. (2014) discussed web text clustering with K-means and DBSCAN; their experimental results [6] showed that DBSCAN gives better results than K-means on Tibetan text clustering. Jaganathan et al. (2013) discussed efficiency with respect to execution time in web document clustering, even for large web document datasets [7]. They introduce various hybrid combinations of Particle Swarm Optimization and the K-means algorithm, namely the PSOK, KPSO and KPSOK algorithms, on various text document collections. Their model produces quick and better clustering.

Badal et al. (2010) developed frequent data itemset mining to reduce the complexity involved in mining data from a data warehouse [8]. Frequent data itemsets in large-scale databases are mined with VS_Apriori, and their work demonstrates an increase in the efficiency and speed of the clustering process. This framework is very useful for intelligent data mining. Jamsheela et al. (2015) compared various frequent itemset mining algorithms in order to improve the performance of frequent pattern mining [9]. Their results show that the FP-Tree-based approach achieves better performance than Apriori. Ferenc Kovács et al. (2013) introduced substantial frequent itemset mining for finding all frequent itemsets and generating rules from these large itemsets [10]. The Apriori algorithm was used to reduce the search space, and the count distribution algorithm was used to synchronize itemset counting. This framework also benefits the execution time and response time of the Apriori algorithm.

Twinkle et al. (2014) discussed semantic-term-based constraints from the ontology to improve the clustering process [11]. They overcome issues related to feature selection, the dimensionality of the feature space, the clustering process and the clustering algorithm. Tingting et al. (2014) discussed the semantic relationships among words [12]. To improve the quality of text clustering, they form an ontology hierarchical structure and extract disambiguated core features, which improves clustering performance significantly. Liping Jing et al. (2006) used an ontology-based distance measure to improve the performance of text clustering [13]. The ontologies are formed from the term mutual information matrix of the textual data using the vector space model. Their results show that the ontology-based distance measure makes text clustering approaches perform better.

Ragunath et al. (2015) proposed an ontology model with hierarchical representation and a concept extraction algorithm [14]. Ontology-based summarization computes a set of features for each sentence. Anbarasi et al. (2014) worked with very large databases for information retrieval, where clustering is a problem, so they used a clustering method in an ontology with WordNet integration [15]. Clustering is performed by the k-means algorithm, and information is retrieved efficiently even from very large databases. Paralic et al. (2003) discussed the Webocrat approach, based on ontology, for better retrieval efficiency with respect to recall and precision [16]. Additionally, the TF-IDF and LSI approaches are useful for improving ontology-based clustering.

Lei Zhang et al. (2010) introduced OFW-Clustering with feature weight calculation [17]. They use ontology semantic nodes, and the clustering result with feature weights is more accurate than that of other models. They use the Euclidean distance as the similarity measure. A mobile value-added business database is used to implement the process. Bloehdorn et al. (2004) discussed text document knowledge, which frequently appears in unsupervised text categorization [18]. Ontology-based text mining frameworks learn the target ontology from text documents and improve effectiveness.

Hmwayet et al. (2011) worked with large amounts of information to cluster text documents with the help of ontology-based concept weighting [19]. They achieve accurate document clustering by k-means with ontology, which resolves the semantic problem and shows effectiveness and practical value for the Google search engine. Feng Luo et al. (2002) resolved the concept-based model problem using domain-dependent ontologies with a self-organizing tree algorithm [20]. They improve effectiveness on datasets such as Reuters-21578 using hierarchical agglomerative clustering. All of these ontologies are generated automatically, which makes the process scalable.

This review provides a good understanding of the existing techniques and of the proposed combined technique. To the best of our knowledge, a comparison between hybrid schemes combining ontology and frequent item clustering techniques has not yet been explored. The present research is designed to make such a performance comparison between two areas: frequent-itemset-based clustering, and web content mining with its ontology framework.

III. Web Content Mining by Hybrid Schemes Combining Ontology and Frequent Item Clustering

With the growth of numeric data, graphical web data and textual data, web content mining has become a complex process. The vast amount of web data raises issues such as information overload, semantic complexity of documents, unstructured views and low efficiency. To address these constraints, this research work proposes frequent-itemset-based clustering and web content mining with its ontology framework.

3.1. Preprocessing and Feature Extraction

Data mining techniques include various preprocessing and feature extraction methods [21] that serve the common goals of preprocessing, such as performance and optimal quality, in the context of web content mining. Preprocessing presents the web content in an explicit form, that is, it turns text documents into a simple word format. The text is first processed as a normal string, and the sequence of string terms is then divided into a simple tokenized list of strings. This characterization of the document helps to convert the content of a document into a sequence of terms such as words or phrases. These processes efficiently remove unwanted entities and words from the sequence of words. Such preprocessing techniques improve learning accuracy [22] and model interpretability with reduced computational cost. Moreover, it is not necessary to extract all features from the original corpus vocabulary in the data model. In text mining tasks it is extremely common to remove basic function words alone. Term filtering methods are used to improve the speed and memory consumption of text clustering.
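The tokenization and stop word filtering described above can be illustrated with a minimal Python sketch. The stop word list and the function names `tokenize` and `remove_stop_words` are assumptions made for illustration; the paper does not specify its list or implementation.

```python
import re

# Illustrative stop-word list (an assumption; the paper does not give its list).
STOP_WORDS = {"a", "an", "and", "at", "in", "is", "of", "on", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

tokens = remove_stop_words(tokenize("The skater is on the ice at the hockey arena."))
print(tokens)
```

Filtering on a set keeps the lookup cost per token constant, which matters when the corpus vocabulary is large.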

After stop word removal, stemming reduces the number of unique terms by reducing words to their stem or root form. A stemming algorithm [23] [24] converts different word forms into a similar canonical form; the forms differ on the surface but should be considered the same when measuring similarity. For example [25], after stemming, “connection” becomes “connect” and “computing” becomes “compute”, while “connect” and “compute” stay as they are; they remain two different terms.
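As a rough illustration of suffix stripping, the sketch below applies a tiny hand-written rule table. The rules in `SUFFIX_RULES` and the function `toy_stem` are hypothetical and cover only the two examples in the text; they are not the Porter rule set, which a real system would use instead.

```python
# Toy suffix rules (an assumption for illustration, not the Porter rule set).
SUFFIX_RULES = [("tion", "t"), ("ting", "te"), ("s", "")]

def toy_stem(word):
    """Apply the first matching suffix rule; leave short words untouched."""
    for suffix, replacement in SUFFIX_RULES:
        # Require at least three characters to remain before the suffix.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

print([toy_stem(w) for w in ["connection", "connect", "computing", "compute"]])
```

Both word pairs collapse to the same canonical form, so a similarity measure that compares stems treats them as equal.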

Storage size, terms, overfitting, non-linear hypothesis spaces, irrelevant spaces, document density and dimensionality are characteristics of text documents [26], and they are present in online web content. Grammatical tagging, or word category disambiguation, marks up each word in a text (corpus) with its category. Based on word category disambiguation, related words are identified in the website, and the identified word categories are passed to the stop word elimination process. A feature is an item that is considered an attribute of a sample text. Feature extraction is important for identifying which items will be features of the document, since items can represent different concepts due to language ambiguity and can also be represented by pairs of words.

3.2. Hybrid Schemes Combining Ontology with Web Content Mining

Generally, ontology-based web content mining uses two categories of algorithms: agglomerative (bottom-up) and divisive (top-down) methods. Both are used to mine websites into homogeneous (objects) and heterogeneous (clusters) outcomes, with respect to knowledge discovery and dimensionality reduction. Ontology-based web content mining forms clusters of objects using the agglomerative criterion, based on the similarity measure obtained after preprocessing; put simply, the similarity between objects forms the basis of ontology-based web content mining. The web content mining methods described in this work are Hybrid Schemes combining Ontology with web content mining. The goal of these schemes is to improve the performance of classification, clustering and retrieval of websites. To carry out web content mining, well-known frequency weighting schemes, namely the TF/IDF-semantic and DFT weight measures, are used to mine content with respect to the “term matrix” and the “semantic matrix” respectively. The final results show that introducing ontology concepts into web content mining is promising and optimizes web content retrieval effectively.
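To make the “term matrix” concrete, here is a minimal sketch of plain TF-IDF weighting in Python. It assumes the textbook definition (term frequency times the logarithm of inverse document frequency); the semantic and DFT variants named in the text are not specified in enough detail to reproduce, and the function name `tf_idf_matrix` is hypothetical.

```python
import math
from collections import Counter

def tf_idf_matrix(docs):
    """Return (vocabulary, rows); rows[d][k] is the TF-IDF weight of term k in document d."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in docs for t in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([
            (counts[t] / len(doc)) * math.log(n_docs / df[t]) if doc else 0.0
            for t in vocab
        ])
    return vocab, rows

docs = [["extra", "skater"], ["hockey", "reference"], ["women", "hockey", "stats"]]
vocab, matrix = tf_idf_matrix(docs)
for doc, row in zip(docs, matrix):
    print(doc, [round(w, 3) for w in row])
```

A term such as "hockey" that occurs in two of the three token lists receives a lower weight than a term that is unique to one list.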


Figure 2. Combining Ontology with Web Content Mining

Web content mining works with different algorithms. In addition, time plays a significant role in the content mining process: when a huge number of web pages take part in cluster formation, retrieval takes more time because of the complexity.

Ontology helps to establish what a specific term means in the combined process of web content mining and feature selection. Ontologies provide a way to describe the meaning of the terms and relationships in web content mining. Before it is submitted to the concept ontology, the web content is processed as follows: with the help of the ontology collection, the concepts in each website are identified based on the ontology model. The identified terms are matched against the synonyms, meronyms and hypernyms of concepts in the ontology. The input data are then applied to ontology techniques for web content mining. From these, many Frequent Item Clustering representations of the web content are constructed, which result from multiple clustering runs.
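The matching of terms against synonyms, meronyms and hypernyms can be sketched with a small dictionary-based ontology. Every entry in `ONTOLOGY` and the function `match_concepts` are hypothetical and serve only to show the lookup; the paper does not publish its ontology.

```python
# Hypothetical miniature sports ontology; every entry is illustrative.
ONTOLOGY = {
    "hockey": {"synonyms": {"ice hockey"}, "hypernyms": {"sport"}, "meronyms": {"puck", "stick"}},
    "skater": {"synonyms": {"ice skater"}, "hypernyms": {"athlete"}, "meronyms": {"skate"}},
}

def match_concepts(terms, ontology=ONTOLOGY):
    """Map each term to the (concept, relation) pairs under which it is found."""
    matches = {}
    for term in terms:
        found = []
        for concept, relations in ontology.items():
            if term == concept:
                found.append((concept, "concept"))
            for relation, words in relations.items():
                if term in words:
                    found.append((concept, relation))
        if found:
            matches[term] = found
    return matches

print(match_concepts(["hockey", "puck", "athlete", "reference"]))
```

Terms with no entry, such as "reference" here, are simply left out of the result, which mirrors the idea that unmatched terms carry no ontology concept.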


Figure 3. Algorithm for Ontology with Web Content Mining

Ontology-based web content mining is compared in performance with the existing traditional system. The web content mining methods described in this work are Hybrid Schemes combining Ontology and web content mining techniques. Many approaches are beginning to deal with such hybrid schemes, especially through term frequency weight statistics of keywords. The goal of these weighting schemes is to improve the performance of classification and clustering for website retrieval from the manually created sports dataset. The final results show that introducing ontology concepts into web content mining is promising and improves efficiency.

IV. Hybrid Scheme for Text Mining Model

Retrieving the appropriate web pages is a very difficult process because content mining that uses a traditional algorithm alone is not efficient for retrieval on large datasets. To reduce this problem, we use two levels of clustering: Hybrid Schemes combining Ontology with web content mining techniques, and Hybrid Schemes combining Ontology with Frequent Item Clustering.

This study proposes a Hybrid Scheme for text mining that combines ontology with frequent item clustering. An ontology refers to a set of objects, concepts and other relationships, together known as ontology entities, that describe information within an application domain.

Input: pre-processed data (package count)
set pkg_counter = 0
start: pkg_count = pkg_counter;
    construct OWL model(pkg(pkg_counter));
    pkg_counter++; continue;
start: set class_counter = 0
    class_count = class_counter;
    construct OWL model(class(class_counter));
    class_counter++; continue;
start: set param_counter = 0
    param_count = param_counter;
    start: set method_counter = 0
        method_count = method_counter;
        construct OWL model(method(method_counter));
        method_counter++; continue;
    construct OWL model(param(param_counter));
    param_counter++; break;
Result: write the OWL model
repeat until web content mining reaches the accurate website
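One possible reading of the algorithm above is a nested walk over packages, classes, methods and parameters that emits one model entry per element. The Python sketch below follows that reading. The input layout, the relation names (`hasClass`, `hasMethod`, `hasParameter`) and the function `build_model` are assumptions for illustration, and the output is a plain list of triples rather than a real OWL serialization.

```python
def build_model(packages):
    """Walk packages, classes, methods and parameters; return (subject, relation, object) triples."""
    triples = []
    for pkg in packages:
        triples.append((pkg["name"], "type", "Package"))
        for cls in pkg.get("classes", []):
            triples.append((cls["name"], "type", "Class"))
            triples.append((pkg["name"], "hasClass", cls["name"]))
            for method in cls.get("methods", []):
                triples.append((method["name"], "type", "Method"))
                triples.append((cls["name"], "hasMethod", method["name"]))
                for param in method.get("params", []):
                    triples.append((param, "type", "Parameter"))
                    triples.append((method["name"], "hasParameter", param))
    return triples

# Hypothetical input: one package with one class, one method and one parameter.
sample = [{"name": "sports",
           "classes": [{"name": "Hockey",
                        "methods": [{"name": "stats", "params": ["season"]}]}]}]
for triple in build_model(sample):
    print(triple)
```

Iterating with `for` loops replaces the explicit counters of the pseudocode; each counter corresponds to one loop level.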


In this method, ontology helps to identify the similarity between sets of concepts, which supports automatic retrieval and filtering of information. For example, the keyword “volleyball” may occur three times; the data are mapped against the corresponding word file format. The term frequency values are computed and mapped to the corresponding document id. This can be done using a query interface. Bayes learning is applied in the ontology distance measure: the term is matched with a concept of the ontology using Bayes’ learning.

P(w|c) = P(c|w) × P(w) / P(c), where w is the word in the query or document and c is the concept in the ontology. It provides a way to identify which entity classes are most similar to each other.
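As a worked illustration of Bayes' rule, P(w|c) = P(c|w) P(w) / P(c), the following sketch estimates the three probabilities from observed (word, concept) pairs. The sample pairs and the function name `p_word_given_concept` are hypothetical; the paper does not describe how its probabilities are estimated.

```python
from collections import Counter

def p_word_given_concept(pairs, word, concept):
    """Compute P(w|c) = P(c|w) * P(w) / P(c) from observed (word, concept) pairs."""
    total = len(pairs)
    word_counts = Counter(w for w, _ in pairs)
    concept_counts = Counter(c for _, c in pairs)
    joint = Counter(pairs)
    if word_counts[word] == 0 or concept_counts[concept] == 0:
        return 0.0
    p_c_given_w = joint[(word, concept)] / word_counts[word]
    p_w = word_counts[word] / total
    p_c = concept_counts[concept] / total
    return p_c_given_w * p_w / p_c

# Hypothetical observations: "volleyball" occurs three times under the concept "sport".
pairs = [("volleyball", "sport"), ("volleyball", "sport"), ("volleyball", "sport"),
         ("puck", "sport"), ("library", "book")]
print(p_word_given_concept(pairs, "volleyball", "sport"))
```

With these counts, three of the four observations under "sport" are "volleyball", so the estimate is about 0.75.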

Bayes’ learning produces the ontology and the frequent item clustering. In this research, websites are mined into groups that bring similar content and websites together, based on the clusters formed. After preprocessing, web content mining techniques such as keyword-based search, TF-IDF-based search, weight comparison and Frequent Item Clustering are measured to determine mining efficiency. Finally, the efficiently retrieved web pages are produced.
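Frequent item clustering can be sketched as two steps: find term sets that occur in at least a minimum number of documents, then group each document under the largest frequent term set it contains. The code below enumerates candidates by brute force rather than with Apriori or FP-Growth pruning, and the function names, the `max_size` limit and the sample documents are assumptions for illustration.

```python
from itertools import combinations

def frequent_itemsets(docs, min_support, max_size=2):
    """Return {itemset: support} for term sets contained in at least min_support documents."""
    doc_sets = [set(d) for d in docs]
    vocab = sorted(set().union(*doc_sets)) if doc_sets else []
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(vocab, size):
            support = sum(1 for d in doc_sets if d.issuperset(candidate))
            if support >= min_support:
                result[candidate] = support
    return result

def cluster_by_itemset(docs, itemsets):
    """Assign each document index to the largest frequent itemset it contains."""
    clusters = {}
    for index, doc in enumerate(docs):
        covered = [s for s in itemsets if set(s) <= set(doc)]
        if covered:
            # Prefer longer itemsets, then higher support.
            best = max(covered, key=lambda s: (len(s), itemsets[s]))
            clusters.setdefault(best, []).append(index)
    return clusters

docs = [["hockey", "stats", "women"], ["hockey", "reference", "stats"],
        ["extra", "skater"], ["hockey", "skater"]]
itemsets = frequent_itemsets(docs, min_support=2)
print(itemsets)
print(cluster_by_itemset(docs, itemsets))
```

The first two documents share the frequent pair ("hockey", "stats") and land in the same cluster, which is the sense in which a frequent itemset carries more context than a single term.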

The implementation results show that the web content mining techniques form different similarity metrics, which give different mining results; these are used to analyze the metrics within a short execution time.

The two web content mining techniques are merged. Hybrid Schemes combining Ontology with web content mining techniques are first used to cluster the web pages, and these clustered web pages are then passed to Hybrid Schemes combining Ontology with Frequent Item Clustering to cluster the retrieved web pages efficiently. This multi-stage Hybrid Scheme approach is a sequential process. Here the ontology is very useful as a knowledge repository in which concepts and terms are defined together with the relationships between these concepts; it automates information processing and can facilitate text mining in a domain. Thus, an ontology is a vocabulary for metadata that machines can understand, which helps to automate information processing. So, ontology-based text content mining considers only similar terms, without considering the meaning of the terms.

In ontology-based content mining, any dissimilar data would be considered different documents unless some semantic meaning is added. If semantic meaning is added, both documents can be related to each other under the category of a particular domain.

The following derivation is used to carry out the text term (T) based processing for efficient clustering. Consider a dataset D containing n_D documents, represented as

\[ D = \{ D_j;\ j = 1, \dots, n_D \} \]

and its terms, represented as

\[ T = \{ t_i;\ i = 1, \dots, n_T \} \]

The input is given to the clustering without the irrelevant terms of the websites. A distribution-based measure and behavioral characteristics are combined into a distance measure, which is used as the similarity measure and improves the overall clustering performance.

To calculate the document distance measure, the vector V* is produced. V*(D) represents the document description and is a set of terms in which each term t is associated with its normalized term frequency tf. Thus, each element of the vector V*(D_i) is a normalized term frequency, where tf_{k,i} is the frequency of the k-th term in text D_i. V* therefore represents a text as a vector that uses term frequencies to set the weights associated with each element; the index i denotes the document D_i in which the term occurs. The distance Δ^(f) between a pair of input texts (D_i, D_j) is calculated as

\[ \Delta^{(f)}(D_i, D_j) = \left[ \sum_{k=1}^{n_T} \left| v^{*}_{k,i} - v^{*}_{k,j} \right|^{p} \right]^{1/p} \]

In the proposed model p = 1, so the measure actually implements the Manhattan distance metric. The second vector V** takes into consideration the structural properties of a text and is represented as a set of probability distributions associated with the term vector. Each term t ∈ T occurring in a dataset D is associated with a distribution function that gives the spatial probability density function (pdf) of t in D. Such a distribution, p_{t,i}(s), is generated under the following hypothesis: when the k-th occurrence of a term t is detected at the normalized position s_k ∈ [0, 1] in the text, the spatial pdf of the term can be approximated by a Gaussian distribution centered around s_k. In other words, if the term t_j is found at position s_k within a document, a second dataset with similar structure is expected to include the same term at the same position or in a neighborhood thereof, with a probability defined by a Gaussian pdf. To derive a formal expression of the pdf, assume that the i-th text, D_i, holds n_o occurrences of the term after simplifications. The spatial pdf is then defined as

\[ p_{t,i}(s) = A \sum_{k=1}^{n_o} G(s; s_k, \lambda) \]

where A and λ are normalization terms and G is the Gaussian pdf given by

\[ G(x; \mu, \lambda) = \frac{1}{\sqrt{2\pi}\,\lambda} \exp\left[ -\frac{(x - \mu)^2}{2\lambda^2} \right] \]

From this, the second term vector V** is calculated by considering a discrete approximation of the spatial pdf. The dataset D is segmented evenly into S sections, from which S-dimensional vectors are generated for each term t ∈ T. Each element estimates the probability of the term t occurring in the corresponding section of the text. Thus, V**(D) is represented as an array of n_T vectors of dimension S. The distance between the probability vectors thus created (V**) is calculated using the Euclidean metric (Equation 4.7) and is written Δ^(b) for two documents D_i and D_j:

\[ \Delta^{(b)}(D_i, D_j) = \sum_{k=1}^{n_T} \Delta^{(b)}_{k}\left( v^{**}_{k,i}, v^{**}_{k,j} \right) = \sum_{k=1}^{n_T} \sum_{s=1}^{S} \left[ v^{**}_{(k)s,i} - v^{**}_{(k)s,j} \right]^{2} \]

From the calculated Δ^(f) and Δ^(b), the final distance is calculated as

\[ \Delta(D_i, D_j) = \alpha\, \Delta^{(f)}(D_i, D_j) + (1 - \alpha)\, \Delta^{(b)}(D_i, D_j) \]

Here α ∈ [0, 1] is the mixing coefficient weight.

For the last stage, the actual clustering, the ontology combined with frequent itemset clustering finds the relations between individual terms and their objects, even in sentences of complex structure, and facilitates the implementation of knowledge about the hierarchical structure of categories. The proposed web content mining combines the concept clustering algorithms.
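The hybrid document distance of this section mixes a term-frequency distance with a positional distance through a weight in [0, 1]. The sketch below is a simplified Python illustration under stated assumptions: term frequencies are normalized by the most frequent term, the positional vectors are plain per-section histograms rather than the Gaussian-smoothed pdf, and all function names and default values (`alpha=0.5`, `sections=2`, `p=1`) are hypothetical.

```python
def frequency_vector(tokens, vocab):
    """Term frequencies normalized by the most frequent term (normalization is an assumption)."""
    counts = [tokens.count(t) for t in vocab]
    peak = max(counts) if counts and max(counts) > 0 else 1
    return [c / peak for c in counts]

def position_vectors(tokens, vocab, sections):
    """Per term, the share of its occurrences in each of `sections` equal parts of the text."""
    vectors = []
    for term in vocab:
        hist = [0.0] * sections
        positions = [i for i, tok in enumerate(tokens) if tok == term]
        for i in positions:
            section = min(int(i / len(tokens) * sections), sections - 1)
            hist[section] += 1.0 / len(positions)
        vectors.append(hist)
    return vectors

def hybrid_distance(doc_i, doc_j, vocab, alpha=0.5, sections=2, p=1):
    """Mix the frequency distance and the positional distance with weight alpha."""
    fi, fj = frequency_vector(doc_i, vocab), frequency_vector(doc_j, vocab)
    delta_f = sum(abs(a - b) ** p for a, b in zip(fi, fj)) ** (1 / p)
    bi = position_vectors(doc_i, vocab, sections)
    bj = position_vectors(doc_j, vocab, sections)
    delta_b = sum((a - b) ** 2 for vi, vj in zip(bi, bj) for a, b in zip(vi, vj))
    return alpha * delta_f + (1 - alpha) * delta_b

vocab = ["hockey", "stats", "women"]
doc_a = ["hockey", "stats", "hockey", "women"]
doc_b = ["women", "hockey", "stats", "stats"]
print(hybrid_distance(doc_a, doc_b, vocab))
```

Setting `alpha=1.0` ignores where terms occur and compares frequencies only, while `alpha=0.0` compares only the positional distributions.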

The following steps are used to find the Hybrid schemes.

Step 1: Select an appropriate group of keywords based on the proper nouns and verbs in the corpus.

Step 2: Arrange the retrieved web content for each website according to the weights of the input text cluster; this is called the ontology-based weighting scheme.

Step 3: Finally, form the ontology tree structure by explaining and formulating it for a particular domain. This is useful for interoperability between ontologies.

The scheme identifies irrelevant or redundant features in order to achieve accurate text clustering.


V. Example Illustration

The following example illustrates the dependency graph and ontology. The input consists of three different texts:

Text 1: extra, skater

Text 2: Hockey, reference

Text 3: women, hockey, stats

These three texts are used as elements in forming the ontology. Depending on the nouns and verbs, the selected words appear in the set of websites below.

Web site 1: http://www.extraskater.com/

Web site 2: https://www.hockey-reference.com/

Web site 3: http://womenshockeystats.com/?nr=0
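The correspondence between the three texts and the three site URLs can be checked with a short sketch. It assumes a keyword "matches" a site when it occurs as a substring of the URL's host name; the function name `matching_sites` and this matching rule are illustrative assumptions.

```python
from urllib.parse import urlparse

TEXTS = {
    "Text 1": ["extra", "skater"],
    "Text 2": ["Hockey", "reference"],
    "Text 3": ["women", "hockey", "stats"],
}

SITES = [
    "http://www.extraskater.com/",
    "https://www.hockey-reference.com/",
    "http://womenshockeystats.com/?nr=0",
]


def matching_sites(keywords, urls):
    """Return the URLs whose host name contains every keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    hits = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if all(k in host for k in wanted):
            hits.append(url)
    return hits


if __name__ == "__main__":
    for name, keywords in TEXTS.items():
        print(name, "->", matching_sites(keywords, SITES))
```

Requiring every keyword to match keeps Text 2 and Text 3 apart even though both contain "hockey".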

To check the validity of the ontology, a small set of words is selected from the full text based on nouns and verbs. These words are used to build the ontology tree. Elements other than these are unknown because they have no entry in the lexicon, so they are removed before web content mining. To resolve this contradiction, we use the ontology as a specification of conceptualizations.

This yields the following dependency graph.

Figure 4. Resulted hierarchy after processing the Site URLs


The above dependency graph was initially built from 300 websites in the manually created sports data corpus. The calculated dependency graph is used to find the feature reduction for each text set in the data corpus. The reduction rate increased for all inputs. This result shows that constructing the dependency graph based on the ontology addresses the sparsity problem (which affects processing time) by reducing the number of features.
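The feature reduction can be illustrated with a small sketch. It assumes that a term is kept only if it has an entry in the lexicon and that the reduction rate is the fraction of terms removed; the function name `reduce_features`, the sample terms, and the sample lexicon are hypothetical.

```python
def reduce_features(terms, lexicon):
    """Keep only terms found in the lexicon and report the fraction removed."""
    known = {entry.lower() for entry in lexicon}
    kept = [t for t in terms if t.lower() in known]
    rate = 1.0 - len(kept) / len(terms) if terms else 0.0
    return kept, rate


if __name__ == "__main__":
    terms = ["women", "hockey", "stats", "nr"]
    lexicon = {"women", "hockey", "stats", "skater"}
    kept, rate = reduce_features(terms, lexicon)
    print(kept, rate)
```

A higher rate means fewer features reach the clustering stage, which is how the sparsity problem is reduced in this sketch.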

VI. Empirical Evaluation

6.1. Datasets and Setup

The Sports dataset is a collection of real-world sports stories in English under different categories. It contains many entries, each drawn from different websites. The citation and details are available for every entry, including Date, Topics, Title, and content.

6.2. Experimental Results

The performance measures precision, recall, and F-measure are calculated for the web content mining scenario, which uses a Visual Basic front end. The input data for web content mining are shown in the raw data and cleaned data sections.

Figure 5. Web content mining raw data and the cleaned data

Moreover, the MySQL connector is used to store the ontology history and to support the retrieval process. The performance measures are calculated using the following equations:
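As a minimal sketch, assuming the standard set-based definitions of precision, recall, and F-measure, the measures can be computed as follows. The function name `precision_recall_f` and the sample identifiers are illustrative only.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for two collections of item ids.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    F-measure = harmonic mean of precision and recall
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Four retrieved sites, three relevant sites, two in common.
    print(precision_recall_f([1, 2, 3, 4], [2, 3, 5]))
```

The zero-division guards return 0.0 when nothing is retrieved or nothing is relevant, which is a convention chosen here rather than one stated in the text.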


Figure 6. The ontology history and retrieval process

The above graph shows the sports dataset elements. These elements are used in this research to make text mining and website classification effective and efficient. The following graph shows the precision and recall of the rules extracted above.


Figure 7. Precision and recall comparison

Figure 7 gives the precision and recall individually for each dataset corpus, which is useful for analyzing the F-measure [5].

Figure 8. Precision, recall, and F-measure

Figure 8 shows the performance measures of web content mining for each dataset corpus, obtained with the hybrid schemes combining ontology and frequent item clustering techniques. It also compares precision, recall, and F-measure.

Constructing the web content retrieval framework based on the ontology can reduce the memory size with respect to frequent item clustering. Moreover, semantic problems are also addressed by extracting only the meaningful sentences. Figure 4 represents the dependency relations between the site URLs and displays the meaning of the whole sentence through the selected verbs and nouns on the dependency graph. In the final result, the separate sentences are labeled with their grammatical functions as well as with single-meaning words. In contrast to phrase structure grammars, dependency grammars can therefore be used to directly express grammatical functions as a type of dependency graph, with the aim of improving the text mining results.


VII. Conclusion

In web content mining, the eventual goal is to help users find and list the appropriate websites they need quickly and accurately, based on keywords. A very wide range of information is available on the World Wide Web, and the extraction of web results is handled by a web crawler. Crawlers have been used successfully in applications involving web pages, multi-threaded downloaders, URL schedulers, text and metadata storage, web data, and more. The use of web content mining techniques on text documents is known here as intelligent web content analysis. In particular, Internet accessibility and web content mining are increasing the scale and usage of web crawlers based on web content mining. Keyword-based and URL-based links are extracted from websites and used to move the crawl forward. In the web content mining approach, the similarity among the source ontologies is calculated in order to define the order in which they will be merged. The approach discovers knowledge from websites through effective methods such as discovering relevant web pages and “keyword and URL based links extraction”. In this research work, the web content mining approach and the discovered patterns are used to find accurate, relevant features on the World Wide Web. Based on keywords, attributes, synonyms, and homonyms, “web content mining” and “clustering based mining” give a well-derived specification for web content mining. In this way, the approach is efficient for web content mining: it handles duplicate occurrences, knowledge discovery, dynamic web page content, automatic topic extraction, and fast information retrieval. Moreover, all of these processes are made easy and efficient, and the results indicate that combining ontology with web content mining is promising and improves the web content mining process. Finally, our approach improves the “mining accuracy”, “complexity”, and “time space” of the extraction process, and thereby the quality of the content mining results.

References

1. Christopher Fox. A stop list for general text. In ACM SIGIR Forum, volume 24, pages 19–21. ACM, 1989.

2. Paice Chris D. “An evaluation method for stemming algorithms”. Proceedings of the 17th

annual international ACM SIGIR conference on Research and development in

information retrieval. 1994, 42- 50.

3. P. Bradley, U. Fayyad, and C. Reina, “Scaling clustering algorithms to large databases,” In Proc. of KDD-1998, New York, NY, USA, August 1998, pages 9–15, Menlo Park, CA, USA, 1998. AAAI Press.

4. Berry Michael W., (2004), “Automatic Discovery of Similar Words”, in “Survey of Text Mining: Clustering, Classification and Retrieval”, Springer Verlag, New York, LLC, 24-43.


5. Zhilong Zhen, Haijuan Wang, Lixin Han, Zhan Shi, “Categorical Document Frequency Based Feature Selection for Text Categorization”, IEEE International Conference of Information Technology, Computer Engineering and Management Sciences, IEEE Computer Society, 2011.

6. Gui-xian Xu, Li-rong Qiu, Lu Yang, “Tibetan Text Clustering Based on Machine Learning”, Journal of Digital Information Management, Volume 12, Number 3, June 2014.

7. P Jaganathan, S Jaiganesh, “An improved K-means algorithm combined with Particle

Swarm Optimization approach for efficient web document clustering”, IEEE

International Conference on Green Computing, Communication and Conservation of

Energy (ICGCE), pp 772-776, 2013.

8. Badal, Shruti Tripathi, “Frequent Data Itemset Mining Using VS_Apriori Algorithms”,

International Journal on Computer Science and Engineering Vol. 02, No. 04, 1111-1118,

2010

9. O.Jamsheela, Raju.G, “Frequent Itemset Mining Algorithms: A Literature Survey”, IEEE

International Advance Computing Conference (IACC), pp: 1099 – 1104, 2015.

10. Ferenc Kovács, János Illés, “Frequent Itemset Mining on Hadoop”, IEEE 9th International Conference on Computational Cybernetics, pp: 241-245, July 2013.

11. Twinkle S vadasa, Jasmine Jhab, “A Literature Survey on Text Document Clustering and

Ontology based techniques”, International Journal of Innovative and Emerging Research

in Engineering Volume 1, Issue 2, 2014.

12. Tingting Wei, Yonghe Lu, Huiyou Chang, Qiang Zhou, Xianyu Bao, “A semantic approach for text clustering using WordNet and lexical chains”, Published by Elsevier Ltd. Journal homepage: www.elsevier.com/locate/eswa.

13. Allimuthu, U.: BAU FAM: biometric-blacklisting anonymous users using fictitious and

adroit manager. J. Adv. Res. Dyn. Control Syst. (12-Special), 722–737 (2017)

14. Ragunath, Sivaranjani, “Ontology Based Text Document Summarization System Using

Concept Terms”, ARPN Journal of Engineering and Applied Sciences, VOL. 10, NO. 6,

APRIL 2015

15. Anbarasi, Iswarya, Sindhuja, Yogabindiya, “ONTOLOGY ORIENTED CONCEPT

BASED CLUSTERING”, IJRET: International Journal of Research in Engineering and

Technology, Volume: 03 Issue: 02, Feb-2014.

16. J. Paralic and I. Kostial, “Ontology-based Information Retrieval,” Proc. of the 14th Int. Conf. on Inform. and Intelligent Syst. (IIS 2003), Varazdin, Croatia, pp. 23-28, 2003.

17. Lei Zhang, Zhichao Wang, “Ontology-based Clustering Algorithm with Feature Weights”, pp: 2959-2966, 2010.


18. Bloehdorn, Cimiano, Hotho, Staab, “An Ontology-based Framework for Text Mining”,

July 2004.

19. Hmway Hmway Tar, Thi Thi Soe Nyunt, “Ontology-Based Concept Weighting for Text

Documents”, International Conference on Information Communication and Management

IPCSIT vol.16, IACSIT Press, Singapore, 2011

20. L.; Feng Luo, “Ontology construction for information selection,” in Proceedings of the 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2002), pp. 122-127, 2002.

21. V. Srividhya, R. Anitha, “Evaluating Preprocessing Techniques in Text Categorization”, International Journal of Computer Science and Application Issue 2010.

22. Sirsat S. R, Vinay Chavan, mahalle .H.S, “Strength and Accuracy Analysis of Affix

Removal Stemming Algorithms”, International Journal of Computer Science and

Information Technologies, Vol. 4 (2), 265 – 269, 2013.

23. M.Thangarasu, Dr.R.Manavalan “A Literature Review: Stemming Algorithms for Indian

Languages” International Journal of Computer Trends and Technology (IJCTT) – volume

4 Issue 8–August 2013 ISSN: 2231-2803 Page 2582-2584

24. Julie Beth Lovins “Development of a Stemming Algorithm”, Mechanical Translation and

Computational Linguistics, vol.11, nos.1 and 2, March and June 1968.

25. Levent Özgür and Tunga Güngör, “Analysis of Stemming Alternatives and Dependency Pattern Support in Text Classification”, Advances in Computational Linguistics, Research in Computing Science, pp. 195-206, 2009.

26. Boulis, C. and Ostendorf, M. (2005). Text classification by augmenting the bag-of-words

representation with redundancy compensated bigrams. In Proceedings of the International

Workshop on Feature Selection in Data Mining, in conjunction with SIAM SDM-, pages

9-16, 2005
