MASSIVE-WEB CONTENT ANALYTIC BASED WEB CONTENT
MINING BY HYBRID SCHEMES COMBINING ONTOLOGY AND
FREQUENT ITEM CLUSTERING TECHNIQUES
Nagappan, V.K.¹, Dr. P. Elango²
1Research Scholar, Research and Development Center
Bharathiar University, Coimbatore.
2Assistant Professor, Dept. of Information Technology,
Perunthalaivar Kamarajar Institute of Engineering and Technology (PKIET)
(Govt. of Puducherry Institution)
Nedungadu, Karaikal, UT of Puducherry.
Corresponding authors: [email protected], [email protected]
Abstract: In the recent era, the field of massive-web content analytics has developed rapidly. Within massive-web content analytics, web content mining faces the challenging tasks of irregular classification, response analysis, and web content summarization, with the aim of delivering higher-quality web content from websites. Conventional methods are not sufficient for massive-web content analytics. A keyword-based web content mining approach allows the analyst to represent the composite structure of websites, to apply knowledge about the hierarchical structure of categories, and to use web content about relationships between websites and individual input data. We use manually created corpora, a sports data corpus and a book-library data corpus, for massive-web content analytics. Finally, the keyword-based web content mining algorithms studied here are two well-known hybrid techniques, Hybrid Schemes combining ontology with web content mining techniques and Hybrid Schemes combining ontology with Frequent Item Clustering, for improving the efficiency and purity of massive-web content analytics.
Keywords: web-content analytics, hierarchical structural categories, clustering efficiency, purity
I. Introduction
Web-content mining is a process for achieving high-quality clustering of web-content information. Partition-based algorithms such as K-means, EM, and sGEM, and rule-mining-based algorithms such as Apriori, FP-Growth, and FP-Bonsai, are useful methods for web-content mining. All of these partition-based and rule-mining-based algorithms are used to group relevant data into the same cluster, but they have some drawbacks with respect to the objectives of web-content mining. To rectify these drawbacks, the proposed research discusses techniques for grouping data using "web content analytics" on a manually created sports data corpus and a book-library data corpus.

International Journal of Pure and Applied Mathematics, Volume 118, No. 20, 2018, 2631-2649, ISSN: 1314-3395 (on-line version), url: http://www.ijpam.eu, Special Issue
Figure 1. Web content mining by hybrid schemes combining ontology and frequent item clustering
Categorization decides to which of a set of predefined categories a text belongs. Furthermore, unknown text sets are placed into predefined categories according to a known categorization when this is suitable. Finally, these categories are sorted according to a well-known category-based web content structure. The organization of this category-based web content structure is discussed under the process of frequent-itemset-based clustering; this organization reduces the average distance between the web content results. Frequent itemset mining is a clustering process that studies sets of frequent items, which carry more conceptual and contextual meaning than individual web content results. The performance results are discussed on the manually created sports data corpus and the book-library data corpus with respect to K-means and K-means+PSO, and these existing systems are compared with the proposed system. This paper is organized as follows. Section 1 describes the introductory framework. Section 2 gives the literature survey with a detailed explanation of the existing work. Section 3 gives an overview of the proposed web content mining by hybrid schemes combining ontology and frequent item clustering. Section 4 examines the hybrid scheme for the text mining model. Section 5 contains the example illustration details. Section 6 examines the experimental and performance evaluation of the proposed system. Finally, the conclusion of the paper is given in Section 7.
II. Literature Review
The literature review discusses massive-web content analysis. In general, common ways of representing web content mining in an ontology-based clustering system move toward web content analysis. This research applies an ontology framework whose categorization methods of web content mining overcome the challenges present in massive websites. This section gives a brief introduction to the methods useful for massive-web content analysis; their outcomes motivate the partition-and-association hybrid schemes combining ontology and frequent item clustering techniques that define the scope of the research. In addition, preprocessing is done using bag-of-words processing, stop-word elimination, and the Porter stemmer, while feature weighting and selection reduce the dimensionality of the web content mining.

Christopher Fox (1989) considered the removal of stop words. A key preprocessing step for text mining is stop-word removal, which eliminates common function words such as "on", "at", "the", and "in". These words are not essential for the mining process and are therefore removed from the input data [1].
Paice (1994) presented various stemming algorithms, which are useful for eliminating suffixes effectively [2]. He also discusses problems in stemming for word standardization, such as etymologically related words and the irregularities involved in adding and removing suffixes. Bradley et al. (1998) discussed extracting information from a Web corpus, since a great deal of Web data is found in HTML pages [3]. Because HTML is used, processes such as locating the minimal object-rich subtree and refining the set of objects to eliminate irrelevant objects are useful for extracting information from web pages.
Berry et al. (2004) discussed the text mining process of selecting information from different resources in order to find relevant information in a data corpus [4]; about 80% of the relevant information is retrieved based on feature selection. Zhilong et al. (2011) discuss effective dimensionality reduction [5]. Traditional classification methods are not sufficient to meet the complexity, and suffer from irrelevant features and low accuracy in text categorization. To address these constraints, they use effective dimensionality reduction to achieve performance enhancement, complexity reduction, and storage-space reduction for text categorization.
Gui-xian et al. (2014) discussed web text clustering with K-means and DBSCAN; their experimental results [6] show that DBSCAN gives better results than K-means on Tibetan text clustering. Jaganathan et al. (2013) discussed efficiency with respect to execution time for web document clustering, even when the web documents form a large dataset [7]. They introduce various combinations of hybrid processes based on Particle Swarm Optimization and the K-means algorithm, namely the PSOK, KPSO, and KPSOK algorithms, on various text document collections; their model produces quicker and better clustering.
Badal et al. (2010) created Frequent Data Itemset Mining to reduce the complexity involved in mining data from a data warehouse [8]. Frequent data itemsets in large-scale databases are mined using VS_Apriori; their work demonstrates an increase in the efficiency and speed of the clustering process, and the framework is very useful for intelligent data mining. Jamsheela et al. (2015) compare various schemes for frequent itemset mining algorithms to improve the performance of frequent pattern mining [9]; their results show that the FP-Tree-based approach achieves better performance than Apriori. Ferenc Kovács et al. (2013) introduce substantial frequent itemset mining to find all of the frequent itemsets and to generate rules from these large itemsets [10]. Apriori algorithms were used for reducing the search space, and a count-distribution algorithm was used for itemset-counting synchronization. This framework also improves the execution time and response time of the Apriori algorithm.
Twinkle et al. (2014) discussed semantic-term-based constraints given by an ontology to improve the clustering process [11]. They overcome issues related to the selection of features, the dimensionality of the feature space, the clustering process, and the clustering algorithm. Tingting et al. (2014) discussed semantic relationships among words [12]; to improve the quality of text clustering, they form an ontology hierarchical structure, extract disambiguated core features, and improve clustering performance significantly. Liping Jing et al. (2006) use an ontology-based distance measure to improve the performance of text clustering [13]. The ontologies are formed from the term mutual-information matrix of the textual data under a vector space model. Their results show that the ontology-based distance measure makes text clustering approaches perform better.
Ragunath et al. (2015) propose an ontology model of hierarchical representation with a concept-extraction algorithm [14]; ontology-based summarization computes a set of features for each sentence. Anbarasi et al. (2014), working with huge databases for information retrieval, face the problem of clustering, so they use an ontology-based clustering method with WordNet integration to drive the clustering process [15]. Clustering is performed by the k-means algorithm; finally, they retrieve information efficiently even from very large databases. Paralic et al. (2003) discussed the Webocrat approach, based on ontology, for better retrieval efficiency with respect to recall and precision [16]. Additionally, TF-IDF and LSI approaches are useful for improving ontology-based clustering.

Lei Zhang et al. (2010) introduce OFW-Clustering with feature-weight calculation [17]. They produce an ontology semantic node for clustering; the result with feature weights is more accurate than the other mode. They use Euclidean distance as the similarity measure, and a mobile value-added business database is used to implement the process. Bloehdorn et al. (2004) discussed text-document knowledge that frequently appears in unsupervised text categorization [18], using ontology-based text mining frameworks to learn the target ontology from text documents with improved effectiveness. Hmway et al. (2011), working with large amounts of information, cluster text documents with the help of ontology-based concept weighting [19]. They achieve accurate document clustering with k-means plus ontology; this resolves the semantic problem and shows effectiveness and practical value on the Google search engine. Feng Luo et al. (2002) resolve the concept-based model problem using domain-dependent ontologies with a self-organizing tree algorithm [20]. They improve effectiveness on datasets such as Reuters-21578 using hierarchical agglomerative clustering. All of these ontologies are generated automatically, making the process scalable.
This review summary provides a good understanding of the focus on enhancing the existing and proposed combined techniques. To the best of our knowledge, a comparison between hybrid schemes combining ontology and frequent item clustering techniques has not yet been probed. The present research is designed to perform such a performance comparison between two areas: frequent-itemset-based clustering, and web content mining with its ontology framework.
III. Web Content Mining by Hybrid Schemes Combining Ontology and Frequent Item
Clustering
Due to the growth of numeric data, graphical web data, and textual data, web content mining has become a complex process. The growth of vast amounts of web data raises issues such as information overload, semantic document complexity, unstructured views, and low efficiency. To rectify and fulfil these constraints, this research work proposes frequent-itemset-based clustering and web content mining with its ontology framework.
3.1. Preprocessing and Feature Extraction
Data mining offers various preprocessing and feature extraction methods [21] that serve the common goals of preprocessing, performance and optimal quality, in the context of web content mining. Preprocessing converts the explicit form of web content into an implicit form; concretely, it converts text documents into a simple word format. Initially the text is processed as a normal string, and then the sequence of string terms is divided into a simple tokenized list of strings. These processes characterize the document by converting its content into a sequence of terms, such as words or phrases, and efficiently remove unwanted entities and words from that sequence. These preprocessing techniques improve learning accuracy [22] and model interpretability at reduced computational cost. Moreover, it is not necessary to extract all features from the original corpus vocabulary in the data model; for text mining tasks, it is extremely common to remove the basic function words alone. Term-filtering methods are used to improve the speed and memory consumption of text clustering.
After stop-word removal, stemming is used to reduce the number of unique terms by reducing words to their stem or root form. This is performed by a stemming algorithm [23][24] that converts different word forms into a similar canonical form; although the forms differ, for measuring similarity they should be considered the same. For example [25], after stemming, "connection" reduces to "connect" and "computing" to "compute", so related forms match even when they appear in two different sentences.
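As a concrete illustration, the tokenization, stop-word removal, and stemming steps described above can be sketched as follows. The stop-word list and suffix rules here are simplified stand-ins, not the full stop list of [1] or the Porter stemmer of [23]:

```python
import re

# A small illustrative stop-word list; real systems use much larger lists [1].
STOP_WORDS = {"on", "at", "the", "in", "a", "an", "and", "of", "to", "is"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def simple_stem(token):
    """A drastically simplified suffix stripper (not the full Porter
    algorithm): removes a few common suffixes so related word forms
    map to one canonical form."""
    for suffix in ("ing", "ion", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token

def preprocess(text):
    """Full pipeline: tokenize, remove stop words, then stem."""
    return [simple_stem(t) for t in remove_stop_words(tokenize(text))]

print(preprocess("The connection at the server is computing results"))
# → ['connect', 'server', 'comput', 'result']
```

Note that a crude stripper yields "comput" rather than "compute"; what matters for clustering is only that all forms of a word collapse to the same token.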
Storage size, number of terms, overfitting, non-linear hypothesis spaces, irrelevant features, document density, and dimensionality are characteristic features of text documents [26], and these features are present in online web content. Grammatical tagging (word-category disambiguation) marks up each word in a text (corpus); based on word-category disambiguation, related words are identified on the website, and the identified word categories are applied to the stop-word elimination process. Moreover, a feature is an item considered as an attribute of the sample text. Feature extraction is important for identifying which items will be featured for the document, as items can represent different concepts due to language ambiguity and may also be represented by pairs of words.
3.2. Hybrid Schemes Combining Ontology with Web Content Mining
Generally, in ontology-based web content mining, two categories of algorithms are convenient: the agglomerative (bottom-up) method and the divisive (top-down) method. Both methods are used to mine websites into various real outcomes, such as homogeneous objects and heterogeneous clusters, with respect to knowledge discovery and dimensionality reduction. Ontology-based web content mining forms a cluster as a set of objects under an agglomerative criterion based on a preprocessing similarity measure; put simply, the similarity between objects is the basis of ontology-based web content mining. The web content mining described in this work is the Hybrid Scheme combining ontology with web content mining. The goal of these schemes is to improve the performance of classification, clustering, and retrieval of websites. To carry out web content mining, the well-known frequency weighting schemes, semantic TF-IDF and the DFT weight measure, are used to mine content with respect to the "term matrix" and the "semantic matrix" respectively. The final results show that introducing ontology concepts into web content mining is promising and optimizes web content retrieval effectively.
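The term-frequency weighting mentioned above can be sketched with a plain TF-IDF computation. The paper does not fix an exact variant, so idf = log(N/df) is assumed here as one common choice:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.
    Uses raw term frequency and idf = log(N / df); other variants
    (smoothed idf, normalized tf) are equally common."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy tokenized "websites" (illustrative data, not the paper's corpus).
docs = [["hockey", "stats", "hockey"],
        ["skater", "stats"],
        ["hockey", "reference"]]
w = tf_idf(docs)
```

A term occurring in every document gets weight 0, while a rare, frequently repeated term is weighted highly, which is exactly what makes it a useful keyword for clustering.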
International Journal of Pure and Applied Mathematics Special Issue
2636
Figure 2. Combining Ontology with Web Content Mining
Web content mining works on the basis of different algorithms; in addition, the time attribute plays a significant role in the content mining process, because when a huge number of web pages undergo cluster formation, retrieval takes more time with respect to complexity.

Clearly, ontology helps establish what a specific term means with respect to the combined process of web content mining and feature selection. Ontologies provide a way to describe the meaning of the terms and relationships involved in web content mining. Before submission to the concept ontology, the web content is processed by the following procedure: with the help of the ontology collection, identify the concepts in each website based on the ontology model; the identified terms are then matched with concepts of synonyms, meronyms, and hypernyms in the ontology. The input data is then passed to the ontology techniques for web content mining, and from that, many frequent-item-clustering web content mining representations are constructed via multiple clustering results.
Figure 3. Algorithm for Ontology with Web Content Mining
Ontology-based web content mining is compared in performance with the existing traditional system. The web content mining described in this work is the Hybrid Scheme combining ontology and web content mining techniques. Many approaches are beginning to deal with such hybrid schemes, especially the term-frequency weight statistics of keywords. The goal of these weighting schemes is to improve the performance of classification and clustering for website retrieval from the manually created sports dataset. The final results show that introducing ontology concepts into web content mining is promising and improves efficiency.
IV. Hybrid Scheme for Text Mining Model
Appropriate web page retrieval is a very difficult process when the content mining process uses a traditional algorithm alone, which is not efficient in the presence of a large dataset. To reduce these conflicts, we use a two-level clustering process: the Hybrid Scheme combining ontology with web content mining techniques, and the Hybrid Scheme combining ontology with Frequent Item Clustering.

This study proposes a Hybrid Scheme for text mining by combining ontology with frequent item clustering. Ontology here refers to a set of objects, concepts, and other relationships, together known as ontology entities, describing information within an application domain.
Get input from the pre-processed data as a package count.

for pkg_counter = 0 … pkg_count − 1:
    construct OWL model(pkg(pkg_counter))
    for class_counter = 0 … class_count − 1:
        construct OWL model(class(class_counter))
        for param_counter = 0 … param_count − 1:
            for method_counter = 0 … method_count − 1:
                construct OWL model(method(method_counter))
            construct OWL model(param(param_counter))

Result: write the OWL model
Repeat until the web content mining reaches the accurate website.
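The nested counters can be paraphrased in code. The dictionary-based walk below is an illustrative stand-in for a real OWL model builder (an actual implementation would use an OWL library, which the paper does not name), and the package structure is flattened so each class simply records its methods and parameters:

```python
def build_model(packages):
    """Walk packages -> classes -> methods/params, mirroring the nested
    counters in the pseudocode, and record each entity as a tuple.
    A real builder would emit OWL classes/properties instead of tuples."""
    model = []
    for pkg in packages:
        model.append(("package", pkg["name"]))
        for cls in pkg.get("classes", []):
            model.append(("class", cls["name"]))
            for meth in cls.get("methods", []):
                model.append(("method", meth))
            for par in cls.get("params", []):
                model.append(("param", par))
    return model

# Hypothetical pre-processed input (names are illustrative).
pkgs = [{"name": "sports", "classes": [
    {"name": "Hockey", "methods": ["stats"], "params": ["season"]}]}]
m = build_model(pkgs)
```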
In this method, ontology helps in identifying similarity between sets of concepts, which helps in retrieving and filtering information automatically. For example, the keyword "volleyball" may occur three times; the data is mapped against the corresponding word-file format, and the term-frequency values are computed and mapped to the corresponding document id. This can be done using the query interface; Bayes learning is applied in the ontology distance measure, and each term is matched with a concept of the ontology using Bayes' rule:

P(w|c) = P(c|w) × P(w) / P(c)

where w is the word in the query or document and c is the concept in the ontology. This provides a way to identify which entity classes are most alike. Bayes learning thus drives the ontology and frequent item clustering.

In this research, the mining groups websites containing similar content together, based on the clusters formed. After preprocessing, web content mining techniques such as keyword-based search, TF-IDF-based search, weight comparison, and frequent item clustering are measured for determining mining efficiency; finally, the efficiently retrieved web pages are produced. The implementation results show that the web content mining techniques form different similarity metrics, which give different mining results for analysis, within a small execution-time interval.
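The Bayes' rule matching above can be sketched directly; the probability estimates below are illustrative numbers, not values from the paper:

```python
def bayes_concept_score(p_c_given_w, p_w, p_c):
    """Bayes' rule P(w|c) = P(c|w) * P(w) / P(c): the probability of
    word w given ontology concept c, from the reverse conditional and
    the priors. All inputs are probabilities estimated from the corpus."""
    return p_c_given_w * p_w / p_c

# Hypothetical corpus estimates for the word "volleyball" and an
# ontology concept such as "ball sport" (illustrative numbers).
p_concept_given_word = 0.9   # P(c|w): concept is likely when the word occurs
p_word = 0.02                # P(w): prior of the word in the corpus
p_concept = 0.1              # P(c): prior of the concept
score = bayes_concept_score(p_concept_given_word, p_word, p_concept)
# score ≈ 0.18
```

A term is assigned to whichever ontology concept maximizes this score.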
The two web content mining techniques are merged together into what is called the Hybrid Scheme combining ontology with web content mining techniques, which is first used to cluster the web pages; these clustered web pages are then passed to the Hybrid Scheme combining ontology with Frequent Item Clustering, which clusters the retrieved web pages efficiently. This multi-stage hybrid scheme consists of a sequential process. Here the ontology is very useful for building a knowledge repository in which concepts and terms are defined, along with the relationships between these concepts; the ontology automates information processing and can facilitate text mining in a domain. Thus an ontology is a vocabulary for metadata that machines can understand, which helps to automate information processing. However, ontology-based text content mining alone considers only similar terms without considering the meaning of the terms: any dissimilar data would be treated as different documents unless some semantic meaning is added. If we add semantic meaning, then both documents can be related to each other under the category of a particular domain.
The following derivation is used to perform text-term (T) based processing for efficient clustering. Consider a dataset D containing nD documents, represented as

D = {Dj ; j = 1 ... nD}

and its terms, represented as

T = {ti ; i = 1 ... nT}

The input is given to the clusterer without the irrelevant terms of the websites. A combination of a distribution-based measure and behavioral characteristics, calculated via a distance measure, is used as the similarity calculation, which improves the overall clustering performance.
To calculate the document distance measure, the vector V* is produced. V*(D) represents the document description and is a set of terms where each term t is associated with its normalized term frequency tf. Each element of the vector V*(Di) is calculated as

v*k(Di) = tf k,i / Σj tf j,i

where tf k,i is the frequency of the k-th term in text Di. Thus V* represents a text as a vector, using term frequencies to set the weight associated with each element; the index i indicates the term occurring in document Di. The distance Δ(f) between a pair of input texts (Di, Dj) is calculated as

Δ(f)(Di, Dj) = Σk | v*k(Di) − v*k(Dj) |
In the proposed model, p is a phrase in the web content mining, and Δ(f) therefore actually implements the Manhattan distance metric. The second vector V** takes into consideration the structural properties of a text and is represented as a set of probability distributions associated with the term vector. Here, each term t ∈ T occurring in a dataset D is associated with a distribution function that gives the spatial probability density function (pdf) of t in D. Such a distribution, pt(s), is generated under the hypothesis that, on detecting the k-th occurrence of a term t at the normalized position sk ∈ [0, 1] in the text, the spatial pdf of the term can be approximated by a Gaussian distribution centered around sk. In other words, if the term tj is found at position sk within a document, a second dataset with similar structure is expected to include the same term at the same position, or in a neighborhood thereof, with a probability defined by a Gaussian pdf. To derive a formal expression of the pdf, consider the occurrences of the term t in the i-th text Di; after simplification, the spatial pdf is defined as

pt(s) = (1/A) Σk G(s − sk, σ)

where A and σ are normalization terms, and G is the Gaussian pdf given by

G(x, σ) = (1 / (σ √(2π))) exp(−x² / (2σ²))
From this, the second term vector V** is calculated by considering a discrete approximation of the pdf. Here, the dataset D is segmented evenly into S sections, from which S-dimensional vectors are generated for each term t ∈ T. Each element estimates the probability of the term t occurring in the corresponding section of the text; thus V**(D) is represented as an array of nT vectors of dimension S. The distance Δ(b) between the probability vectors thus created is calculated using the Euclidean metric for two documents Di and Dj:

Δ(b)(Di, Dj) = √( Σt Σs [ v**t,s(Di) − v**t,s(Dj) ]² )
From the calculated Δ(f) and Δ(b), the final distance is calculated as

Δ(Di, Dj) = α · Δ(f)(Di, Dj) + (1 − α) · Δ(b)(Di, Dj)

where α ∈ [0, 1] is the mixing-coefficient weight. The last stage is the actual clustering.
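A minimal sketch of the combined distance Δ = α·Δ(f) + (1 − α)·Δ(b), assuming the normalized term-frequency vectors and the per-term spatial probability rows have already been built; α and the toy vectors below are illustrative:

```python
import math

def manhattan_tf_distance(v1, v2):
    """Δ(f): Manhattan distance between normalized term-frequency vectors."""
    return sum(abs(a - b) for a, b in zip(v1, v2))

def euclidean_pdf_distance(m1, m2):
    """Δ(b): Euclidean distance between the per-term spatial
    probability vectors (one row of S sections per term)."""
    return math.sqrt(sum((a - b) ** 2
                         for row1, row2 in zip(m1, m2)
                         for a, b in zip(row1, row2)))

def combined_distance(v1, v2, m1, m2, alpha=0.5):
    """Δ = α·Δ(f) + (1 − α)·Δ(b), with α in [0, 1]."""
    return alpha * manhattan_tf_distance(v1, v2) + \
           (1 - alpha) * euclidean_pdf_distance(m1, m2)

v1, v2 = [0.5, 0.5], [1.0, 0.0]     # normalized tf vectors for Di, Dj
m1 = [[1.0, 0.0], [0.0, 1.0]]       # spatial pdf rows per term (S = 2)
m2 = [[1.0, 0.0], [0.0, 1.0]]
d = combined_distance(v1, v2, m1, m2, alpha=0.5)
# Δ(f) = 1.0 and Δ(b) = 0.0 here, so d = 0.5
```

Setting α = 1 ignores document structure entirely; α = 0 ignores term frequencies; intermediate values blend the two views.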
To find the relations between individual terms and the objects of sentences, even in complex sentence structures, the ontology-combined frequent-itemset clustering approach facilitates the implementation of knowledge about the hierarchical structure of categories. The proposed web content mining combines the concept clustering algorithms. The following steps are used in the hybrid scheme:

Step 1: Select an appropriate group of keywords, based on the proper nouns and verbs in the corpus.

Step 2: Depending upon the weights of the input text cluster, the retrieved web content is arranged for each website; this is the ontology-based weighting scheme.

Step 3: Finally, the ontology tree structure is formed by explaining and formulating it in a particular domain; this is useful for interoperability between the ontologies.

These steps identify irrelevant or redundant features so as to achieve accurate text clustering.
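The frequent-item side of the hybrid can be illustrated with a minimal frequent-itemset pass, a simplified brute-force version of the Apriori idea rather than the paper's exact implementation:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return all itemsets of up to max_size items that appear in at
    least min_support transactions. Brute-force over candidate
    combinations; real Apriori additionally prunes candidates level by
    level using the downward-closure property."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for size in range(1, max_size + 1):
        for cand in combinations(items, size):
            support = sum(1 for t in transactions if set(cand) <= set(t))
            if support >= min_support:
                result[cand] = support
    return result

# Toy keyword "transactions", one set per web page (illustrative data).
pages = [{"hockey", "stats"}, {"hockey", "skater"},
         {"hockey", "stats", "women"}, {"skater", "extra"}]
fi = frequent_itemsets(pages, min_support=2)
```

Pages sharing a frequent itemset such as {hockey, stats} are natural candidates for the same cluster, which is how frequent itemsets feed the clustering stage.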
V. Example Illustration
The following example illustrates the dependency graph and the ontology. It takes input from three different texts:

Text 1: extra, skater
Text 2: hockey, reference
Text 3: women, hockey, stats

The three texts above are taken as the elements for the formation of the ontology; moreover, depending upon the nouns and verbs, the selected words are underlined in the following set of websites:

Web site 1: http://www.extraskater.com/
Web site 2: https://www.hockey-reference.com/
Web site 3: http://womenshockeystats.com/?nr=0

To validate the ontology, a small set of words is selected from the total words, based on the nouns and verbs. These words are used to build the ontology tree; elements other than these tree elements are unknown, because they have no entry in the lexicon, and so they are excluded from the web content mining. To resolve this, we use an ontology specification of conceptualizations.
This forms the following dependency graph.

Figure 4. Resulting hierarchy after processing the site URLs
From the above dependency graph, initially 300 websites are taken from the manually created sports data corpus. The calculated dependency graph is used to find the reduction for each text set in the data corpus; finally, the reduction rate increases for all inputs. This result shows that the process of constructing the dependency graph was able to solve the sparsity problem (which impacts processing time) by reducing the number of features in the dependency graph based on the ontology.
VI. Empirical Evaluation
6.1. Datasets and Setup
The sports dataset is a collection of real-world sports stories in the English language under different categories. It contains many entries, each moderately sized and consisting of different websites. The citation and details are available for each entry, including date, topics, title, and the content part.
6.2. Experimental Results
The performance measures of precision, recall, and F-measure are calculated for the web content mining scenario, using a Visual Basic front end. The input data for web content mining comprise the raw data and the cleaned-data section.

Figure 5. Web content mining raw data and the cleaned data

Moreover, a MySQL connector is used to set up and save the ontology history and the retrieval process. The performance measures are calculated using the standard equations.
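The standard definitions behind those measures, computed over a retrieved set and a relevant set, can be sketched as follows (the example document ids are illustrative):

```python
def precision_recall_f1(retrieved, relevant):
    """Standard IR measures: precision = |retrieved ∩ relevant| / |retrieved|,
    recall = |retrieved ∩ relevant| / |relevant|, and the F-measure is the
    harmonic mean of the two."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
# p = 2/4 = 0.5, r = 2/3, f = 4/7
```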
Figure 6. The ontology history and retrieval process

The graph above shows the sports dataset elements, which are used in this research to make text mining and website classification effective and efficient. The following graph shows the precision and recall of the extracted rules and their measures.
Figure 7. Precision and recall comparison

The graph in Figure 7 gives the precision and recall individually, depending upon the dataset corpus, and is useful for analyzing the F-measure [5].
Figure 8. Precision, recall, and F-measure

Figure 8 shows the various performance measures of the web content mining, depending upon the dataset corpus, for the hybrid schemes combining ontology and frequent item clustering, and compares precision, recall, and F-measure.

Constructing the web content retrieval framework based on the ontology can reduce the memory size with respect to frequent item clustering. Moreover, the semantic problems are also addressed by extracting only the meaningful sentences. Figure 4 shows the dependency relations between the site URLs and displays the whole-sentence meaning through the selected verbs and nouns in the dependency graph. The final result deals with separate sentences, labeled with their grammatical function, as well as mono-meaningful words. In contrast to phrase-structure grammar, therefore, dependency grammars can be used to directly express grammatical functions as a type of dependency graph, with the aim of improving the text mining results.
VII. Conclusion
In the web content mining world, the ultimate goal is to help users find and list the websites they need quickly and accurately, based on a keyword. A very wide range of information is available on the World Wide Web, and extracting results from it is handled by a web crawler, which has been used successfully in applications involving web pages, multi-threaded downloading, URL scheduling, text and metadata storage, web data and more. The application of web content mining techniques to text documents is known here as intelligent web content analysis. Particularly in the fields of Internet accessibility and web content mining, the scale and usage of crawlers built on web content mining are increasing. Keyword-based and URL-based links are extracted from websites, and the crawl moves forward from them.
In the web content mining approach, the similarity among the source ontologies is calculated in order to define the order in which they will be merged. Knowledge is discovered from websites through effective methods such as discovering relevant web pages and keyword- and URL-based link extraction. In this research work, the web content mining approach and the discovered patterns are used to find accurate, relevant features on the World Wide Web. Specifications for web content mining are derived from keywords, attributes, synonyms and homonyms, with respect to both plain web content mining and clustering-based mining.
In this way, the research is efficient in terms of web content mining: it handles duplicate occurrences, knowledge discovery, dynamic web page content, automatic topic extraction and fast information retrieval. Moreover, all of these processes are made easier and more efficient, and the results show that combining ontology with web content mining is promising and improves the web content mining process. Finally, the proposed technique improves mining accuracy, complexity and time-space requirements, raising the quality of the content mining results in the extraction process.
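The keyword- and URL-based link extraction described above can be sketched with the Python standard library. This is an illustrative assumption about the crawler's inner loop, not the authors' implementation: anchors are collected from a page, and only those whose URL or anchor text contains the query keyword are kept for the next crawl step.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (url, anchor_text) pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self._href = [], None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text immediately inside an open <a> tag is its anchor text.
        if self._href:
            self.links.append((self._href, data.strip()))
            self._href = None

def keyword_links(html, keyword):
    """Return URLs whose link or anchor text mentions the keyword."""
    parser = LinkExtractor()
    parser.feed(html)
    return [url for url, text in parser.links
            if keyword in url.lower() or keyword in text.lower()]

# Hypothetical page fragment.
page = '<a href="/cricket/scores">Cricket scores</a><a href="/about">About</a>'
print(keyword_links(page, "cricket"))  # ['/cricket/scores']
```

A real crawler would add the surviving URLs to the URL scheduler's queue and store the page text and metadata, as described above.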
References
1. Christopher Fox, "A stop list for general text", ACM SIGIR Forum, volume 24, pages 19-21, ACM, 1989.
2. Paice Chris D., "An evaluation method for stemming algorithms", Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 42-50, 1994.
3. P. Bradley, U. Fayyad, and C. Reina, “Scaling clustering algorithms to large databases, ”
In Proc. of KDD-1998, New York, NY, USA, August 1998, pages 9–15, Menlo Park,
CA, USA, 1998. AAAI Press.
4. Berry Michael W., (2004), “Automatic Discovery of Similar Words”, in “Survey of Text
Mining: Clustering, Classification and Retrieval”, Springer Verlag, New York, LLC, 24-
43.
5. Zhilong Zhen, Haijuan Wang, Lixin Han, Zhan Shi, "Categorical Document Frequency Based Feature Selection for Text Categorization", IEEE International Conference of Information Technology, Computer Engineering and Management Sciences, IEEE Computer Society, 2011.
6. Gui-xian Xu, Li-rong Qiu, Lu Yang, “Tibetan Text Clustering Based on Machine
Learning”, Journal of Digital Information Management, Volume 12, Number 3,June
2014.
7. P Jaganathan, S Jaiganesh, “An improved K-means algorithm combined with Particle
Swarm Optimization approach for efficient web document clustering”, IEEE
International Conference on Green Computing, Communication and Conservation of
Energy (ICGCE), pp 772-776, 2013.
8. Badal, Shruti Tripathi, “Frequent Data Itemset Mining Using VS_Apriori Algorithms”,
International Journal on Computer Science and Engineering Vol. 02, No. 04, 1111-1118,
2010
9. O.Jamsheela, Raju.G, “Frequent Itemset Mining Algorithms: A Literature Survey”, IEEE
International Advance Computing Conference (IACC), pp: 1099 – 1104, 2015.
10. Ferenc Kovács, János Illés, "Frequent Itemset Mining on Hadoop", IEEE 9th International Conference on Computational Cybernetics, pp. 241-245, July 2013.
11. Twinkle S vadasa, Jasmine Jhab, “A Literature Survey on Text Document Clustering and
Ontology based techniques”, International Journal of Innovative and Emerging Research
in Engineering Volume 1, Issue 2, 2014.
12. Tingting Wei, Yonghe Lu, Huiyou Chang, Qiang Zhou, Xianyu Bao, "A semantic approach for text clustering using WordNet and lexical chains", Expert Systems with Applications, Elsevier.
13. U. Allimuthu, "BAU FAM: Biometric-Blacklisting Anonymous Users using Fictitious and Adroit Manager", Journal of Advanced Research in Dynamical and Control Systems, (12-Special), pp. 722-737, 2017.
14. Ragunath, Sivaranjani, “Ontology Based Text Document Summarization System Using
Concept Terms”, ARPN Journal of Engineering and Applied Sciences, VOL. 10, NO. 6,
APRIL 2015
15. Anbarasi, Iswarya, Sindhuja, Yogabindiya, “ONTOLOGY ORIENTED CONCEPT
BASED CLUSTERING”, IJRET: International Journal of Research in Engineering and
Technology, Volume: 03 Issue: 02, Feb-2014.
16. J. Paralic and I. Kostial, "Ontology-based Information Retrieval", Proc. of the 14th Int. Conf. on Information and Intelligent Systems (IIS 2003), Varazdin, Croatia, pp. 23-28, 2003.
17. Lei Zhang, Zhichao Wang, "Ontology-based Clustering Algorithm with Feature Weights", pp. 2959-2966, 2010.
18. Bloehdorn, Cimiano, Hotho, Staab, “An Ontology-based Framework for Text Mining”,
July 2004.
19. Hmway Hmway Tar, Thi Thi Soe Nyunt, “Ontology-Based Concept Weighting for Text
Documents”, International Conference on Information Communication and Management
IPCSIT vol.16, IACSIT Press, Singapore, 2011
20. L. Khan and F. Luo, "Ontology construction for information selection", Proceedings of the 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2002), pp. 122-127, 2002.
21. V. Srividhya, R. Anitha, "Evaluating Preprocessing Techniques in Text Categorization", International Journal of Computer Science and Application, 2010.
22. Sirsat S. R., Vinay Chavan, Mahalle H. S., "Strength and Accuracy Analysis of Affix Removal Stemming Algorithms", International Journal of Computer Science and Information Technologies, Vol. 4 (2), pp. 265-269, 2013.
23. M. Thangarasu, R. Manavalan, "A Literature Review: Stemming Algorithms for Indian Languages", International Journal of Computer Trends and Technology (IJCTT), volume 4, Issue 8, pp. 2582-2584, August 2013, ISSN: 2231-2803.
24. Julie Beth Lovins “Development of a Stemming Algorithm”, Mechanical Translation and
Computational Linguistics, vol.11, nos.1 and 2, March and June 1968.
25. Levent Özgür and Tunga Güngör, "Analysis of Stemming Alternatives and Dependency Pattern Support in Text Classification", Advances in Computational Linguistics, Research in Computing Science, pp. 195-206, 2009.
26. Boulis, C. and Ostendorf, M. (2005). Text classification by augmenting the bag-of-words
representation with redundancy compensated bigrams. In Proceedings of the International
Workshop on Feature Selection in Data Mining, in conjunction with SIAM SDM-, pages
9-16, 2005