TEXT: Automatic Template Extraction From Heterogeneous Web Pages

  • TEXT: Automatic Template Extraction from Heterogeneous Web Pages

    Chulyun Kim and Kyuseok Shim, Member, IEEE

    Abstract—The World Wide Web is the most useful source of information. In order to achieve high productivity of publishing, the web pages in many websites are automatically populated by using common templates with contents. For readers, the templates provide easy access to the contents guided by consistent structures. However, for machines, the templates are considered harmful since they degrade the accuracy and performance of web applications due to the irrelevant terms in templates. Thus, template detection techniques have received a lot of attention recently to improve the performance of search engines, clustering, and classification of web documents. In this paper, we present novel algorithms for extracting templates from a large number of web documents which are generated from heterogeneous templates. We cluster the web documents based on the similarity of the underlying template structures so that the template for each cluster is extracted simultaneously. We develop a novel goodness measure, with its fast approximation, for clustering and provide a comprehensive analysis of our algorithm. Our experimental results with real-life data sets confirm the effectiveness and robustness of our algorithm compared to the state of the art in template detection.

    Index Terms—Template extraction, clustering, minimum description length principle, MinHash.

    1 INTRODUCTION

    The World Wide Web (WWW) is widely used to publish and access information on the Internet. In order to achieve high productivity of publishing, the web pages in many websites are automatically populated by using common templates with contents. For human beings, the templates provide readers easy access to the contents guided by consistent structures even though the templates are not explicitly announced. However, for machines, the unknown templates are considered harmful because they degrade the accuracy and performance due to the irrelevant terms in templates. Thus, template detection and extraction techniques have received a lot of attention recently to improve the performance of web applications, such as data integration, search engines, classification of web documents, and so on [3], [4], [12], [14], [15], [23]. For example, biogene data are published on the Internet by many organizations with different formats, and scientists want to integrate these data into a unified database. For price comparison services, price information is gathered from various Internet marketplaces. Good template extraction technologies can significantly improve the performance of these applications.

    The problem of extracting a template from web documents conforming to a common template has been studied in [3], [10], [22]. Due to the assumption that all documents are generated from a single common template, solutions for this problem are applicable only when all documents are guaranteed to conform to a common template. However, in real applications, it is not trivial to classify massively crawled documents into homogeneous partitions in order to use these techniques. Since subtle changes in scripts or CGI parameters may result in a significant difference [12], [26], we cannot simply group the web documents by URL and apply these methods to each group separately. The two pages given in Fig. 1 look clearly different. However, their URLs are identical except for the value of a layout parameter. If we use only URLs to group pages, these pages from different templates will be included in the same cluster.

    To overcome the limitation of the techniques that assume the web documents are from a single template, the problem of extracting the templates from a collection of heterogeneous web documents, which are generated from multiple templates, was also studied. In this problem, a clustering of web documents such that the documents in the same group belong to the same template is required, and thus, the correctness of the extracted templates depends on the quality of clustering.

    Since an HTML document can be naturally represented with a Document Object Model (DOM) tree, web documents are considered as trees and many existing similarity measures for trees have been investigated for clustering [12], [22], [26]. However, clustering is very expensive with tree-related distance measures. For instance, tree-edit distance has at least O(n1·n2) time complexity [12], [22], where n1 and n2 are the sizes of two DOM trees, and the sizes of the trees are usually more than a thousand. Thus, clustering on sampled web documents is used to practically handle a large number of web documents.

    Reis et al. [12] presented a method in which a small number of sampled documents are clustered first, and then the other documents are classified to the closest clusters. In both clustering and classifying, a restricted tree-edit distance is used to measure the similarity between documents.

    612 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 4, APRIL 2011

    . The authors are with the School of Electrical Engineering and Computer Science, Seoul National University, Gwanak-gu, 151-744 Seoul, Korea. E-mail: {cykim, shim}@kdd.snu.ac.kr.

    Manuscript received 20 July 2009; revised 4 Nov. 2009; accepted 30 Dec. 2009; published online 19 Aug. 2010. Recommended for acceptance by J. Yang. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2009-07-0562. Digital Object Identifier no. 10.1109/TKDE.2010.140.

    1041-4347/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.

  • However, it is not easy to select proper training data of small size, since we do not have any knowledge about the given data in advance. Moreover, it is hard to decide how many clusters are to be generated from the web documents. In [12], they empirically suggested an 80 percent similarity threshold within a cluster, but it does not work all the time. In [26], it is assumed that labeled training data are given for clustering. However, this assumption is not valid either in many cases.

    In this paper, in order to alleviate the limitations of the state-of-the-art technologies, we investigate the problem of detecting the templates from heterogeneous web documents and present novel algorithms called TEXT (auTomatic tEmplate eXTraction). We propose to represent a web document and a template as a set of paths in a DOM tree. As validated by the most popular XML query language XPATH [2], paths are sufficient to express tree structures and useful to be queried. By considering only paths, the overhead of measuring the similarity between documents becomes small without significant loss of information. For example, let us consider the simple HTML documents and paths in Fig. 2 and Table 1. We will formally define the paths later. Document d1 is represented as a set of paths {p1, p2, p3, p4, p6} and the template of both d1 and d2 is another set of paths {p1, p2, p3, p4}.

    Our goal is to manage an unknown number of templates and to improve the efficiency and scalability of template detection and extraction algorithms. To deal with the unknown number of templates and select a good partitioning from all possible partitions of web documents, we employ Rissanen's Minimum Description Length (MDL) principle [20], [21]. Intuitively, each candidate partitioning (i.e., clustering) is ranked according to the number of bits required to describe a clustering model, and the partitioning with the minimum number of bits is selected as the best one. In our problem, after clustering documents based on the MDL principle, the model of each cluster is the template itself of the web documents belonging to the cluster. Thus, we do not need an additional template extraction process after clustering.

    In order to improve efficiency and scalability to handle a large number of web documents for clustering, we extend MinHash [7]. While the traditional MinHash is used to estimate the Jaccard coefficient between sets, we propose an extended MinHash to estimate our MDL cost measure with partial information of documents. Moreover, our proposed algorithms are fully automated and robust without requiring many parameters. Experimental results with real-life data sets confirm the effectiveness of our algorithms.
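    The traditional MinHash mentioned above can be sketched in a few lines. The following is a minimal illustration of how it estimates the Jaccard coefficient between two path sets (plain Python; the keyed-hash construction and helper names are illustrative and are not the paper's extended version that estimates the MDL cost):

```python
import hashlib

def keyed_hash(item, seed):
    # A keyed hash standing in for one random permutation of the universe.
    digest = hashlib.blake2b(item.encode(), key=seed, digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(path_set, seeds):
    # One minimum per hash function; equal minima signal shared elements.
    return [min(keyed_hash(p, s) for p in path_set) for s in seeds]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of agreeing minima estimates |A ∩ B| / |A ∪ B|.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

seeds = [i.to_bytes(4, "big") for i in range(256)]
a = {f"p{i}" for i in range(100)}
b = {f"p{i}" for i in range(50, 150)}   # true Jaccard = 50/150 ≈ 0.33
est = estimated_jaccard(minhash_signature(a, seeds), minhash_signature(b, seeds))
```

    With 256 hash functions the estimate typically lands within a few percentage points of the true coefficient; the paper's contribution is replacing the Jaccard target with the MDL cost measure while keeping this signature-based machinery.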

    In summary, our contributions are as follows:

    . We apply the MDL principle to our problem to effectively manage an unknown number of clusters (i.e., an unknown number of templates).

    . In our method, document clustering and template extraction are done together at once. The MDL cost is the number of bits required to describe data with a model, and the model in our problem is the description of clusters represented by templates.

    . Since a large number of web documents are massively crawled from the web, the scalability of template extraction algorithms is very important for practical use. Thus, we extend the MinHash technique to estimate the MDL cost quickly, so that a large number of documents can be processed.

    . Experimental results with real-life data sets of up to 15 GB confirmed the effectiveness and scalability of our algorithms. Our solution is much faster than related work and shows significantly better accuracy.

    2 RELATED WORK

    The template extraction problem can be categorized into two broad areas. The first area is site-level template detection, where the template is decided based on several pages from the same site. Crescenzi et al. [10] initially studied the data extraction problem and Yossef and Rajagopalan [4] introduced the template detection problem. Previously, only tags were considered to find templates, but Arasu and Garcia-Molina [3] observed that any word can be a part of the template or contents. We also adopt this observation and consider every word equally in our solution. However, they detect elements of a template by the frequencies of words, while we consider the MDL principle as well as the frequencies to decide templates from heterogeneous documents. Vieira et al. [22] suggested an algorithm considering documents as trees, but the operations on trees are usually too expensive to be applied to a large number of documents. Zhao et al. [24], [25] concentrated on the problem of extracting result records from search engines. For XML documents, Garofalakis et al. [14]

    KIM AND SHIM: TEXT: AUTOMATIC TEMPLATE EXTRACTION FROM HETEROGENEOUS WEB PAGES 613

    Fig. 2. Simple web documents. (a) Document d1. (b) Document d2. (c) Document d3. (d) Document d4.

    TABLE 1. Paths of Tokens and Their Supports.

    Fig. 1. Different templates of the same URL.

  • solved the problem of DTD extraction from multiple XML documents. While HTML documents are semistructured, XML documents are well structured, and all the tags are always a part of a template. The solutions for XML documents fully utilize these properties. In the problem of template extraction from heterogeneous documents, how to partition the given documents into homogeneous subsets is important. Reis et al. [12] used a restricted tree-edit distance to cluster documents and, in [26], it is assumed that labeled training data are given for clustering. However, the tree-edit distance is expensive and it is not easy to select good training pages. Crescenzi et al. [11] focused on document clustering without template extraction. They targeted a site consisting of multiple templates. From a seed page, web pages are crawled by following internal links and the pages are compared by only their link information. However, if web pages are collected without considering their method, pages from various sites are mixed in the collection and their algorithm must be executed repeatedly for each site. Since the pages crawled from a site can differ depending on the objective of each crawler, their algorithm may require additional crawling on the fly.

    The other area is page-level template detection, where the template is computed within a single document. Lerman et al. [16] proposed systems to identify data records in a document and extract data items from them. Zhai and Liu [23] proposed an algorithm to extract a template using not only structural information but also visual layout information. Chakrabarti et al. [6] solved this problem by using an isotonic smoothing score assigned by a classifier. Since the problem formulation of this area is far from ours, we do not discuss it in detail.

    Our algorithms, to be presented later, represent web documents as a matrix and find clusters with the matrix. Biclustering or coclustering is another clustering technique to deal with a matrix [13], [17], [18]. Coclustering algorithms find a simultaneous clustering of the rows and columns of a matrix and require the numbers of clusters of columns and rows as input parameters. However, we cluster only documents, not paths, and moreover, the numbers of clusters of columns and rows are unknown.

    3 PRELIMINARIES

    3.1 HTML Documents and Document Object Model

    The DOM defines a standard for accessing documents, like HTML and XML [1]. The DOM presents an HTML document as a tree structure. The entire document is a document node, every HTML element is an element node, the texts in the HTML elements are text nodes, every HTML attribute is an attribute node, and comments are comment nodes. However, we do not distinguish the type of nodes since, as defined in [3], any type of node can be a part of a template in our problem. For instance, the DOM tree of the simple HTML document d2 in Fig. 2b is given in Fig. 3. For a node in a DOM tree, we denote the path of the node by listing the nodes from the root to the node, using "\" as a delimiter between nodes. For example, in the DOM tree of d2 in Fig. 3, the path of the node "World" is Document\<html>\<body>\<h1>\World.
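    The path notation above can be made concrete with a small sketch using Python's standard html.parser (a simplification: real DOM construction also handles attributes, comments, and void tags, which this toy collector ignores):

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Collects root-to-node paths, using '\\' as the delimiter as in the text."""

    def __init__(self):
        super().__init__()
        self.stack = ["Document"]   # the virtual document node
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(f"<{tag}>")
        self.paths.add("\\".join(self.stack))

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:                    # text nodes are nodes too
            self.paths.add("\\".join(self.stack + [text]))

collector = PathCollector()
collector.feed("<html><body><h1>World</h1></body></html>")
# The path of the text node "World" is Document\<html>\<body>\<h1>\World.
```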

    3.2 Essential Paths and Templates

    Given a web document collection D = {d1, d2, ..., dn}, we define a path set P_D as the set of all paths in D. Note that, since the document node is a virtual node shared by every document, we do not consider the path of the document node in P_D. The support of a path is defined as the number of documents in D which contain the path. For each document di, we provide a minimum support threshold t_di. Notice that the thresholds t_di and t_dj of two distinct documents di and dj may be different. If a path is contained in a document di and the support of the path is at least the given minimum support threshold t_di, the path is called an essential path of di. We denote the set of essential paths of an HTML document di by E(di). For a web document set D with its path set P_D, we use a |P_D| × |D| matrix ME with 0/1 values to represent the documents with their essential paths. The value at a cell (i, j) in the matrix ME is 1 if a path pi is an essential path of a document dj. Otherwise, it is 0.

    Example 1. Consider the HTML documents D = {d1, d2, d3, d4} in Fig. 2. All the paths and their frequencies in D are shown in Table 1. Assume that the minimum support thresholds t_d1, t_d2, t_d3, and t_d4 are 3, 3, 3, and 4, respectively. The essential path sets are E(d1) = {p1, p2, p3, p4}, E(d2) = {p1, p2, p3, p4, p5}, E(d3) = {p1, p2, p3, p4, p5}, and E(d4) = {p1, p2}. We have the path set P_D = {pi | 1 ≤ i ≤ 8} and the matrix ME becomes as follows:

              d1 d2 d3 d4
        p1  [  1  1  1  1 ]
        p2  [  1  1  1  1 ]
        p3  [  1  1  1  0 ]
        p4  [  1  1  1  0 ]
        p5  [  0  1  1  0 ]
        p6  [  0  0  0  0 ]
        p7  [  0  0  0  0 ]
        p8  [  0  0  0  0 ]

    We next discuss how to determine the proper minimum support threshold of each document. The goal of introducing essential paths is to prune away, in advance, the paths which cannot be a part of any template. It is a kind of preprocessing to improve the correctness of clustering. Using the same threshold for all pages is not reasonable because the number of documents generated by each template is not the same. Thus, we may need to use a different threshold for each page.

    The template of a document cluster is a set of paths which commonly appear in the documents of the cluster. If a path is contained in most pages of the cluster, we can assume that the occurrence of the path is probably not by chance, and thus, the path should be considered as a part of the template. Contents are the paths which are not members of the template. If a document is generated by a template, the document contains two types of paths: the paths belonging to the template and the paths belonging to the contents. To separate the paths in contents from the paths in the template, we assume that 1) the support of a path in a template is generally higher than that of a path in contents and 2) the number of paths belonging to the template is generally greater than the number of paths belonging to the contents. For the first assumption, the paths in a template are shared by the documents generated by the template, but those in contents are usually unique in each document. Thus, the support of the former is higher than that of the latter. For the second assumption, the paths from the template are typically dominant in a document. For example, the snippet given in Fig. 4 consists of 42 nodes including attributes in the DOM tree. Among them, only nine nodes are contents and the others are from the template.

    Fig. 3. DOM Tree of d2 in Fig. 2.

    Based on our assumptions, we found empirically that the mode of the support values (i.e., the most frequent support value) of the paths in each document is very effective to make templates survive while contents are eliminated. Therefore, in this paper, we use the mode of the support values of the paths in each document as the minimum support threshold for that document. If there are several modes of the support values, we take the smallest mode.

    Example 2. In Fig. 2 and Table 1, the paths appearing in the document d2 are p1, p2, p3, p4, p5, and p7, whose supports are 4, 4, 3, 3, 3, and 1, respectively. Since 3 is their mode, we use 3 as the minimum support threshold value t_d2. Then, p1, p2, p3, p4, and p5 are the essential paths of d2.
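    The threshold rule of Examples 1 and 2 can be sketched as follows. Table 1 itself is not legible in this transcript, so the per-document path sets below are an assumption chosen to reproduce the supports and thresholds quoted in the two examples:

```python
from collections import Counter
from statistics import multimode

def essential_paths(docs):
    """docs: {name: set of paths}. Returns {name: set of essential paths},
    using the smallest mode of each document's path supports as its
    minimum support threshold, as described above."""
    support = Counter(p for paths in docs.values() for p in paths)
    result = {}
    for name, paths in docs.items():
        threshold = min(multimode([support[p] for p in paths]))
        result[name] = {p for p in paths if support[p] >= threshold}
    return result

# Hypothetical reconstruction of the documents of Fig. 2 / Table 1.
docs = {
    "d1": {"p1", "p2", "p3", "p4", "p6"},
    "d2": {"p1", "p2", "p3", "p4", "p5", "p7"},
    "d3": {"p1", "p2", "p3", "p4", "p5", "p8"},
    "d4": {"p1", "p2", "p5"},
}
E = essential_paths(docs)
# For d2 the supports are 4, 4, 3, 3, 3, 1, whose mode is 3, so E["d2"]
# keeps p1..p5, matching Example 2.
```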

    3.3 Matrix Representation of Clustering

    We next illustrate the representation of a clustering of web documents. Let us assume that we have m clusters C = {c1, c2, ..., cm} for a web document set D. A cluster ci is denoted by a pair (Ti, Di), where Ti is a set of paths representing the template of ci and Di is a set of documents belonging to ci. In our clustering model, we allow a document to be included in a single cluster only. That is, we have Di ∩ Dj = ∅ for all distinct clusters ci and cj, and ∪_{1≤i≤m} Di = D. In addition, we define Ei for a cluster ci as Ei = ∪_{dk ∈ Di} E(dk).

    To represent a clustering C = {c1, c2, ..., cm} for D, we use a pair of matrices MT and MD, where MT represents each cluster with its template paths and MD represents each cluster with its member documents. If the value at a cell (i, j) in MT is 1, it means that a path pi is a template path of a cluster cj (i.e., pi ∈ Tj); otherwise, pi does not belong to the template paths of cj (i.e., pi ∉ Tj). Similarly, the value at a cell (i, j) in MD is 1 if a document dj belongs to a cluster ci (i.e., dj ∈ Di). Regardless of the number of clusters, we fix the dimension of MT as |P_D| × |D| and that of MD as |D| × |D|. Columns and rows in MT and MD exceeding the number of clusters are filled with zeros. In other words, for a clustering C = {c1, c2, ..., cm}, all values from the (m+1)th to the |D|th columns in MT are zeros, and all values from the (m+1)th to the |D|th rows in MD are zeros.

    We will represent ME by the product of MT and MD. However, the product of MT and MD does not always equal ME. Thus, we reconstruct ME by adding a difference matrix MΔ with 0/1/−1 values to MT·MD, i.e., ME = MT·MD + MΔ.

    Example 3. Consider the web documents in Fig. 2 and ME in Example 1 again. Assume that we have a clustering C = {c1, c2}, where c1 = ({p1, p2, p3, p4, p5}, {d1, d2, d3}) and c2 = ({p1, p2}, {d4}). Then, MT, MD, and MΔ are as follows, and we can see that ME = MT·MD + MΔ:

    MT =
      [ 1 1 0 0 ]
      [ 1 1 0 0 ]
      [ 1 0 0 0 ]
      [ 1 0 0 0 ]
      [ 1 0 0 0 ]
      [ 0 0 0 0 ]
      [ 0 0 0 0 ]
      [ 0 0 0 0 ]

    MD =
      [ 1 1 1 0 ]
      [ 0 0 0 1 ]
      [ 0 0 0 0 ]
      [ 0 0 0 0 ]

    MΔ =
      [  0 0 0 0 ]
      [  0 0 0 0 ]
      [  0 0 0 0 ]
      [  0 0 0 0 ]
      [ -1 0 0 0 ]
      [  0 0 0 0 ]
      [  0 0 0 0 ]
      [  0 0 0 0 ]
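    The decomposition ME = MT·MD + MΔ of Example 3 can be checked mechanically (a small sketch in plain Python, with integer matrices as nested lists):

```python
def matmul(A, B):
    # Plain integer matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matadd(A, B):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(A, B)]

# Matrices of Example 3 (rows p1..p8; columns d1..d4, resp. clusters c1..c4).
M_T = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0],
       [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
M_D = [[1, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
M_delta = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0],
           [-1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

M_E = matadd(matmul(M_T, M_D), M_delta)
# M_E now equals the essential-path matrix of Example 1: the single -1 in
# M_delta cancels the spurious 1 that the product places at cell (p5, d1).
```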

    3.4 Minimum Description Length Principle

    In order to manage the unknown number of clusters and to select a good partitioning from all possible partitions of HTML documents, we employ Rissanen's MDL principle [20], [21]. The MDL principle states that the best model inferred from a given set of data is the one which minimizes the sum of 1) the length of the model, in bits, and 2) the length of the encoding of the data, in bits, when described with the help of the model. We refer to the above sum for a model as the MDL cost of the model.

    In our setting, the model is a clustering C, which is described by partitions of documents with their template paths (i.e., the matrices MT and MD), and the encoding of the data is the matrix MΔ. The MDL costs of a clustering model C and a matrix M are denoted as L(C) and L(M), respectively. Considering the values in a matrix as a random variable X, Pr(1) and Pr(−1) are the probabilities of 1s and −1s in the matrix, and Pr(0) is that of zeros. Then, the entropy H(X) of the random variable X [9], [19] is as follows:

        H(X) = − Σ_{x ∈ {−1, 0, 1}} Pr(x) log2 Pr(x),  and

        L(M) = |M| · H(X).


    Fig. 4. A snippet of search results by Google.

    The MDL costs of MT and MΔ (i.e., L(MT) and L(MΔ)) are calculated by the above formula. For MD, we use another method to calculate its MDL cost. The reason is that the random variable X in MD is not mutually independent, since we allow a document to be included in a single cluster only (i.e., each column has only a single value of 1). Thus, we encode MD by |D| cluster IDs. Since the number of bits to represent a cluster ID is log2 |D|, the total number of bits to encode MD (i.e., L(MD)) becomes |D| log2 |D|. Then, the MDL cost of a clustering model C is defined as the sum of those of the three matrices (i.e., L(C) = L(MT) + L(MD) + L(MΔ)). According to the MDL principle, for two clustering models C = (MT, MD) and C′ = (M′T, M′D), we say that C is a better clustering than C′ if L(C) is less than L(C′).

    Example 4. Reconsider the clustering C in Example 3. Then, with MT in Example 3, Pr(1) = 7/32 and Pr(0) = 25/32, and we have L(MT) = |MT| · H(X) = 32 · (−(7/32) log2(7/32) − (25/32) log2(25/32)) ≈ 24.25. Similarly, L(MD) = 8 and L(MΔ) = 6.42, and thus, L(C) = 38.67. For another clustering C′, let us assume that M′T, M′D, and M′Δ are as follows:

    M′T =
      [ 1 0 0 0 ]
      [ 1 0 0 0 ]
      [ 1 0 0 0 ]
      [ 1 0 0 0 ]
      [ 0 0 0 0 ]
      [ 0 0 0 0 ]
      [ 0 0 0 0 ]
      [ 0 0 0 0 ]

    M′D =
      [ 1 1 1 1 ]
      [ 0 0 0 0 ]
      [ 0 0 0 0 ]
      [ 0 0 0 0 ]

    M′Δ =
      [ 0 0 0  0 ]
      [ 0 0 0  0 ]
      [ 0 0 0 -1 ]
      [ 0 0 0 -1 ]
      [ 0 1 1  0 ]
      [ 0 0 0  0 ]
      [ 0 0 0  0 ]
      [ 0 0 0  0 ]

    Then, L(M′T) = 17.39, L(M′D) = 8, L(M′Δ) = 21.39, and thus, L(C′) = 46.78. Since L(C) < L(C′), we say C is a better clustering than C′. It is natural to think that d1, d2, and d3 are generated by the same template and that d4 looks different from the others. Thus, the clustering C, grouping d1, d2, and d3 together and isolating d4, is better than the other clustering C′, grouping d1, d2, d3, and d4 altogether. Thus, we can see that it is intuitively reasonable to prefer C to C′.
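    The numbers in Example 4 can be reproduced with a short script (a sketch of the cost definitions above, not the authors' implementation):

```python
from math import log2

def entropy_cost(M):
    # L(M) = |M| * H(X), H(X) = -sum over x in {-1, 0, 1} of Pr(x) log2 Pr(x).
    cells = [x for row in M for x in row]
    n = len(cells)
    counts = {v: cells.count(v) for v in (-1, 0, 1)}
    return n * -sum((c / n) * log2(c / n) for c in counts.values() if c > 0)

def mdl_cost(M_T, M_D, M_delta):
    # L(C) = L(M_T) + L(M_D) + L(M_delta), with L(M_D) = |D| log2 |D|.
    n_docs = len(M_D[0])
    return entropy_cost(M_T) + n_docs * log2(n_docs) + entropy_cost(M_delta)

zeros = [0, 0, 0, 0]
# Clustering C of Example 3: c1 = ({p1..p5}, {d1,d2,d3}), c2 = ({p1,p2}, {d4}).
cost_C = mdl_cost(
    [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0],
     zeros, zeros, zeros],
    [[1, 1, 1, 0], [0, 0, 0, 1], zeros, zeros],
    [zeros, zeros, zeros, zeros, [-1, 0, 0, 0], zeros, zeros, zeros],
)
# Clustering C' of Example 4: a single cluster {d1..d4} with template {p1..p4}.
cost_C2 = mdl_cost(
    [[1, 0, 0, 0]] * 4 + [zeros] * 4,
    [[1, 1, 1, 1], zeros, zeros, zeros],
    [zeros, zeros, [0, 0, 0, -1], [0, 0, 0, -1], [0, 1, 1, 0],
     zeros, zeros, zeros],
)
# cost_C ≈ 38.67 < cost_C2 ≈ 46.78, so C is preferred, as in the text.
```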

    3.5 Problem Formulation

    The formal problem formulation is as follows:

    Problem 1. Given a web document set D and its essential path matrix ME, find the best clustering model C with MT and MD to minimize the MDL cost L(C).

    In the remainder of this paper, we shall investigate how we can cluster a number of web documents to minimize the MDL cost.

    4 CLUSTERING WITH MDL COST

    4.1 Agglomerative Clustering Algorithm

    Our clustering algorithm TEXT-MDL is presented in Fig. 5. The input parameter is a set of documents D = {d1, ..., dn}, where di is the ith document. The output is a set of clusters C = {c1, ..., cm}, where ci is a cluster represented by the template paths Ti and the member documents Di (i.e., ci = (Ti, Di)). A clustering model C is denoted by two matrices MT and MD, and the goodness measure of the clustering C is the MDL cost L(C), which is the sum of L(MT), L(MD), and L(MΔ).

    TEXT-MDL is an agglomerative hierarchical clustering algorithm which starts with each input document as an individual cluster (in line 1). When a pair of clusters is merged, the MDL cost of the clustering model can be reduced or increased. The procedure GetBestPair finds the pair of clusters whose reduction of the MDL cost is maximal in each step of merging, and the pair is repeatedly merged until no reduction is possible. In order to calculate the MDL cost when each possible pair of clusters is merged, the procedure GetMDLCost(ci, cj, C), where ci and cj are the pair to be merged and C is the current clustering, is called in GetBestPair, and C is updated by merging the best pair of clusters.

    As we will discuss later in detail, because the scale of the MDL cost reduction by merging a pair of clusters is affected by all the other clusters, GetBestPair should recalculate the MDL cost reduction of every pair at each iteration of the while loop in line 7. Furthermore, the complexity of GetMDLCost is exponential in the size of the template of a cluster. Since it is not practical to use TEXT-MDL with a large number of web documents, we will introduce an approximate MDL cost model and use MinHash to significantly reduce the time complexity.
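    A minimal skeleton of this greedy agglomerative loop follows, with a toy stand-in cost function (the real algorithm uses the MDL cost and the bookkeeping of Fig. 5); note how it recomputes every pair's cost each round, which is exactly the expense the text describes:

```python
from itertools import combinations

def agglomerate(items, clustering_cost):
    """Start from singletons; repeatedly merge the pair that lowers the
    total cost the most; stop when no merge lowers it (cf. Fig. 5)."""
    clusters = [frozenset([x]) for x in items]
    while len(clusters) > 1:
        current = clustering_cost(clusters)
        best = None
        for a, b in combinations(clusters, 2):
            candidate = [c for c in clusters if c not in (a, b)] + [a | b]
            cost = clustering_cost(candidate)
            if cost < current and (best is None or cost < best[0]):
                best = (cost, a, b)
        if best is None:
            break                      # no merge reduces the cost
        _, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

# Toy cost: per cluster, the size of a "template" (paths shared by all
# members) plus each member's deviation from it -- a crude MDL-like measure.
docs = {
    "d1": {"p1", "p2", "p3", "p4", "p6"},
    "d2": {"p1", "p2", "p3", "p4", "p5", "p7"},
    "d3": {"p1", "p2", "p3", "p4", "p5", "p8"},
    "d4": {"q1", "q2"},
}

def toy_cost(clusters):
    total = 0
    for c in clusters:
        template = set.intersection(*(docs[d] for d in c))
        total += len(template) + sum(len(docs[d] ^ template) for d in c)
    return total

result = agglomerate(list(docs), toy_cost)
# d1, d2, d3 share a template and are merged; the unrelated d4 stays alone.
```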


    Fig. 5. The TEXT-MDL algorithm.

  • 4.2 Computation of Optimal MDL Cost

    4.2.1 Optimal Template Paths of Clusters

    As mentioned previously, we represent a clustering model C by two matrices MT and MD, and the MDL cost L(C) is the sum of L(MT), L(MD), and L(MΔ). Let us start by discussing the independence of L(MD) from L(C) and the sparsity of ME.

    Independence of L(MD) from L(C). Because L(MD) is constant for every MD as |D| log2 |D|, the value of L(C) is not affected by that of L(MD). Thus, minimizing L(C) is the same as minimizing the sum of L(MT) and L(MΔ) only.

    Sparsity of ME. Since web documents are made by different templates of various sites, the web documents seldom have common paths. At shallow depths, some paths can commonly occur in heterogeneous documents since the kinds of tags which can be placed at the first or second depth are limited. However, as the depth increases, the possibility that a path appears commonly in heterogeneous documents decreases exponentially. Thus, in the rest of the paper, we assume that the matrices, such as ME, are sparse (i.e., zero is more frequent than the other values in a matrix). If this assumption does not hold in an extreme case, we can add as many empty documents as the number of documents in D. Then, the empty documents are represented by only zeros in ME, and the zeros in ME become more than half of ME.

    Candidacy of template paths. For a cluster ci = (Ti, Di) of a clustering model C, only the essential paths of the documents in Di can be included in the optimal template paths Ti to minimize the MDL cost of C, as shown in the following lemma:

    Lemma 1. For a cluster ci = (Ti, Di) of a clustering model C, if Ti is the optimal template of ci, then, to minimize L(C), Ti must be a subset of Ei = ∪_{dk ∈ Di} E(dk).

    Proof. Assume that the optimal template Ti to minimize L(C) contains a path pk which is not included in Ei. Then, in the product of MT and MD (i.e., MT·MD), the kth row of the columns corresponding to Di is filled with 1s. However, since pk does not appear in any document in Di, the kth row of the columns corresponding to Di in MΔ should be filled with −1s.

    Now let us exclude pk from Ti. Then, in MT, the number of 1s decreases by one, and thus, Pr(1) decreases. Since we assume the sparsity of the matrices (i.e., Pr(1) ≤ Pr(0)), H(X) on MT will decrease. Now, in MT·MD, the kth row of the columns corresponding to Di is changed to 0s. Then, the −1s in the kth row of the columns corresponding to Di in MΔ should be replaced with 0s. Thus, the number of −1s in MΔ will decrease by |Di| and that of 0s will increase by |Di|. This results in a smaller value of Pr(−1), a larger value of Pr(0), and the same value of Pr(1) in MΔ. Therefore, H(X) on MΔ decreases as well. Since the H(X)s on both MT and MΔ decrease, L(C) decreases as well. This contradicts the assumption that Ti containing pk is the optimal template of ci to minimize L(C). Thus, we can conclude that Ti contains only the essential paths of the documents in Di (i.e., Ti ⊆ Ei). □

    Decision of optimal template paths. According to Lemma 1, the optimal template Ti is a subset of the essential path set Ei = ∪_{dk ∈ Di} E(dk), where E(dk) is the set of essential paths of a document dk in Di. However, the number of subsets of Ei is exponential in the size of Ei, and furthermore, the optimal template Ti of ci depends on the other clusters. The following example shows that the optimal template Ti of ci can change depending on the template of the other cluster:

    Example 5. For ME in Example 1, consider two different template matrices MT and M′T and the common MD as follows. Let us calculate the optimal T1 of c1 under each template matrix:

    MT (the first column, holding T1, is to be determined) =
      [ · 1 0 0 ]
      [ · 1 0 0 ]
      [ · 0 0 0 ]
      [ · 0 0 0 ]
      [ · 0 0 0 ]
      [ · 0 0 0 ]
      [ · 0 0 0 ]
      [ · 0 0 0 ]

    M′T (the first column, holding T1, is to be determined) =
      [ · 0 0 0 ]
      [ · 0 0 0 ]
      [ · 0 0 0 ]
      [ · 0 0 0 ]
      [ · 0 0 0 ]
      [ · 0 0 0 ]
      [ · 0 0 0 ]
      [ · 0 0 0 ]

    MD =
      [ 1 1 1 0 ]
      [ 0 0 0 1 ]
      [ 0 0 0 0 ]
      [ 0 0 0 0 ]

    Recall that E1 of c1 is {p1, p2, p3, p4, p5}. If we examine all subsets of E1, L(C) with MT is minimized when T1 = {p1, p2, p3, p4, p5}, but that with M′T is minimized when T1 = {p1, p2, p3, p4}. Thus, we can see that the optimal template of a cluster can be different depending on the other clusters.

    Therefore, we should consider all combinations of the subsets of the Ti's in C to calculate the minimum L(C) when merging a pair of clusters, which is impractical.

    4.2.2 Approximate Optimal Template Paths

    Approximation of entropy. The dependency between the optimal templates of clusters is caused by the fact that the tangent of H(X) is not constant. As shown in Fig. 6, the same difference between Pr(X)s can result in various differences between H(X)s. Thus, the optimal template of a cluster can be different for each instance of the other clusters. In order to eliminate this dependency, we approximate the entropy model by making the tangent value the same everywhere. In other words, we approximate the entropy model with 0/1 values by a straight line and the entropy model with 0/1/−1 values by a plane.

    Let us define the domain and range of the approximateentropy model. In the agglomerative hierarchical clustering,


    Fig. 6. Different H(X) differences for the same Pr(X) difference.

    each cluster has only a single web document initially. Let Cinit be the initial clustering. In Cinit, MT is the same as ME, MD is the identity matrix I (i.e., only the diagonal elements are 1s), and every value in MΔ is 0. Since we merge a pair of clusters only if the MDL cost of clustering is reduced by merging the pair, L(Cinit) (i.e., L(ME) + L(I)) is an upper bound of L(C) (i.e., L(MT) + L(MD) + L(MΔ)) for any clustering C which can be produced by our agglomerative hierarchical algorithm. Since L(MD) is constant, L(MT) + L(MΔ) ≤ L(ME) for C, and thus, L(MT) ≤ L(ME) and L(MΔ) ≤ L(ME) hold. Let α and β be Pr(1) and H(X) of ME, respectively. Then, we can derive the following inequalities for MT and MΔ:

    L(M_T) or L(M_Δ) ≤ L(M_E),
    |M_T| · (H(X) of M_T) or |M_Δ| · (H(X) of M_Δ) ≤ |M_E| · β,
    H(X) of M_T or H(X) of M_Δ ≤ β.

    If the H(X)s of M_T and M_Δ are at most β, Pr(1) of M_T is at most α, and Pr(1) + Pr(−1) of M_Δ is at most α.

    Thus, we approximate H(X) of the 0/1 value matrix M_T by the line from (0, 0) to (α, β) (i.e., H'(X) = (β/α) · Pr(1)). We also approximate H(X) of the 0/1/−1 matrix M_Δ by the plane through the three points (0, 0, 0), (α, 0, β), and (0, α, β) (i.e., H'(X) = (β/α) · (Pr(1) + Pr(−1))). For example, the approximations with α = 0.2 and β = −0.2 log₂ 0.2 − 0.8 log₂ 0.8 are given in Figs. 7a and 7b, where H'(X) is the approximate line or plane and H(X) is the original entropy. In summary, the MDL cost of C is approximated as

    (β/α) · (|M_T| · Pr(1) of M_T + |M_Δ| · (Pr(1) + Pr(−1)) of M_Δ) + L(M_D)
    = (β/α) · (# of 1s in M_T + # of 1s and −1s in M_Δ) + L(M_D).   (1)

    The following theorem shows the approximation ratio of L(C'_opt) to L(C_opt), where C_opt is the optimal clustering with the original entropy model and C'_opt is that with the approximate entropy model:

    Theorem 1. Let α be Pr(1) of M_E and β = H(X) of M_E = α log₂(1/α) + (1 − α) log₂(1/(1 − α)). Then, the following inequality holds:

    L(C'_opt)/L(C_opt) ≤ [α log₂(4/α) + (2 − α) log₂(2/(2 − α))] / [α log₂(1/α) + (1 − α) log₂(1/(1 − α))].

    Proof. Refer to the Appendix. □

    The approximation ratio depends on Pr(1) of M_E. Since Pr(1) of M_E is a constant for the given document set D, the approximation ratio is considered a constant for D. In Fig. 7c, the approximation ratios are given for various Pr(1) values. For example, when Pr(1) is 0.2, the approximation ratio of L(C'_opt) to L(C_opt) is 1.576.
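    The bound of Theorem 1 is easy to check numerically. The sketch below (plain Python; the helper name `approx_ratio` is ours, not the paper's) evaluates the bound as a function of α = Pr(1) of M_E and reproduces the 1.576 figure reported for Pr(1) = 0.2:

```python
import math

def binary_entropy(p):
    # H(p) in bits for a 0/1 source with Pr(1) = p; this is beta when p = alpha
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def approx_ratio(alpha):
    # Upper bound of L(C'_opt)/L(C_opt) from Theorem 1 as a function of alpha
    numerator = alpha * math.log2(4 / alpha) + (2 - alpha) * math.log2(2 / (2 - alpha))
    return numerator / binary_entropy(alpha)

print(round(approx_ratio(0.2), 3))  # 1.576, matching Fig. 7c at Pr(1) = 0.2
```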

    Optimal template with approximate entropy. The dependency between the optimal templates of clusters is removed when we calculate the MDL cost with the approximate entropy model. Furthermore, we do not need to examine all the subsets of E_i of c_i to decide the approximate optimal template, as shown in the following theorem:

    Theorem 2. Let supp(p_k, D_i) be the number of documents in D_i which contain a path p_k as an essential path. Then, the subset of E_i consisting of all the paths whose supp(p_k, D_i) ≥ (|D_i| + 1)/2 is the optimal T_i of c_i with the approximate entropy model.

    Proof. In order to minimize the MDL cost with the approximate entropy model, we need to minimize the numbers of 1s and −1s in M_T and M_Δ in (1).

    If a path p_k is included in T_i, p_k contributes a single 1 in M_T and |D_i| − supp(p_k, D_i) −1s in M_Δ, since |D_i| − supp(p_k, D_i) documents in D_i do not have p_k as an essential path and this information is indicated as −1 in M_Δ. If a path p_k is not included in T_i, p_k does not contribute anything in M_T, but contributes supp(p_k, D_i) 1s in M_Δ, since there are supp(p_k, D_i) documents in D_i containing p_k as an essential path which is not covered by T_i. In summary, when p_k is included in T_i, the sum of the numbers of 1s and −1s in M_T and M_Δ contributed by p_k is 1 + |D_i| − supp(p_k, D_i). When p_k is not included in T_i, it is supp(p_k, D_i). Thus, if 1 + |D_i| − supp(p_k, D_i) is less than or equal to supp(p_k, D_i), p_k is included in T_i. Otherwise, p_k is not. Therefore, the optimal T_i consists of only the paths in E_i whose supp(p_k, D_i) ≥ (|D_i| + 1)/2. □

    Example 6. Let us reconsider Example 5. Since D_1 = {d_1, d_2, d_3}, (|D_1| + 1)/2 = 2. According to Theorem 2, we just need to get the paths whose supp(p_k, D_1) is greater than or equal to 2. E_1 = {p_k | 1 ≤ k ≤ 5}, supp(p_k, D_1) = 3 for 1 ≤ k ≤ 4, and supp(p_5, D_1) = 2. Thus, E_1 itself is the optimal T_1 of c_1 with the approximate entropy model.
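    Theorem 2 reduces template selection to a simple majority test on path supports. A minimal sketch (modeling each document as its set of essential paths; the function name is ours, not the paper's):

```python
from collections import Counter

def optimal_template(docs):
    """Optimal T_i under the approximate entropy model (Theorem 2):
    keep every path whose support is at least (|D_i| + 1) / 2."""
    support = Counter()
    for paths in docs:
        support.update(set(paths))
    threshold = (len(docs) + 1) / 2
    return {p for p, s in support.items() if s >= threshold}

# Example 6: p1..p4 appear in all three documents and p5 in two,
# so all of E_1 = {p1, ..., p5} clears the threshold of 2.
docs = [{"p1", "p2", "p3", "p4", "p5"},
        {"p1", "p2", "p3", "p4", "p5"},
        {"p1", "p2", "p3", "p4"}]
print(optimal_template(docs) == {"p1", "p2", "p3", "p4", "p5"})  # True
```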

    GetMDLCost with the approximate entropy model is given in Fig. 8. The complexity of GetMDLCost is O(s), where s is the average size of E_i, since we just need to merge

    618 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 4, APRIL 2011

    Fig. 7. Approximation of entropy and approximation ratio. (a) Approximation of entropy for 0/1 matrices. (b) Approximation of entropy for 0/1/−1 matrices. (c) Approximation ratio with various Pr(1)s in M_E.

    supp(p_x, D_i)s and supp(p_x, D_j)s in c_i and c_j once. Moreover, GetBestPair(C) in line 7 of TEXT-MDL in Fig. 5 is replaced with GetBestPair(c_k, C) in Fig. 8. Since the optimal template of a cluster is independent of those of the other clusters owing to our approximation, we can reuse the MDL cost of merging each pair of clusters from the previous calls of GetBestPair. These pairs of clusters with their MDL costs are maintained in a heap structure, and the initial best pair is retrieved from the heap (line 1). Since the complexity of GetBestPair(c_k, C) is O(ns), that of TEXT-MDL becomes O(n²s).
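    The reuse of pairwise costs in a heap follows the standard best-pair agglomerative pattern. A sketch of that loop, with `merge_gain` standing in for the MDL-cost reduction computed by GetMDLCost (the concrete gain below is a toy overlap score, not the paper's cost):

```python
import heapq
from itertools import combinations

def agglomerative(docs, merge_gain):
    """Heap-driven agglomerative loop in the shape of TEXT-MDL: because the
    approximate model makes each pair's cost independent of the rest of the
    clustering, pair gains are computed once, kept in a heap, and stale
    entries are simply skipped. merge_gain(c1, c2) should be positive when
    merging clusters c1 and c2 reduces the MDL cost."""
    clusters = {i: frozenset([i]) for i in range(len(docs))}
    heap = [(-merge_gain(clusters[a], clusters[b]), a, b)
            for a, b in combinations(clusters, 2)]
    heapq.heapify(heap)
    next_id = len(docs)
    while heap:
        neg_gain, a, b = heapq.heappop(heap)
        if a not in clusters or b not in clusters:
            continue  # stale pair: one side was already merged away
        if -neg_gain <= 0:
            break  # no remaining merge reduces the MDL cost
        merged = clusters.pop(a) | clusters.pop(b)
        for k in clusters:  # push gains of the new cluster against survivors
            heapq.heappush(heap, (-merge_gain(merged, clusters[k]), next_id, k))
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values())

# Toy gain: shared paths minus differing paths across the two clusters.
docs = [{"a", "b"}, {"a", "b"}, {"z"}]
def gain(c1, c2):
    p1 = set().union(*(docs[i] for i in c1))
    p2 = set().union(*(docs[i] for i in c2))
    return len(p1 & p2) - len(p1 ^ p2)

result = agglomerative(docs, gain)  # documents 0 and 1 merge; 2 stays alone
```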

    5 ESTIMATION OF MDL COST WITH MINHASH

    In our problem, although we take only essential paths, the dimension of E_i is still high and the number of documents is large. Thus, the O(n²s) complexity of TEXT-MDL is still expensive. In order to alleviate this situation, we will present how we can estimate the MDL cost of a clustering by MinHash, not only to reduce the dimensions of documents but also to quickly find the best pair to be merged in the MinHash signature space.

    5.1 MinHash

    Jaccard's coefficient between two sets S_1 and S_2 is defined as ρ(S_1, S_2) = |S_1 ∩ S_2| / |S_1 ∪ S_2|, and the min-wise independent permutation is a well-known Monte Carlo technique that estimates the Jaccard coefficient by repeatedly assigning random ranks to the universal set and comparing the minimum values from the ranks of each set. Consider a set of random permutations Π = {π_1, ..., π_L} on a universal set U = {r_1, ..., r_M} and a set S_1 ⊆ U. Let π_i(r_j) be the rank of r_j in a permutation π_i and min_{π_i}(S_1) denote min{π_i(r_j) | r_j ∈ S_1}. Π is called min-wise independent if we have Pr(min_{π_i}(S_1) = π_i(x)) = 1/|S_1| for every set S_1 ⊆ U and every x ∈ S_1, for all π_i ∈ Π. Then, for any sets S_1, S_2 ⊆ U and all π_i ∈ Π, we have Pr(min_{π_i}(S_1) = min_{π_i}(S_2)) = ρ(S_1, S_2), where ρ(S_1, S_2) is the Jaccard coefficient defined previously.

    To estimate ρ(S_1, S_2) (denoted as ρ̂(S_1, S_2)), the signature vector of S_1 is constructed as follows:

    sig(S_1) = (min_{π_1}(S_1), min_{π_2}(S_1), ..., min_{π_L}(S_1)),

    and similarly, sig(S_2) is produced for S_2. The ith entry of vector sig(S_1) is denoted as sig(S_1)[i]. By matching the signatures sig(S_1) and sig(S_2) per permutation, ρ(S_1, S_2) can be estimated as

    ρ̂(S_1, S_2) = |{i | sig(S_1)[i] = sig(S_2)[i]}| / |Π|.   (2)

    In practice, |Π| does not need to be large for ρ̂(S_1, S_2) to be a good estimate of ρ(S_1, S_2) [5], [7].

    Example 7. Consider the two sets A = {r_1, r_3, r_5} and B = {r_4, r_5}, where the universal set is U = {r_1, ..., r_5}. Assume four random permutations {π_1, ..., π_4} in Fig. 9, where each column represents the permuted ranks of r_i. Then, we get the signature vectors of A and B as shown in Fig. 9. Under π_1, the minimum of the three mapped values for A is 1. Thus, the signature value of A becomes 1. Similarly, the signature values under the other permutations can be computed. Then, by counting the number of matching dimensions of the two signatures, the Jaccard coefficient is estimated to be 1 out of 4 (i.e., 1/4).
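    The construction of Example 7 is easy to reproduce with random permutations (a sketch; the function names are ours, and we use L = 200 permutations rather than the four of Fig. 9 so the estimate concentrates near the true coefficient):

```python
import random

def signatures(sets, universe, L, seed=0):
    """Plain MinHash: draw L random permutations of the universe and record,
    for each set, the minimum permuted rank under each permutation."""
    rng = random.Random(seed)
    perms = []
    for _ in range(L):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)
        perms.append(dict(zip(universe, ranks)))
    return {name: [min(p[x] for x in s) for p in perms]
            for name, s in sets.items()}

def jaccard_estimate(sig1, sig2):
    # Fraction of permutations where the two minima coincide, as in Eq. (2)
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

U = [f"r{i}" for i in range(1, 6)]
sigs = signatures({"A": {"r1", "r3", "r5"}, "B": {"r4", "r5"}}, U, L=200)
est = jaccard_estimate(sigs["A"], sigs["B"])
# True Jaccard of A and B is |{r5}| / |{r1, r3, r4, r5}| = 0.25; est is close
```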

    For k sets S_1, ..., S_k, the Jaccard coefficient is defined as follows [7]:

    ρ(S_1, ..., S_k) = |S_1 ∩ ... ∩ S_k| / |S_1 ∪ ... ∪ S_k|.   (3)

    We can estimate ρ(S_1, ..., S_k) using the signature vectors as follows:

    ρ̂(S_1, ..., S_k) = |{i | sig(S_1)[i] = ... = sig(S_k)[i]}| / |Π|.   (4)

    5.2 Extended MinHash

    To compute the MDL cost of each clustering quickly, we would like to estimate the probability that a path appears in a certain number of documents in a cluster. However, the traditional MinHash was proposed to estimate the Jaccard


    Fig. 8. GetMDLCost/GetBestPair procedures.

    Fig. 9. An example of Jaccard's coefficient. (a) Permutations. (b) Signatures of sets A and B.

    coefficient. Thus, given a collection of sets X = {S_1, ..., S_k}, we extend MinHash to estimate the probabilities needed to compute the MDL cost.

    Let us start by defining the probability that an r_j ∈ ∪_{S ∈ X} S is included in m of the sets in X. We denote this probability as φ(X, m), which is computed as follows:

    φ(X, m) = |{r_j | r_j is included in m of the sets in X}| / |S_1 ∪ ... ∪ S_k|.

    Then, φ(X, m) is defined for 1 ≤ m ≤ |X|, and φ(X, |X|) is the same as the Jaccard coefficient of the sets in X.

    Now we introduce an extended signature of a collection of k sets, X = {S_1, ..., S_k}, for Π = {π_1, ..., π_L}, as follows:

    sig(X)[i] = (min_{S_j ∈ X} sig(S_j)[i], {j | S_j = argmin_{S_j ∈ X} sig(S_j)[i]}) for π_i ∈ Π,

    sig(X) = (sig(X)[1], ..., sig(X)[L]).
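    Under min-wise independence, the element attaining the global minimum rank over the union is uniformly distributed over the union, so the fraction of permutations whose argmin index set has size m estimates φ(X, m). A sketch of this idea (function names are ours):

```python
import random

def extended_signature(sets, universe, L, seed=0):
    """Extended MinHash signature: per permutation, keep the minimum rank
    over the union together with the indices of the sets attaining it."""
    rng = random.Random(seed)
    sig = []
    for _ in range(L):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)
        perm = dict(zip(universe, ranks))
        mins = [min(perm[x] for x in s) for s in sets]
        m = min(mins)
        sig.append((m, {j for j, v in enumerate(mins) if v == m}))
    return sig

def phi_estimate(sig, m):
    # The rank-minimizing element is uniform over the union, so the fraction
    # of permutations whose argmin index set has size m estimates phi(X, m).
    return sum(len(idx) == m for _, idx in sig) / len(sig)

A, B = {"r1", "r3", "r5"}, {"r4", "r5"}
sig = extended_signature([A, B], [f"r{i}" for i in range(1, 6)], L=400)
# phi(X, 2) = 1/4 (only r5 is in both sets), phi(X, 1) = 3/4
```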

    APPENDIX

    Proof of Theorem 1 (sketch). Let H_1(x) denote the per-cell entropy cost of a 0/1 matrix with Pr(1) = x, and H_2(x, y) that of a 0/1/−1 matrix with Pr(1) = x and Pr(−1) = y. Let M'_T and M'_Δ be the matrices of C'_opt; the dimensions of M_T, M'_T, M_Δ, and M'_Δ are all the same. Write t for the total fraction of 1s in M'_T plus 1s and −1s in M'_Δ. By repeatedly applying properties P1-P4 of H_1 and H_2 (concavity and monotonicity), the cost of C'_opt under the original entropy model satisfies

    L(C'_opt) ≤ 2 · H_2(t/4, t/4) · |M_E|,

    while the optimality of C_opt yields L(C_opt) ≥ H_1(t) · |M_E|. The function H_2(t/4, t/4)/H_1(t) is monotonically increasing until t = α, which is the upper bound of t, as shown in P5. Therefore,

    L(C'_opt)/L(C_opt) ≤ 2 · H_2(α/4, α/4)/β
    = [α log₂(4/α) + (2 − α) log₂(2/(2 − α))] / [α log₂(1/α) + (1 − α) log₂(1/(1 − α))]. □

    ACKNOWLEDGMENTS

    This work was supported by the National Research Foundation of Korea (NRF) Grant, funded by the Korea Government (MEST) (No. 2009-0078828).

    REFERENCES
    [1] Document Object Model (DOM) Level 1 Specification Version 1.0, http://www.w3.org/TR/REC-DOM-Level-1, 2010.
    [2] XPath Specification, http://www.w3.org/TR/xpath, 2010.
    [3] A. Arasu and H. Garcia-Molina, "Extracting Structured Data from Web Pages," Proc. ACM SIGMOD, 2003.
    [4] Z. Bar-Yossef and S. Rajagopalan, "Template Detection via Data Mining and Its Applications," Proc. 11th Int'l Conf. World Wide Web (WWW), 2002.
    [5] A.Z. Broder, M. Charikar, A.M. Frieze, and M. Mitzenmacher, "Min-Wise Independent Permutations," J. Computer and System Sciences, vol. 60, no. 3, pp. 630-659, 2000.
    [6] D. Chakrabarti, R. Kumar, and K. Punera, "Page-Level Template Detection via Isotonic Smoothing," Proc. 16th Int'l Conf. World Wide Web (WWW), 2007.
    [7] Z. Chen, F. Korn, N. Koudas, and S. Muthukrishnan, "Selectivity Estimation for Boolean Queries," Proc. ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS), 2000.
    [8] J. Cho and U. Schonfeld, "RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee," Proc. Int'l Conf. Very Large Data Bases (VLDB), 2007.
    [9] T.M. Cover and J.A. Thomas, Elements of Information Theory. Wiley-Interscience, 1991.
    [10] V. Crescenzi, G. Mecca, and P. Merialdo, "RoadRunner: Towards Automatic Data Extraction from Large Web Sites," Proc. 27th Int'l Conf. Very Large Data Bases (VLDB), 2001.
    [11] V. Crescenzi, P. Merialdo, and P. Missier, "Clustering Web Pages Based on Their Structure," Data and Knowledge Eng., vol. 54, pp. 279-299, 2005.
    [12] M. de Castro Reis, P.B. Golgher, A.S. da Silva, and A.H.F. Laender, "Automatic Web News Extraction Using Tree Edit Distance," Proc. 13th Int'l Conf. World Wide Web (WWW), 2004.
    [13] I.S. Dhillon, S. Mallela, and D.S. Modha, "Information-Theoretic Co-Clustering," Proc. ACM SIGKDD, 2003.
    [14] M.N. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim, "XTRACT: A System for Extracting Document Type Descriptors from XML Documents," Proc. ACM SIGMOD, 2000.
    [15] D. Gibson, K. Punera, and A. Tomkins, "The Volume and Evolution of Web Page Templates," Proc. 14th Int'l Conf. World Wide Web (WWW), 2005.
    [16] K. Lerman, L. Getoor, S. Minton, and C. Knoblock, "Using the Structure of Web Sites for Automatic Segmentation of Tables," Proc. ACM SIGMOD, 2004.
    [17] B. Long, Z. Zhang, and P.S. Yu, "Co-Clustering by Block Value Decomposition," Proc. ACM SIGKDD, 2005.
    [18] F. Pan, X. Zhang, and W. Wang, "CRD: Fast Co-Clustering on Large Data Sets Utilizing Sampling-Based Matrix Decomposition," Proc. ACM SIGMOD, 2008.
    [19] M.D. Plumbley, "Clustering of Sparse Binary Data Using a Minimum Description Length Approach," http://www.elec.qmul.ac.uk/staffinfo/markp/, 2002.
    [20] J. Rissanen, "Modeling by Shortest Data Description," Automatica, vol. 14, pp. 465-471, 1978.
    [21] J. Rissanen, Stochastic Complexity in Statistical Inquiry. World Scientific, 1989.
    [22] K. Vieira, A.S. da Silva, N. Pinto, E.S. de Moura, J.M.B. Cavalcanti, and J. Freire, "A Fast and Robust Method for Web Page Template Detection and Removal," Proc. 15th ACM Int'l Conf. Information and Knowledge Management (CIKM), 2006.
    [23] Y. Zhai and B. Liu, "Web Data Extraction Based on Partial Tree Alignment," Proc. 14th Int'l Conf. World Wide Web (WWW), 2005.
    [24] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu, "Fully Automatic Wrapper Generation for Search Engines," Proc. 14th Int'l Conf. World Wide Web (WWW), 2005.
    [25] H. Zhao, W. Meng, and C. Yu, "Automatic Extraction of Dynamic Record Sections from Search Engine Result Pages," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB), 2006.
    [26] S. Zheng, D. Wu, R. Song, and J.-R. Wen, "Joint Optimization of Wrapper Generation and Template Detection," Proc. ACM SIGKDD, 2007.

    Chulyun Kim received the BS degree in computer engineering, the MS degree in cognitive science, and the PhD degree in electrical engineering and computer science from Seoul National University, in 1996, 1998, and 2010, respectively. He received the Microsoft Research Asia Fellowship in 2004. He is currently an assistant professor at Kyungwon University, Korea. His research is focused on data mining, web mining, data approximation, and bioinformatics and interdisciplinary informatics.

    Kyuseok Shim received the BS degree in electrical engineering from Seoul National University in 1986, and the MS and PhD degrees in computer science from the University of Maryland, College Park, in 1988 and 1993, respectively. He is currently a professor at Seoul National University, Korea. Before that, he was an assistant professor at KAIST, a member of technical staff for the Serendip Data Mining Project at Bell Laboratories, and a research staff member for the Quest Data Mining Project at the IBM Almaden Research Center. He has been working in the area of databases, focusing on data mining, data privacy, Internet search engines, data warehousing, query processing, query optimization, histograms, and XML. His writings have appeared in a number of professional conferences and journals, including ACM and IEEE publications. He served previously on the editorial boards of the VLDB and TKDE journals. He also served as a PC member for the SIGKDD, SIGMOD, ICDE, ICDM, PAKDD, VLDB, and WWW conferences. He is a member of the IEEE.

