
Enriching XML Documents Clustering by using

Concise Structure and Content

By

Sangeetha Kutty MCIS. (Auckland University of Technology, New Zealand),

B.Eng. (University of Madras, India)

Thesis submitted for the degree of Doctor of Philosophy to the

Faculty of Science and Technology at

Queensland University of Technology

Brisbane, Queensland, Australia

2011


Keywords

XML documents, clustering, frequent subtree mining, structure, content, Vector Space Model (VSM), tensors, Tensor Space Model (TSM), paths, graphs, trees, subtrees, induced subtrees, embedded subtrees, constraints, closed, maximal, apriori, prefix-based pattern growth, matricization, INEX, Wikipedia, ACM dataset, IEEE dataset, tensor, random projection, random indexing.


Abstract

With the growing number of XML documents on the Web, it becomes essential to organise these documents effectively in order to retrieve useful information from them. A possible solution is to apply clustering to XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. However, many issues arise in discovering knowledge from these semi-structured documents due to their heterogeneity and structural irregularity. Most existing research on clustering techniques focuses on only one feature of XML documents, either their structure or their content, because of scalability and complexity problems. The knowledge gained in the form of clusters based on structure or content alone is not suitable for real-life datasets. It therefore becomes essential to include both the structure and the content of XML documents in order to improve the accuracy and meaning of the clustering solution. However, including both kinds of information in the clustering process imposes a huge overhead on the underlying clustering algorithm because of the high dimensionality of the data.

The overall objective of this thesis is to address these issues by: (1) proposing methods that utilise frequent pattern mining techniques to reduce the dimensionality; (2) developing models to effectively combine the structure and content of XML documents; and (3) utilising the proposed models in clustering. This research first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. A clustering framework with two types of models, implicit and explicit, is developed. The implicit model uses a Vector Space Model (VSM) to combine the structure and the content information. The explicit model uses a higher-order model, namely a third-order Tensor Space Model (TSM), to explicitly combine the structure and the content information. This thesis also proposes a novel incremental technique to decompose large tensor models and to utilise the decomposed solution for clustering the XML documents.
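The third-order TSM idea can be sketched in a few lines. The snippet below is a hypothetical illustration, not the thesis implementation: it builds a mode-3 tensor with modes document × structure feature × content term (all dimensions and the `unfold` helper are illustrative assumptions) and performs a mode-1 matricization, the unfolding step that decomposition algorithms operate on.

```python
import numpy as np

# Hypothetical mode-3 tensor: 4 documents x 3 structure features x 5 content terms.
# Entry T[d, s, t] counts occurrences of term t within structure feature s of document d.
rng = np.random.default_rng(0)
T = rng.integers(0, 3, size=(4, 3, 5))

def unfold(tensor, mode):
    """Mode-n matricization: bring `mode` to the front, then flatten the remaining modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# Mode-1 matricization yields one row per document (4 x 15),
# a flat matrix that a factorisation or clustering step can consume.
T1 = unfold(T, 0)
print(T1.shape)  # (4, 15)
```

Unfolding along the document mode is what lets a higher-order representation be handed to matrix-based machinery without discarding the structure/content distinction encoded in the other two modes.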

The proposed framework and its components were extensively evaluated on several real-life datasets exhibiting extreme characteristics in order to understand the usefulness of the proposed framework in real-life situations. Additionally, this research evaluates the outcome of the clustering process on the collection selection problem in information retrieval, using the Wikipedia dataset. The experimental results demonstrate that the proposed frequent pattern mining and clustering methods outperform the related state-of-the-art approaches. In particular, the proposed framework of utilising frequent structures to constrain the content shows an improvement in accuracy over content-only and structure-only clustering results. The scalability experiments conducted on large-scale datasets clearly show the strengths of the proposed methods over state-of-the-art methods.

Overall, this thesis contributes to effectively combining the structure and the content of XML documents for clustering, in order to improve the accuracy of the clustering solution. In addition, it addresses the research gaps in frequent pattern mining by generating efficient and concise frequent subtrees with various node relationships that can be used in clustering.


Table of Contents

Keywords iii

Abstract v

List of Figures xi

List of Tables xv

Glossary xvii

Statement of Original Authorship xix

Publications xxi

Acknowledgements xxiii

Chapter 1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.3 Research aims & objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.4 Research significance and contributions . . . . . . . . . . . . . . . . . . . . 9

1.5 Thesis overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Chapter 2 Background and Literature Review 13

2.1 XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2 Data models for XML mining . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.2.1 Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . . . . . . 20

2.2.2 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.2.3 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.2.4 Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.3 XML clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.3.1 Based on structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.3.1.1 Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . 29

2.3.1.2 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.3.1.3 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.3.1.4 Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.3.2 Based on content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

2.3.2.1 Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . 36


2.3.3 Based on structure and content . . . . . . . . . . . . . . . . . . . . . 37

2.3.3.1 Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . 37

2.3.3.2 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.3.3.3 Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

2.3.3.4 Tensor Space Model (TSM) . . . . . . . . . . . . . . . . . . 41

2.3.4 Research gaps in XML clustering . . . . . . . . . . . . . . . . . . . . 43

2.4 Frequent pattern mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

2.4.1 An overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2.4.2 Frequent pattern mining methods . . . . . . . . . . . . . . . . . . . . 47

2.4.2.1 Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . 47

2.4.2.2 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2.4.2.3 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

2.4.2.4 Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

2.4.3 Research gaps in frequent pattern mining . . . . . . . . . . . . . . . 59

2.5 Summary and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

2.6 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

Chapter 3 Research Design 63

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.2 Research Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

3.2.1 Phase-One: Pre-processing . . . . . . . . . . . . . . . . . . . . . . . 64

3.2.2 Phase-Two: Frequent Pattern Mining . . . . . . . . . . . . . . . . . 66

3.2.3 Phase-Three (a): Clustering using VSM . . . . . . . . . . . . . . . . 67

3.2.4 Phase-Three (b): Clustering using TSM . . . . . . . . . . . . . . . . 67

3.3 Experiment Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

3.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

3.4.1 Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

3.4.2 Real-life Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

3.4.2.1 Small-sized real-life dataset . . . . . . . . . . . . . . . . . . 69

3.4.2.2 Medium-sized real-life dataset . . . . . . . . . . . . . . . . 70

3.4.2.3 Large-sized real-life datasets . . . . . . . . . . . . . . . . . 71

3.5 Evaluation measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

3.5.1 Frequent pattern mining . . . . . . . . . . . . . . . . . . . . . . . . . 76

3.5.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

3.5.3 Collection selection evaluation using NCCG measure . . . . . . . . . 80

3.6 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

3.6.1 Frequent pattern mining . . . . . . . . . . . . . . . . . . . . . . . . . 83

3.6.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

3.6.2.1 Based on representations . . . . . . . . . . . . . . . . . . . 84

3.6.2.2 Based on other clustering methods from INEX . . . . . . . 85

3.6.2.3 Clustering using different tensor decompositions . . . . . . 88

3.7 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

Chapter 4 Frequent Pattern Mining of XML Documents 91


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.2 Pre-Processing of the structure in XML documents . . . . . . . . . . . . . . 93

4.3 Types of subtrees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

4.3.1 Concise Frequent Induced (CFI) subtrees . . . . . . . . . . . . . . . 95

4.3.2 Concise Frequent Embedded (CFE) subtrees . . . . . . . . . . . . . 99

4.4 Frequent subtree mining: Background . . . . . . . . . . . . . . . . . . . . . 102

4.4.1 The 1-Length frequent subtree generation . . . . . . . . . . . . . . . 102

4.4.2 Projecting the dataset using the prefix trees . . . . . . . . . . . . . . 104

4.5 Concise frequent subtree mining: Proposed techniques . . . . . . . . . . . . 106

4.5.1 Search space reduction using the backward scan . . . . . . . . . . . 107

4.5.2 Node extension concise checking . . . . . . . . . . . . . . . . . . . . 108

4.6 Methods using the proposed techniques for generating concise frequent subtrees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

4.6.1 Generating concise frequent induced subtrees . . . . . . . . . . . . . 111

4.6.1.1 Prefix-based Closed Induced Tree Miner (PCITMiner) . . . 114

4.6.1.2 Prefix-based Maximal Induced Tree Miner (PMITMiner) . 115

4.6.1.3 Length Constrained Prefix-based Closed Induced Tree Miner (PCITMinerConst) . . . . . . . . . . . . . . . . . . . . . . 116

4.6.1.4 Length Constrained Prefix-based Maximal Induced Tree Miner (PMITMinerConst) . . . . . . . . . . . . . . . . . . . . . . 117

4.6.2 Generating concise frequent embedded subtrees . . . . . . . . . . . . 118

4.6.2.1 Prefix-based Closed Embedded Tree Miner (PCETMiner) . 119

4.6.2.2 Prefix-based Maximal Embedded Tree Miner (PMETMiner) . . 119

4.6.2.3 Length Constrained Prefix-based Closed Embedded Tree Miner (PCETMinerConst) . . . . . . . . . . . . . . . . . . . . . . 120

4.6.2.4 Length Constrained Prefix-based Maximal Embedded Tree Miner (PMETMinerConst) . . . . . . . . . . . . . . . . . . . . . 121

4.7 Empirical evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

4.7.1 Evaluation of frequent pattern mining methods on synthetic datasets 123

4.7.2 Evaluation of frequent pattern mining methods on real-life datasets 128

4.7.2.1 On small-sized real-life dataset . . . . . . . . . . . . . . . . 128

4.7.2.2 On medium-sized real-life dataset . . . . . . . . . . . . . . 130

4.7.2.3 On large-sized real-life datasets . . . . . . . . . . . . . . . . 131

4.8 Discussion and summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

4.8.1 Algorithmic Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

4.8.2 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

4.9 Chapter Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

Chapter 5 XML Clustering 145

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

5.2 Hybrid Clustering of XML Documents (HCX) Methodology: An Overview . 146

5.2.1 Hybrid Clustering of XML documents using the Vector Space Model (HCX-V) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146


5.2.2 Hybrid Clustering of XML documents using the Tensor Space Model (HCX-T) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

5.3 Using the Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . . . . 148

5.3.1 Identifying the coverage of concise frequent subtrees . . . . . . . . . 149

5.3.2 Pre-processing of the structure-constrained content of XML documents . 153

5.3.3 Representation of the structure-constrained content in ICF . . . . . 155

5.3.4 Similarity measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

5.4 Using the Tensor Space Model (TSM) . . . . . . . . . . . . . . . . . . . . . 158

5.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

5.4.1.1 Tensor concepts . . . . . . . . . . . . . . . . . . . . . . . . 159

5.4.1.2 Tensor operations . . . . . . . . . . . . . . . . . . . . . . . 161

5.4.1.3 Tensor decomposition techniques . . . . . . . . . . . . . . . 163

5.4.2 Modelling in tensor space – An overview . . . . . . . . . . . . . . . . 165

5.4.3 Generation of structure features for TSM . . . . . . . . . . . . . . . 168

5.4.4 Generation of content features for TSM . . . . . . . . . . . . . . . . 169

5.4.5 The TSM representation, decomposition and clustering . . . . . . . 171

5.5 Empirical evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

5.5.1 Accuracy of clustering methods . . . . . . . . . . . . . . . . . . . . . 175

5.5.1.1 ACM dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 175

5.5.1.2 DBLP dataset . . . . . . . . . . . . . . . . . . . . . . . . . 179

5.5.1.3 INEX2007 dataset . . . . . . . . . . . . . . . . . . . . . . . 181

5.5.1.4 INEX IEEE dataset . . . . . . . . . . . . . . . . . . . . . . 183

5.5.1.5 INEX 2009 dataset . . . . . . . . . . . . . . . . . . . . . . 183

5.5.2 Sensitivity analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

5.5.3 Time complexity analysis . . . . . . . . . . . . . . . . . . . . . . . . 188

5.5.4 Scalability analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

5.6 Discussion and summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

Chapter 6 Conclusion 201

6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

6.2 Summary of contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

6.3 Summary of findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

6.4 Limitations and future extensions . . . . . . . . . . . . . . . . . . . . . . . . 205

6.5 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

Bibliography 209

Appendices 227

Appendix A Details of the real-life datasets 227

Appendix B Empirical Evaluation of Frequent Mining results 231


List of Figures

1.1 Using clustering in Information Retrieval . . . . . . . . . . . . . . . . . . . 4

1.2 A sample XML dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1 Classification of XML data . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.2 Sample DTD (conf.dtd) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.3 Sample XSD (conf.xsd) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4 Sample XML document (conf.xml) . . . . . . . . . . . . . . . . . . . . . . . 19

2.5 Classification of XML models . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.6 An example of (a) a dense representation; (b) a sparse representation of an XML dataset modelled in VSM using their feature frequency . . . . . . . . . 21

2.7 Sample XML fragments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.8 (a) A graph; (b) a labelled graph; (c) a directed graph . . . . . . . . . . . . 24

2.9 Graph representation of conf.dtd . . . . . . . . . . . . . . . . . . . . . . . . 25

2.10 Tree representation of the XML document given in Figure 2.4 . . . . . . . . 26

2.11 Paths derived from XML document model (in Figure 2.4): (a) a complete path; (b) a partial path; (c) a complete path with text node . . . . . . . . . 27

2.12 Hierarchy of XML frequent pattern mining . . . . . . . . . . . . . . . . . . 45

2.13 Example of a subtree from the sample XML dataset in Figure 1.2 . . . . . 52

2.14 (a) A tree; (b) an induced subtree; (c) an embedded subtree . . . . . . . . . 54

2.15 A sample tree using node labels as alphabets instead of the tag names . . . 54

2.16 (a) a document tree dataset; (b) frequent patterns and their projections in that dataset using the pattern-growth approach . . . . . . . . . . . . . . . . 55

3.1 Research design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.1 The pre-processing phase for structure of XML documents . . . . . . . . . . 93

4.2 (a) a document tree DTp; (b) Prefix trees of (a) . . . . . . . . . . . . . . . . 103

4.3 Algorithm for generating concise frequent subtrees . . . . . . . . . . . . . . 111

4.4 Function Fre for generating concise frequent subtrees . . . . . . . . . . . . 112

4.5 Classification of the proposed methods . . . . . . . . . . . . . . . . . . . . . 113

4.6 Runtime and number of subtrees comparison on F5 dataset . . . . . . . . . 124

4.7 Runtime and number of length constrained frequent concise subtrees comparison on F5 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

4.8 Runtime and number of subtrees comparison on the D10 dataset . . . . . . 127

4.9 Runtime and number of subtrees comparison on ACM dataset . . . . . . . . 129


4.10 Runtime and number of length constrained frequent concise subtrees comparison on ACM dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

4.11 Runtime and number of subtrees comparison on DBLP dataset . . . . . . . 130

4.12 Runtime and number of length constrained frequent concise subtrees comparison on DBLP dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

4.13 Runtime and number of subtrees comparison on INEX2007 dataset . . . . . 132

4.14 Runtime and number of length constrained frequent induced subtrees comparison on INEX2007 dataset at 20% . . . . . . . . . . . . . . . . . . . . . 133

4.15 Runtime and number of length constrained frequent induced subtrees comparison on INEX2007 dataset at 50% . . . . . . . . . . . . . . . . . . . . . 133

4.16 Runtime and number of length constrained frequent induced subtrees comparison on INEX IEEE dataset at 20% and 50% . . . . . . . . . . . . . . . 134

4.17 Runtime and number of length constrained frequent embedded subtrees comparison on INEX IEEE dataset at 50% . . . . . . . . . . . . . . . . . . 134

4.18 Runtime and number of length constrained frequent induced subtrees comparison on INEX 2009 dataset at 20% and 50% . . . . . . . . . . . . . . . . 135

4.19 Runtime and number of length constrained frequent embedded subtrees comparison on the INEX 2009 dataset at 50% . . . . . . . . . . . . . . . . . . . 136

4.20 Comparison of the runtimes vs number of frequent subtrees on ACM, DBLP and INEX 2007 datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

4.21 Comparison of the runtimes vs number of frequent subtrees on INEX IEEE and INEX 2009 datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

5.1 Hybrid Clustering of XML documents (HCX) methodology . . . . . . . . . 147

5.2 High level definition of HCX-V approach . . . . . . . . . . . . . . . . . . . . 149

5.3 Sparse representation of an XML dataset modelled in VSM using their term frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

5.4 Comparison of VSM and TSM: (a) sample XML document; (b) concise frequent subtrees; (c) Vector Space Model (VSM) for (a) and (b) using HCX-V; and (d) Tensor Space Model (TSM) for (a) and (b). . . . . . . . . . . . . . 158

5.5 Comparison of vector, matrix and tensor . . . . . . . . . . . . . . . . . . . 160

5.6 Fibers of a mode-3 tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

5.7 Slices of a mode-3 tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

5.8 Mode-1 matricization of a mode-3 tensor . . . . . . . . . . . . . . . . . . . . 162

5.9 mode-n matricization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

5.10 Visualisation of a mode-3 tensor for the XML document dataset . . . . . . 166

5.11 High level definition of HCX-T approach . . . . . . . . . . . . . . . . . . . . 167

5.12 Illustration of Random Indexing (RI) on a mode-3 tensor resulting in a randomly reduced tensor Tr. . . . . . . . . . . . . . . . . . . . . . . . . . . 172

5.13 Progressive Tensor Creation and Decomposition algorithm (PTCD) . . . . 173

5.14 Results of clustering on the ACM dataset using 5 categories . . . . . . . . . 176

5.15 Results of clustering on the ACM dataset using 2 categories . . . . . . . . . 177


5.16 Results of clustering on the ACM dataset using different types of subtrees for HCX-V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

5.17 Results of clustering on the ACM dataset using different types of subtrees for HCX-T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

5.18 Impact of RI on the quality of individual clusters on ACM dataset . . . . . 178

5.19 Results of clustering on the DBLP dataset . . . . . . . . . . . . . . . . . . . 179

5.20 Results of clustering on different types of concise frequent subtrees on the DBLP dataset using HCX-V . . . . . . . . . . . . . . . . . . . . . . . . . . 180

5.21 Results of clustering on different types of concise frequent subtrees on the DBLP dataset using HCX-T . . . . . . . . . . . . . . . . . . . . . . . . . . 180

5.22 Results of clustering methods using different types of concise frequent subtrees on the INEX 2007 dataset using HCX-V . . . . . . . . . . . . . . . . 182

5.23 A comparison of the NCCG values of the different clustering methods on the INEX 2009 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

5.24 A comparison of the number of clusters on the NCCG values . . . . . . . . 185

5.25 A comparison of the different clustering methods on the INEX 2009 dataset using cumulative recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

5.26 Cumulative gain for the topic id 2009005 . . . . . . . . . . . . . . . . . . . . 186

5.27 Cumulative gain for the topic id 2009043 . . . . . . . . . . . . . . . . . . . . 187

5.28 Sensitivity of length constraint on the micro- and macro-purity values for INEX IEEE dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

5.29 Scalability of HCX-T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

5.30 Scalability of PTCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

5.31 Scalability of the decomposition in PTCD . . . . . . . . . . . . . . . . . . . 189

5.32 Comparison of the proposed clustering methods over the state-of-the-art clustering methods on the large-sized datasets . . . . . . . . . . . . . . . . 191

5.33 Comparison of different types of concise frequent subtrees on clustering based on datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

5.34 Comparison of different types of concise frequent subtrees on clustering . . 192

5.35 Comparison of tensor decomposition algorithms . . . . . . . . . . . . . . . . 193

5.36 A comparison of the average of all metrics in the chosen real-life datasets . 195

5.37 A comparison of the number of terms in the chosen real-life datasets . . . . 195

5.38 A comparison of the weighting schemes - tf-idf and BM-25 . . . . . . . . . 198


List of Tables

2.1 VSM generated from the structure of XML document given in Figure 2.4 . 21

2.2 Transactional data model generated from the content of XML document given in Figure 2.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.3 Comparison of different types of clustering methods using structure of XML documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.4 Popular tensor decomposition algorithms based on CP and Tucker . . . . . 42

2.5 Classifications of frequent tree mining methods . . . . . . . . . . . . . . . . 52

3.1 Synthetic datasets and their parameters . . . . . . . . . . . . . . . . . . . . 69

3.2 Details of categories in the ACM dataset . . . . . . . . . . . . . . . . . . . . 70

3.3 Details of ACM and DBLP datasets . . . . . . . . . . . . . . . . . . . . . . 71

3.4 Details of categories in the DBLP dataset . . . . . . . . . . . . . . . . . . . 71

3.5 Details of categories in the INEX IEEE dataset . . . . . . . . . . . . . . . . 72

3.6 Details of categories in the INEX 2007 dataset . . . . . . . . . . . . . . . . 73

3.7 Details of large-sized datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 74

3.8 Details of the top-20 categories in the INEX 2009 dataset using Wikipedia categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

3.9 Details of the top-20 categories in the INEX 2009 dataset using ad hoc queries 75

3.10 Benchmarks for frequent pattern mining methods . . . . . . . . . . . . . . . 84

3.11 Benchmarks for clustering methods . . . . . . . . . . . . . . . . . . . . . . . 84

4.1 Document tree dataset example (DT ) . . . . . . . . . . . . . . . . . . . . . 95

4.2 Frequent induced subtrees generated from DT (in Table 4.1) using prefix-pattern growth approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

4.3 Closed Frequent Induced subtrees generated from DT (in Table 4.1) . . . . 97

4.4 Maximal Frequent Induced subtrees generated from DT (in Table 4.1) . . . 97

4.5 Length Constrained Closed Frequent Induced subtrees generated from DT (in Table 4.1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4.6 Length Constrained Maximal Frequent Induced subtrees generated from DT (in Table 4.1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

4.7 Frequent embedded subtrees generated from DT (in Table 4.1) using prefix-pattern growth methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

4.8 Closed Frequent Embedded (CFE) subtrees generated from DT (in Table 4.1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

4.9 Maximal Frequent Embedded (MFE) subtrees generated from DT (in Table 4.1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101


4.10 Length Constrained Closed Frequent Embedded (CFEConst) subtrees generated from DT (in Table 4.1) . . . . . . . . . . . . . . . . . . . . . . . . . 102

4.11 Length Constrained Maximal Frequent Embedded (MFEConst) subtrees generated from DT (in Table 4.1) . . . . . . . . . . . . . . . . . . . . . . . 102

4.12 < A1 − 1 > projected instances dataset . . . . . . . . . . . . . . . . . . . . 105

4.13 < B1 − 1 > projected instances dataset . . . . . . . . . . . . . . . . . . . . 105

4.14 < A1B2 − 1− 1 > projected instances dataset . . . . . . . . . . . . . . . . . 105

4.15 Runtime comparison of length constrained subtrees on the D10 dataset . . 126

4.16 Length constrained subtrees in the D10 dataset . . . . . . . . . . . . . . . . 127

4.17 Summary of frequent pattern mining results on synthetic datasets . . . . . 136

4.18 Summary of frequent pattern mining results on real-life datasets . . . . . . 137

5.1 Tensor notations and descriptions . . . . . . . . . . . . . . . . . . . . . . . . 159

5.2 Summary of the term size and tensor entries in INEX 2009 and INEX IEEE

datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

5.3 Impact of dimensionality reduction on the clustering results on the ACM

dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

5.4 Impact of dimensionality reduction on the clustering results on the DBLP

dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

5.5 Results of clustering on the INEX2007 dataset . . . . . . . . . . . . . . . . 181

5.6 Results of clustering on the INEX IEEE dataset using 18 categories . . . . 183

5.7 Results of clustering on the INEX 2009 dataset . . . . . . . . . . . . . . . . 184

5.8 Details of ad hoc queries with large categories . . . . . . . . . . . . . . . . . 185

5.9 Constraint lengths for the real-life datasets . . . . . . . . . . . . . . . . . . 197

A.1 Details of all the categories in INEX 2009 dataset using Wikipedia categories228

A.2 Details of all categories in INEX 2009 dataset using ad hoc queries ordered

by the topic Id . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

B.1 Runtime comparison of Length Constrained Subtrees on F5 dataset . . . . 231

B.2 Length Constrained Subtrees in F5 dataset . . . . . . . . . . . . . . . . . . 232


Glossary of terms and abbreviations

CFE Closed Frequent Embedded Subtrees.

CFI Closed Frequent Induced Subtrees.

Content in XML document The text between the start and end tags of an element in an XML document. For instance, in <Title>Data Mining</Title>, the content is the text “Data Mining” between the tags <Title> and </Title>.

Frequent Patterns Patterns which occur at least as often as a user-specified threshold (minimum support or min supp) in a given dataset.

Frequent Patterns Mining A data mining task focussing on the extraction of frequent patterns from given data.

HCX-V Hybrid Clustering of XML documents using Vector Space Model (VSM).

HCX-T Hybrid Clustering of XML documents using Tensor Space Model (TSM).

INEX INitiative for Evaluation of XML Retrieval.

IR Information Retrieval.

LSI Latent Semantic Indexing.

MFE Maximal Frequent Embedded Subtrees.

MFI Maximal Frequent Induced Subtrees.

PCA Principal Component Analysis.

PCETMiner Prefix-based Closed Embedded Tree Miner.

PCETMinerConst Length Constrained Prefix-based Closed Embedded Tree Miner.


PCITMiner Prefix-based Closed Induced Tree Miner.

PCITMinerConst Length Constrained Prefix-based Closed Induced Tree Miner.

PMETMiner Prefix-based Maximal Embedded Tree Miner.

PMETMinerConst Length Constrained Prefix-based Maximal Embedded Tree Miner.

PMITMiner Prefix-based Maximal Induced Tree Miner.

PMITMinerConst Length Constrained Prefix-based Maximal Induced Tree Miner.

RI Random Indexing.

Structures in XML document The element tags and their nesting dictate the struc-

ture of an XML document.

Subtree A tree contained within another tree; for example, the tree rooted at a node of another tree.

SVD Singular Value Decomposition.

TSM Tensor Space Model.

VSM Vector Space Model.

XML eXtensible Markup Language.

XML Frequent Patterns Mining Mining of XML documents for frequent patterns that are structure-oriented, content-oriented, or a combination of both.


Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements

for an award at this or any other higher education institution. To the best of my knowledge

and belief, the thesis contains no material previously published or written by another

person except where due reference is made and except for one of the evaluation measures

for clustering, collection selection evaluation, discussed in Section 3.5.3 in Chapter 3, which

was developed in collaboration with other volunteers in the clustering task in the INEX

forum. Also, the concept of cumulative recall plots for clustering discussed in subsection

5.5.1.5 in Section 5.5 in Chapter 5 was developed by Chris de Vries, a team member in

the INEX forum.

Signature:

Date:


Publications Derived from this Thesis

1. Kutty, S., R. Nayak, and Y. Li. XML documents clustering using tensor space

model, in proceedings of the 15th Pacific-Asia Conference on Knowledge Discovery

and Data Mining (PAKDD 2011), Shenzhen, China (to appear in 2011).

2. Kutty, S., T. Tran, and R. Nayak, A study of XML models for data mining: repre-

sentations, methods, and issues in XML data mining: models, methods, and appli-

cations, A. Tagarelli, Editor, Idea Group Inc., USA (to appear in 2011).

3. Kutty, S., R. Nayak, and Y. Li, Utilising semantic tags in XML clustering, in Focused

Retrieval and Evaluation, S. Geva, J. Kamps, and A. Trotman, Editors. 2010,

Springer Berlin / Heidelberg. p. 416-425.

4. Kutty, S., R. Nayak, and Y. Li. XML documents clustering using tensor space

model-A preliminary study, in proceedings of the IEEE ICDM 2010 Workshop on

Optimization Based Methods for Emerging Data Mining Problems (OEDM ’10).

2010, Sydney, Australia.

5. Kutty, S., R. Nayak, T. Tran, and Y. Li. Clustering XML documents using frequent

subtrees, in Advances in Focused Retrieval, S. Geva, J. Kamps, and A. Trotman,

Editors. 2009, Springer Berlin / Heidelberg. p. 436-445.

6. Kutty, S., R. Nayak, and Y. Li, XCFS: an XML documents clustering approach using

both the structure and the content, in proceedings of the 18th ACM conference on

Information and knowledge management. 2009, ACM: Hong Kong, China. p. 1729-

1732.

7. Kutty, S., R. Nayak, and Y. Li, HCX: an efficient hybrid clustering approach for XML

documents, in proceedings of the 9th ACM symposium on Document engineering.


2009, ACM: Munich, Germany. p. 94-97.

8. Kutty, S., R. Nayak, and Y. Li, XML data mining: process and applications, in

Handbook of Research on Text and Web Mining Technologies, M. Song and Y.-F.

Wu, Editors. 2008, Idea Group Inc., USA.

9. Kutty, S., R. Nayak, T. Tran, and Y. Li. Clustering XML documents using closed

frequent subtrees: A structural similarity approach, in Advances in Focused Re-

trieval, N. Fuhr, J. Kamps, M. Lalmas and A. Trotman, Editors. 2008, Springer

Berlin / Heidelberg. p. 183-194.

10. Kutty, S., R. Nayak, and Y. Li, PCITMiner: prefix-based closed induced tree miner

for finding closed induced frequent subtrees, in proceedings of the sixth Australasian

conference on Data mining and analytics - Volume 70. 2007, Australian Computer

Society, Inc.: Gold Coast, Australia. p. 151-160.


Acknowledgements

I would like to express my sincere gratitude and deep appreciation to my principal

supervisor, Dr. Richi Nayak, for her continuous guidance, encouragement, and support

throughout this research. The achievements in this thesis would not have been possible without

her supervision. I also thank Prof. Yuefeng Li, my associate supervisor, for his valuable

suggestions concerning my research and for helping me with my publications. I am thankful

to the QUT faculty-based award (QUTFBA) for funding me and my research.

Thanks to Dr. Lei Zhou for providing the PrefixTreeISpan and PrefixTreeESpan for

benchmarking purposes. Also, a special thanks to Prof. Mohammed Javeed Zaki for the

TreeMinerV.

My thanks also go to the volunteers of the INEX forum for the availability of the

INEX datasets and the evaluation methods that have been used in this thesis. My special

thanks to QUT’s High Performance Computing (HPC) team for providing the facilities to

conduct experiments on HPC systems.

I am grateful to my husband, Mr. Anand Kutty and my children Preeti Kutty and

Deepti Kutty for their love, understanding, support and patience. I am indebted to

my parents-in-law, Dr. Kutty Venkatesan and Dr. Suguna Venkatesan for their constant

support and advice during difficult times. My special thanks to my beloved parents,

Mr. R. K. Srinivasan and Mrs. R. Udayakumari for their love, encouragement and support

throughout my studies.

Further thanks go to my colleagues in the Faculty of Science and Technology, Mrs. Dinesha Weragama, Mr. Reza Hassanzadeh, Mrs. Esther Ge, Mr. Rakesh Rawat, Mr. Daniel

Emerson, Mr. Aishwarya Bose and Mr. Paul de Braak for creating a friendly environment

to share our knowledge. My special thanks to Mrs. Tien Tran for her suggestions and


co-operation throughout this research.

Thank you everyone for providing me with the opportunity to do this research.


Chapter 1

Introduction

With increasingly distributed intranets and the massive growth of the Internet, XML (eXtensible Markup Language) has become a ubiquitous standard for information representation and exchange on both intranets and the Internet [120]. Due to the simplicity

and flexibility of XML, a diverse range of applications, from scientific literature and technical documents to news summaries [104], utilise XML for information

representation and exchange. More than 50 domain-specific languages have been devel-

oped based on XML [25], such as MovieXML for encoding movie scripts, GraphML for

exchanging graph structured data, Geography Markup Language (GML) for expressing ge-

ographical features and interchanging them over the Internet, Twitter Markup Language

(TML) for structuring Twitter streams, Chemical Markup Language, Mathematical Markup Language (MathML) [101] and many others. XML has also been used to represent

the web-based free-content encyclopedia known as Wikipedia, which has more than 3.4

million XML documents.

The increased popularity of XML has raised many issues regarding how to effectively manage XML data and retrieve XML documents from large collections. A possible solution to the problem of handling large XML collections is to


group similar XML documents. In data mining, this task of grouping is referred to as clustering. Clustering groups unlabelled data into smaller groups according to the commonality of the data, without any prior knowledge about the dataset. The clustering of similar XML documents has been perceived as one of the more effective solutions to improve document handling by facilitating better information retrieval, data indexing, data integration and query processing [109].

In spite of its potential, there are several challenges in clustering XML documents. Unlike the clustering of text documents or flat data, the clustering of XML documents is an intricate process [69]; consequently, the clustering methods commonly used for text cannot be applied directly to these documents. This is because XML documents are semi-structured in nature: they have a flexible structure, and their content conveys the semantics. The semi-structured nature of XML data requires that the computation of similarity includes structural similarity. However, the inclusion of structure increases the dimensionality that the clustering method needs to handle.

XML gained its popularity because of its structure and its inherent flexibility in representing content. However, most XML clustering methods adopt a naïve approach, utilising either only the content features while ignoring the structure features, or only the structure features while ignoring the content. Nevertheless, these single-feature methods have a significant cost associated with them, since valuable information embedded in the documents is potentially lost. Hence, they tend to falsely group documents that are similar in only one of the two features. To correctly identify similarity among documents, the clustering process should use both their structure and their content information.

This research focuses on finding whether combining the structure and the content of

XML documents improves the accuracy of the clustering results. With the explosion in


the number of XML documents, clustering just the content of XML documents is itself expensive; combining the structure of XML documents with the content adds further complexity. Hence, it is essential to have an effective and efficient pre-processing technique to reduce the dimensionality of XML documents for such a combination. This research identifies ways of utilising frequent patterns to reduce the dimensionality and to combine both the structure and the content of XML documents for use in clustering. The structure and the content of the XML documents can be combined either implicitly or explicitly. These combinations not only reduce the dimensionality of the terms but also improve the quality of the clustering solution over varied types of datasets.

This dissertation will explore two main areas. Firstly, it looks into frequent pattern

mining to generate concise frequent patterns for the efficient pre-processing of XML doc-

uments for clustering. Secondly, it looks into the clustering of XML documents by combining the structure and the content in two ways, implicit and explicit, employing the concise frequent patterns generated in the first step. The methods proposed in

both frequent pattern mining and clustering are evaluated against other state-of-the-art

methods using a number of datasets showing diverse characteristics.

1.1 Motivation

With the growing number of XML documents on the Internet and organisational intranets, it becomes inevitable to organise these XML documents effectively in order to retrieve useful information from them. Without such an effective organisation, every query requires a search of the entire collection of XML documents. This search is not only computationally expensive but can also result in poor response times even for simple queries.


In order to effectively manage the XML documents collection, it is indispensable to

apply clustering methods on these documents to group them based on their similarity.

Figure 1.1 shows an information retrieval scenario using the clustering of XML documents.

In this scenario, a user has an information need and hence makes a request using a query

to the Information Retrieval (IR) system. Instead of searching the entire collection, the

efficiency and the precision of the search engine can be improved if the retrieval system

searches only a subset of the collection in the form of clusters of documents.

[Figure: the user's information need is expressed as a query to the IR system, which retrieves an answer list from clusters (Cluster 1, Cluster 2, ..., Cluster N) of the XML documents collection rather than from the whole collection.]

Figure 1.1: Using clustering in Information Retrieval

In addition to applications in IR, clustering can also be applied to discover knowledge

for effective data/schema management, web mining and query processing. In spite of

the benefits of using XML, the clustering of XML documents is not as trivial a process as the clustering of text documents. Instead, many challenges arise in the clustering of XML documents due to the nature of these documents. They are:

• The presence of two main features. Unlike text documents, which are unstructured, XML documents are semi-structured in nature and contain two features – Structure and Content. The structure of an XML document is used to store its content; hence the clustering method should not be applied to one feature alone but instead to both of these features.


• A hierarchical relationship. The structure of XML documents maintains a hierar-

chical relationship among its elements. Hence, this relationship should be preserved

while clustering.

• User-defined tags. XML allows users to create their own tags. This flexibility

in design results in polysemy problems. The same tag name can convey different

meanings based on the context in different XML documents. For example, the tag

“bank” can mean “a financial institution”, “a river bank” or, as a verb, “to rely upon”.

The example shown in Figure 1.2 reveals the importance of using both the structure

and content features for XML clustering. Figure 1.2 shows the fragments of six XML

documents from the publishing domain: the XML fragments shown in (a), (b), (c) and

(e) share the same structure and the fragments in (d) and (f) share a similar structure.

It can be noted in Figure 1.2 that although the fragments in (a) and (b) have a

similar structure to fragments in (c) and (e), these two sets of fragments differ in their

content. Utilising a clustering method based only on the structure similarity of XML

documents will result in two clusters about “Books” and “Conference Articles”. However,

this fails to further distinguish the documents in the “Books” cluster based on the subjects

“Biology” and “Data Mining”, resulting in meaningless clusters. On the other hand,

utilising a clustering method based only on content similarity provides clusters based only

on the subject and not on the type of publication and hence fails to distinguish between

“Books” and “Conference Articles”. In order to derive meaningful clusters, these fragments

should be analysed in terms of both their structure and content similarity. Clustering the

XML documents by considering the structure and content features together will result

in three clusters, namely “Books on Data Mining (DM)”, “Books on Biology (Bio)” and

“Conference articles on Data Mining” having (a) and (b), (c) and (e), and (d) and (f)

fragments respectively. These kinds of meaningful clusters could be used for the effective


[Figure: six XML tree fragments, labelled (a)–(f), from the publishing domain. The Book fragments contain Title, Author/Name and Publisher/Name nodes (e.g. “On the Origin of Species” / Charles Darwin / John Murray; “Data Mining: Practical Machine Learning Tools and Techniques” / Eibe Frank; “Classification of Plants Species” / John Brown / Cambridge Press; “Data Mining concepts and Techniques” / Micheline Kamber / Morgan Kaufmann). The Conference fragments contain ConfTitle, ConfAuthor, ConfName and ConfLoc or ConfYear nodes (e.g. “Survey of Clustering Techniques” / John Smith / ICDM / LA; “An exploratory study on Frequent Pattern mining” / Michael Bonchi / AusDM / 2007).]

Figure 1.2: A sample XML dataset

storage and retrieval of XML documents.

The clustering task on XML documents involves grouping the XML documents with-

out any prior knowledge according to the structure and content similarities among them.

Clustering methods utilising only the structure features of the documents cannot accu-

rately group the documents that are similar in structure but diverse in content. On the


other hand, clustering methods utilising only the content features of the documents con-

sider the documents as a “bag of words” and ignore the structure features [73]. The

disadvantage of these types of methods is that when there are two documents that are

similar in content but different in structure, these may be falsely grouped as belonging

to the same group. Therefore, this thesis proposes to develop clustering methods for XML

documents by considering both the structure and the content features.
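To make the two features concrete, the following sketch (my own illustration, not code from this thesis) extracts a simple structure feature (root-to-leaf tag paths) and the content feature (element text) from one of the book fragments in Figure 1.2; the `tag_paths` helper is hypothetical:

```python
# Illustrative only: separate the structure feature (tag paths) from the
# content feature (text) of a small XML fragment.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<Book><Title>Data Mining: Practical Machine Learning Tools and Techniques"
    "</Title><Author><Name>Eibe Frank</Name></Author></Book>"
)

def tag_paths(elem, prefix=""):
    """Yield (path, text) for every element, in document order."""
    path = prefix + "/" + elem.tag
    yield path, (elem.text or "").strip()
    for child in elem:
        yield from tag_paths(child, path)

structure = [p for p, _ in tag_paths(doc)]      # the structure feature
content = [t for _, t in tag_paths(doc) if t]   # the content feature

print(structure)
# ['/Book', '/Book/Title', '/Book/Author', '/Book/Author/Name']
print(content)
# ['Data Mining: Practical Machine Learning Tools and Techniques', 'Eibe Frank']
```

A structure-only method sees just the first list and a content-only method just the second; the methods developed in this thesis draw on both.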

Often the XML documents are represented in the Vector Space Model (VSM) to be

processed for clustering [94]. VSM is a model for representing documents as vectors of

identifiers. The inclusion of both structure and content features in VSM results in very

high dimensionality for the input matrix. The application of clustering on this matrix

becomes an expensive task in terms of memory consumption and computational time for

very large datasets. To mitigate this problem, it is vital to reduce the dimensions of

the input data and create a suitable data model without compromising the accuracy of

clustering results.
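The dimensionality issue can be sketched minimally (toy features and binary weights, not the thesis's actual weighting scheme): when structure identifiers and content terms share one VSM vocabulary, the matrix width is the sum of both feature spaces.

```python
import numpy as np

# Toy documents: each contributes structure features (tag paths) and content terms.
docs = [
    {"/Book/Title", "/Book/Author/Name", "data", "mining", "frank"},
    {"/Conference/ConfTitle", "/Conference/ConfAuthor", "clustering", "survey", "smith"},
]

vocab = sorted(set().union(*docs))          # combined structure + content vocabulary
index = {feature: i for i, feature in enumerate(vocab)}

matrix = np.zeros((len(docs), len(vocab)))  # the (documents x features) VSM matrix
for row, features in enumerate(docs):
    for feature in features:
        matrix[row, index[feature]] = 1.0   # binary weights for simplicity

print(matrix.shape)   # (2, 10): width = |structure features| + |content terms|
```

With two features per real-life document collection numbering in the tens of thousands, the combined width quickly becomes the bottleneck that the pre-processing step must address.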

This research proposes a method of utilising frequent patterns to address the dimen-

sionality explosion caused by the combination. These frequent patterns generated using

frequent pattern mining methods are used to reduce the size of the input data matrix.

These patterns also aid in the creation of an effective data model for clustering the XML

documents by capturing the relationship between their structure and content.

Further, this research uses the Tensor Space Model (TSM), a higher dimensional model,

for modelling the XML documents by directly capturing their structure and content rela-

tionships. It also proposes scalable tensor decomposition techniques to effectively analyse

the relationships between the structure and content of the XML documents and to aid in

clustering these documents. Finally, this research evaluates the output of the clustering

process, the clusters of XML documents, for information retrieval.
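As a rough sketch of the TSM idea (toy sizes and entries; the actual model and decomposition techniques are introduced in Chapter 5), a third-order tensor keeps the document–structure–term associations explicit, whereas flattening loses which term occurred under which structure feature:

```python
import numpy as np

# T[d, s, t]: how often term t occurs under structure feature s in document d.
n_docs, n_structs, n_terms = 3, 4, 5
T = np.zeros((n_docs, n_structs, n_terms))
T[0, 1, 2] = 3.0   # e.g. term 2 occurs three times under structure 1 in doc 0
T[2, 0, 4] = 1.0

# Mode-1 matricization unfolds the tensor into a (docs x structs*terms) matrix,
# the form on which matrix-based decomposition steps can then operate.
unfolded = T.reshape(n_docs, n_structs * n_terms)

print(T.shape, "->", unfolded.shape)   # (3, 4, 5) -> (3, 20)
```

Even in this toy case the unfolded width is the product, not the sum, of the mode sizes, which is why scalable decomposition of large, dense tensors matters.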


1.2 Research questions

The following research questions have been examined in this research:

• How to cluster XML documents effectively?

– How can the structure and content of XML documents be combined for im-

proving the accuracy of a clustering solution?

– How does a clustering method using both of these features perform on real-life

datasets?

– How does the clustering method using both of these features compare with

clustering methods using a single-feature focus?

– How to handle the high dimensionality resulting from the combination? Do

the dimensionality reduction techniques incur an information loss?

• How to utilise frequent patterns to control the dimensionality of the combination of

structure and content features?

– Are the state-of-the-art frequent pattern mining methods scalable for large-

sized real-life XML datasets? If not, how to improve the efficiency and the

effectiveness of these methods?

1.3 Research aims & objectives

The objective of this research is to provide methods for more effective and meaningful

grouping of XML documents. To achieve this objective, the concept of clustering is utilised

to group similar XML documents based on the common structure and content that they

share. The high dimensionality due to the combination of these two features is reduced

by employing the XML frequent pattern mining methods proposed in this research.


The proposed research can be broken down into two separate tasks, namely:

• XML frequent pattern mining: Develop frequent pattern mining methods to mine for concise representations of frequent subtrees from the structure of XML documents represented as trees. Generate different types of subtrees with a parent-child

relationship or with an ancestor-descendant relationship to identify hidden similar-

ities based on the relationships. Also, analyse the effectiveness of these subtrees in

capturing the structural commonalities for use in clustering.

• XML clustering: Develop hybrid clustering methods to group the documents

based on the content corresponding to the concise frequent structures in each doc-

ument. Explore different forms of representation, implicit and explicit, using the Vector Space Model and higher-order models such as the Tensor Space Model respectively. Also, analyse the suitability of the different models for clustering various types of XML document collections.
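The parent-child versus ancestor-descendant distinction in the first task can be illustrated with a toy tree (my own encoding, not the representation used in this thesis): induced subtrees preserve parent-child edges, while embedded subtrees only require that ancestor-descendant relationships hold.

```python
# Toy tree encoded as child -> parent; None marks the root.
parent = {"Book": None, "Title": "Book", "Author": "Book", "Name": "Author"}

def is_ancestor(a, b):
    """True if node a is a proper ancestor of node b in the tree."""
    node = parent[b]
    while node is not None:
        if node == a:
            return True
        node = parent[node]
    return False

# Book -> Name is an ancestor-descendant edge, so a subtree containing that
# edge can be embedded; it is not a parent-child edge, so it is not induced.
print(parent["Name"] == "Book")      # False: not a parent-child edge
print(is_ancestor("Book", "Name"))   # True: an ancestor-descendant edge
```

Embedded subtrees therefore subsume induced ones and can reveal hidden similarities that parent-child matching alone would miss.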

1.4 Research significance and contributions

As XML has become a popular standard for data exchange on the Internet, it is essential to store XML documents effectively to facilitate their easy management. Hence, this research contributes to the existing body of literature by providing more accurate grouping using clustering methods that are applicable to real-life datasets. By using not only the structure but also the content similarity among the XML documents, the accuracy of the clustering solution can be improved.

This research makes important contributions to clustering by proposing two novel

approaches of non-linearly combining the structure and the content of XML documents

using implicit and explicit combinations. The proposed clustering methods are extensively


evaluated on real-life datasets to analyse their performance on these datasets and to un-

derstand the suitability of the proposed clustering methods in practice. The results were

also evaluated on the collection selection problem in information retrieval which is based

on query results from manual assessors.

Moreover, in contrast to the previous research in XML clustering, this research con-

tributes by proposing a novel way of using a high dimensional data model, the Tensor

Space Model (TSM) [63], to explicitly capture non-linearly both the structure and the

content of XML documents. It also provides an incremental decomposition technique to decompose large, dense tensors efficiently and effectively.

This research makes a vital contribution by efficiently reducing the dimensionality of

the dataset for clustering by utilising only the frequent patterns in the XML documents

as well as their corresponding content. This research also attempts to bridge the gaps

in frequent pattern mining by proposing various types of concise frequent pattern mining

methods to generate frequent subtrees based on the node relationship and conciseness.

This research also evaluates the proposed frequent pattern mining methods against several

state-of-the-art methods to show their strengths and weaknesses in both synthetic and

real-life datasets. Additionally, the effectiveness of the proposed frequent pattern mining

methods was evaluated on clustering.

By converging two parallel fields in data mining, frequent pattern mining and clus-

tering, this research has enriched the knowledge as well as bridged the gaps in these two

fields.


1.5 Thesis overview

This thesis is designed to explore the use of combining the structure and the content

features in XML documents for clustering. The study will primarily focus on developing

clustering methods for XML documents to identify interesting knowledge effectively. The

secondary focus of this thesis is to develop frequent pattern mining methods for reducing

the dimensionality of the input matrix for clustering. Furthermore, this thesis attempts

to evaluate the possibility of enhancing information retrieval by the use of data mining

outputs.

The remainder of the thesis is organised as follows:

• Chapter 2 reviews recent developments in the main topics in this research, in both clustering and frequent pattern mining. It also covers XML and the data models used for XML mining. This chapter provides an insight into the state-of-the-art methods in both clustering and frequent pattern mining, helping to identify their weaknesses and the research gaps in both of these tasks.

• Chapter 3 describes the research design used in this thesis including the experimen-

tal design, detailed description of the datasets and the various evaluation metrics

used to evaluate the frequent pattern mining and clustering methods. It also dis-

cusses the benchmarks used to evaluate the proposed methods.

• Chapter 4 covers the proposed frequent pattern mining methods for the purpose

of clustering. It includes details about the prefix-based pattern growth approach

of generating the different types of concise frequent subtrees. While doing so, it

details the various techniques for efficiently mining the concise frequent subtrees. It

also proposes new types of concise subtrees suitable for mining very large and dense

datasets. Finally, it presents the empirical evaluation on both synthetic and real-life


datasets and analyses the results.

• Chapter 5 presents the proposed methods for combining the structure and the

content of the XML documents. It begins with the proposal of the hybrid clustering

methodology called the Hybrid Clustering of XML documents (HCX) for non-linearly

combining the structure and the content features in XML documents. The Hybrid

Clustering of XML documents using VSM (HCX-V) uses the implicit combination in

this methodology; the Hybrid Clustering of XML documents using the Tensor Space

Model (TSM) (HCX-T) uses the explicit combination. The proposed clustering

methods are evaluated using the metrics defined in Chapter 3. The empirical results

and the analysis of the experiments, comparing the proposed methods against the benchmarks on the various datasets, are also covered in this chapter.

• Chapter 6 presents the final conclusions, summarises the findings, and lists the

main contributions of the work developed in this thesis. A few research extensions

from this work are also identified.


Chapter 2

Background and Literature Review

The main focus of this chapter is twofold: to provide the background knowledge and

to present a critical review of the related work relevant to the two pattern mining tasks,

namely frequent pattern mining and clustering on XML data. This chapter begins with

an introduction about XML data to provide an overview of the data domain considered

in this research. Section 2.1 begins by introducing the concept of XML data and explaining how it differs from its counterparts, such as text and unstructured data, with regard to data mining. Section 2.2 describes the various data models that have been used for modelling XML for mining. Several data models exist, such as the Vector Space Model (VSM), paths, trees and graphs, and each of them is discussed in detail in this

section.

Further, Section 2.3 analyses the various XML clustering methods to date according to the data models, the similarity measures and the methods that are used for clustering. Finally, the related works pertaining to the pre-processing step for clustering, that is, frequent pattern mining, are covered in detail. This chapter also provides a review

of the literature related to the various frequent pattern mining methods based on the

different data models and frequent pattern generation techniques. This chapter concludes


by presenting the limitations of the related works, thereby identifying the research gaps that need to be addressed in this research.

2.1 XML

The eXtensible Markup Language (XML) is a markup language defined by the World

Wide Web Consortium (W3C) for improved data representation and exchange over its

predecessors. XML is a simplified form of the Standard Generalized Markup Language

(SGML) [1, 98, 62]. SGML is a notation that has been widely used for many years for professional document preparation [62]. However, because it attempted to provide many features and flexibilities, SGML was found to be cumbersome and difficult to learn. One example of SGML’s flexibility is that it allowed end tags to be omitted depending on the context. These disadvantages resulted in the development of XML, which is simpler and less flexible than SGML.

XML differs from the popular HyperText Markup Language (HTML). XML is used

to describe the content, while HTML is used to describe the format and display of the

same content. Apart from this, XML allows for user-defined tags and hence has a much

more flexible structure than HTML, which uses only pre-defined tags. Using these user-defined tags, XML can specify not only the data but also the structure of the data. These tags also support nesting, showing how the various elements present in XML data are contained within other elements. For this reason, XML data are often referred to as self-describing. Also, XML data can be represented in a common data format, which helps

the processing and displaying of these data in an application and platform independent

way. Sol [99] has highlighted four major benefits of using the XML language:

• XML separates data from presentation, which means that making changes to the display


of data does not affect the XML data;

• Searching for data in XML documents becomes easier, as search engines can parse

the description-bearing tags of the XML documents;

• An XML tag is human readable; even a person with no knowledge of the XML language

can still read an XML document; and

• Complex structures and relations of data can be encoded using XML.

There are two types of XML data, the XML schema definition and the XML document, as shown in Figure 2.1. An XML schema definition contains the structure and data definitions

of XML documents [2]. An XML document, on the other hand, is an instance of the XML

schema that contains the data content represented in a structured format.

[Figure: XML Data divides into XML Document and XML Schema; an XML document has Structure and Content and may be ill-formed, well-formed or valid; an XML schema is a DTD or an XSD]

Figure 2.1: Classification of XML data

The provision of an XML schema definition with XML documents makes them different from other types of semi-structured data such as HTML and BibTeX. The schema

imposes restrictions on the syntax and structure of XML documents. The two most

popular XML document schema languages are Document Type Definition (DTD) and

XML-Schema Definition (XSD). XSD, an enhancement of DTD, consists of the following

features:


• Extensibility for future additions;

• Greater richness and usefulness;

• Ease of learning, as XSD is written in XML;

• Support for a wider range of data types; and

• New features such as namespace support.

Figures 2.2 and 2.3 show DTD and XSD examples respectively. An example of a simple XML document conforming to the schemas from Figures 2.2 and 2.3 is shown in Figure 2.4. An XML document can be either ill-formed, well-formed, or valid, according to how it abides by the XML schema

definition. An ill-formed document does not have a fixed structure: it does not conform to the XML syntax rules, for example by lacking an XML declaration statement or by containing more than one root element. A well-formed document conforms to the

XML syntax rules and may have a document schema but the document does not conform

to it. It contains exactly one root element, and sub-elements are properly nested within

each other. Finally, a valid XML document is a well-formed document which conforms to

a specified XML schema definition [100].
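The well-formedness rules above can be checked mechanically. The sketch below, assuming Python's standard `xml.etree.ElementTree` module, accepts well-formed input and rejects ill-formed input; checking *validity* against a DTD or XSD would additionally require a validating parser (for example `lxml`) and is not shown here.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """True if the text parses as well-formed XML: one root, properly nested tags."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

well = "<conf><title>SIAM Data Mining Conference</title></conf>"  # properly nested
ill = "<conf><title>SIAM</conf></title>"                          # tags close out of order
```

Passing `ill` to `is_well_formed` returns `False` because the end tags close in the wrong order; a fragment with two root elements would be rejected for the same reason.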

Each XML document can be divided into two parts – markup constructs and content.

A markup construct consists of the characters that are enclosed between “<” and “>”.

The content is the set of characters that is not included in the markup. There are two types

of markup constructs, tags (or elements) and attributes. Tags are the markup constructs

which begin with a start tag (e.g. “<title>”) and end with a matching end tag (e.g. “</title>”), such as conf, title, year

and editor for the conf.xml document in Figure 2.4. On the other hand, the attributes

are markup constructs consisting of a name/value pair that exist within a start-tag. In


<!ELEMENT conf (id, title, year, editor?, paper*)>
<!ATTLIST conf id ID #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT editor (person*)>
<!ELEMENT paper (title, author, references?)>
<!ELEMENT author (person*)>
<!ELEMENT person (name, email)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT references (paper*)>

Figure 2.2: Sample DTD (conf.dtd)

the running example, id is the attribute and SIAM10 is its value. Examples of content

are “SIAM Data Mining Conference”, “2010” and “Bing Liu”.
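The separation between markup constructs and content can be illustrated programmatically. The following sketch, assuming Python's standard `xml.etree.ElementTree` and using a trimmed version of the conf.xml fragment, extracts the element names, the `id` attribute and the character data separately.

```python
import xml.etree.ElementTree as ET

# A trimmed version of conf.xml from Figure 2.4.
doc = """<conf id="SIAM10">
  <title>SIAM Data Mining Conference</title>
  <year>2010</year>
</conf>"""

root = ET.fromstring(doc)
tags = [elem.tag for elem in root.iter()]        # markup constructs: element names
attrs = root.attrib                              # markup constructs: attribute name/value pairs
content = [child.text.strip() for child in root if child.text]  # character data outside markup
```

Here `tags` yields `conf`, `title` and `year`, `attrs` holds the `id`/`SIAM10` pair, and `content` holds the two text values, mirroring the tag/attribute/content distinction described above.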

Due to the widespread use of XML documents for various applications such as data

transformation, integration and retrieval, there has been a great deal of interest in obtaining useful information from large collections of XML documents by mining the data [81]. Mining on

XML data can be broadly classified into four major categories namely XML classification

[136], XML clustering [86, 104, 136], XML frequent pattern mining [54, 107, 122, 138] and

XML association rules mining (or link analysis) [17, 116]. Among these, XML clustering

and XML frequent pattern mining tasks have been more popular amongst researchers due

to their usability in varied application domains. Also, XML frequent pattern mining is one

of the first task in generating association rules. In the following sections, the literature on

these two popular data mining tasks, frequent pattern mining and clustering, which are

also the focus of this thesis will be reviewed.

Before going into the details of the mining tasks, it is essential to review the literature

on data models to understand how XML documents could be effectively represented for

mining. The following section reviews the various data models that have been used for

mining XML documents.


<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.conferences.org"
            xmlns="http://www.conferences.org"
            elementFormDefault="qualified">

  <xsd:element name="conf">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="id" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="title" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="year" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="editor" minOccurs="0" maxOccurs="unbounded"/>
        <xsd:element ref="paper" minOccurs="1" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="editor">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="person" minOccurs="1" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="paper">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="title" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="author" minOccurs="1" maxOccurs="unbounded"/>
        <xsd:element ref="references" minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="author">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="person" minOccurs="1" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="person">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="name" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="email" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="references">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="paper" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="id" type="xsd:string"/>
  <xsd:element name="title" type="xsd:string"/>
  <xsd:element name="name" type="xsd:string"/>
  <xsd:element name="email" type="xsd:string"/>
</xsd:schema>

Figure 2.3: Sample XSD (conf.xsd)


<?xml version="1.0"?>
<!DOCTYPE conf SYSTEM "conf.dtd">
<conf id="SIAM10">
  <title>SIAM Data Mining Conference</title>
  <year>2010</year>
  <editor>
    <person>
      <name>Bing Liu</name>
      <email>[email protected]</email>
    </person>
  </editor>
  <paper>
    <title>MACH: Fast Randomized Tensor Decompositions</title>
    <author>
      <person>
        <name>Charalampos E. Tsourakakis</name>
        <email>[email protected]</email>
      </person>
    </author>
    <references>
      <paper>
        <title>Unsupervised multiway data analysis: A literature survey</title>
        <author>
          <person>
            <name>Acar E</name>
            <email>[email protected]</email>
          </person>
        </author>
        <author>
          <person>
            <name>Yener B</name>
            <email>[email protected]</email>
          </person>
        </author>
      </paper>
    </references>
  </paper>
</conf>

Figure 2.4: Sample XML document (conf.xml)


2.2 Data models for XML mining

To suit the objectives and the needs of XML mining methods, XML data has been represented in various models. The XML data models can be classified into four major categories, namely the Vector Space Model, graph, tree and path models, as illustrated in Figure 2.5.

[Figure: XML Data splits into XML Document and XML Schema; the XML document's Content is modelled with the Vector Space Model and its Structure with tree, path and graph models; the XML schema is likewise modelled with tree, path and graph models]

Figure 2.5: Classification of XML models

2.2.1 Vector Space Model (VSM)

The Vector Space Model (VSM) was initially proposed as a model for representing text

documents or objects as vectors of features. It has been used widely in both frequent

pattern mining and clustering for modelling XML documents [17, 116, 36, 114]. When

using the VSM model for XML mining, the feature of the document structure is its sub-

structure, which can be a tag, subpath, subtree, or subgraph. The feature of the document

content is its term, which is a pre-processed word.

In VSM, each of the documents is represented as a feature vector containing either

binary (0 or 1), frequency or weighted values. There are two ways of representing the

XML document in VSM – dense and sparse. In the dense VSM representation of an XML document collection, each document vector contains an entry for every feature in the collection. If a document does not contain a feature, then its vector entry for that feature has a


(a) Dense:
     f1  f2  f3  f4  f5  f6
d1    1   1   0   2   6   0
d2    0   0   0   2   0   0
d3    0   0   3   0   0   1

(b) Sparse:
d1   (f1,1) (f2,1) (f4,2) (f5,6)
d2   (f4,2)
d3   (f3,3) (f6,1)

Figure 2.6: An example of (a) a dense representation; (b) a sparse representation of an XML dataset modelled in VSM using their feature frequency

value of 0. The sparse VSM representation retains only the non-zero values along with

the feature (tag or term) id. This improves the computational efficiency, especially for sparse datasets, where the number of non-zeroes is small compared to the number of zeroes.

Figure 2.6 gives examples of a dense and a sparse representation using only the frequency

of the feature. It shows that the size of the sparse representation is smaller than the dense

representation. Mining using the sparse representation is more efficient when zeroes dominate; however, if non-zeroes dominate, the sparse representation could incur additional overhead, as the feature indices are also stored.
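The dense and sparse representations can be sketched as below (a minimal Python illustration; the feature ids and frequencies follow Figure 2.6, with d2 stored as the pair (f4, 2) so that it matches the dense matrix):

```python
# Feature frequencies per document, following Figure 2.6.
docs = {
    "d1": {1: 1, 2: 1, 4: 2, 5: 6},
    "d2": {4: 2},
    "d3": {3: 3, 6: 1},
}
N_FEATURES = 6

def dense(vec):
    """One entry per feature in the collection; absent features become 0."""
    return [vec.get(f, 0) for f in range(1, N_FEATURES + 1)]

def sparse(vec):
    """Only the non-zero (feature id, value) pairs are kept."""
    return sorted(vec.items())
```

For d2, `dense` stores six values while `sparse` stores a single pair, which is why the sparse form wins when zeroes dominate and loses (index overhead) when they do not.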

Table 2.1 shows a sparse VSM using the tags of the sample XML document given in

Figure 2.4. It can be seen that this model is not only simple but also enables an easy

representation of data and allows simpler management and analysis of the data. However,

a serious problem arises due to the nature of this model when it is used in mining, since

it does not consider the hierarchical relationship between the tags of the XML document.

Consequently, mining methods using this model for XML data representation can provide

inaccurate results.

Table 2.1: VSM generated from the structure of XML document given in Figure 2.4

Document Id Tags

1    conf 1, title 3, year 1, editor 1, person 4, name 2, email 4, paper 2, author 3, reference 1

Consider the example fragments from the sample document given in Figure 2.7,

the fragment <name>“Bing Liu”</name> in Figure 2.7 (a) and the fragment <name>


“Charalampos E. Tsourakakis” </name> in (b). Both the fragments contain the same

tag set; however, the former fragment refers to the editor’s name and the latter to the author’s name.

(a)
<editor>
  <person>
    <name>Bing Liu</name>
    <email>[email protected]</email>
  </person>
</editor>

(b)
<author>
  <person>
    <name>Charalampos E. Tsourakakis</name>
    <email>[email protected]</email>
  </person>
</author>

Figure 2.7: Sample XML fragments

When frequent pattern mining is applied on the XML tags, <name></name> will be output as a frequent structure. In reality, the tag <name></name> is not a

frequent structure as its parents are different. Hence, to avoid inaccurate results, it is

essential to consider the hierarchical relationships among the data items, not just among

the tag names.

VSM models the content of the XML document similarly to the tag representation. Some of the common pre-processing techniques, such as stop-word removal and stemming [89, 91],

are applied on the content to identify the unique words to be represented in VSM. An

example is shown in Table 2.2 using the XML document in Figure 2.4, which shows that

the words such as “on”, “for”, “a” and “of” are removed as they are stop words. Then,

the stemmed words are generated using stemming techniques. The words do not include

the tag names in the XML document; therefore, it is not clear whether the name “bing”

refers to an author or an editor of the paper. This may result in imprecise knowledge

discovery.
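The pre-processing steps described above can be sketched as follows. This is an illustrative toy in Python: the stop list is a tiny sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as the Porter stemmer used by methods in [89, 91].

```python
STOP_WORDS = {"a", "of", "on", "for", "the"}  # tiny illustrative stop list

def naive_stem(word: str) -> str:
    """Crude suffix stripping, for illustration only; real systems use e.g. Porter stemming."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list:
    """Lower-case, drop stop words, then stem the remaining tokens."""
    tokens = [t.lower() for t in text.split()]
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("A survey of multiway data analysis")` drops "A" and "of" and stems "analysis", leaving the terms that would populate the content VSM.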

Thus, it is essential to include not only the hierarchical structural relationships among

the data items but also the content while mining for XML documents. One such model


Table 2.2: Transactional data model generated from the content of the XML document given in Figure 2.4

Document Id Terms

1    siam10 1, siam 1, data 1, mineng 1, 2010 1, bing 1, liu 1, liub 1, mach 1, fast 1, random 1, tensor 1, decomposition 1, charalampos 1, tsourakakis 1, ctsourak 1, cmu 1, unsupervised 1, multiwai 1, data 1, analysis 1, literature 1, survei 1, acar 1, acare 1, cs 3, rpi 2, edu 3, yener 2

which preserves the hierarchical information can be found in graphs. The following subsection will provide details of using this model.

2.2.2 Graphs

Due to the extensive research in graph mining methods for semi-structured data, many of

these methods are used for XML mining [5]. In these methods, the XML tags along with

their hierarchical relationships are modelled as graphs.

Graph: A graph can be defined as a triple (V, E, f), where V represents the set of nodes (or vertices), E is the set of edges, and f: E → V × V is a mapping function. The nodes are the elements in XML documents, and the edge set E consists of the links that connect the nodes in order to represent parent-child relationships.

As illustrated in Figure 2.8, there are different types of graphs, among which the

popular types are:

1. labelled graph or unlabelled graph;

2. directed or undirected graph;

3. cyclic or acyclic; and

4. connected or disconnected graph.

A labelled graph denoted by (V, E, f, Σ, L) contains an additional alphabet Σ which



Figure 2.8: (a) A graph; (b) a labelled graph; (c) a directed graph

represents the node labels (P, Q, R, S, T, U) and the edge labels (A, B, C, D, E, F), with a labelling function L to assign the labels to the vertices and edges. A graph is directed if the order in which the vertices of each edge are connected is indicated; otherwise it is undirected. A cyclic graph is one that contains a path in which the first and last vertices are the same. If all the vertices are connected by edges in such a way that there exists a path from every node to any other node in the graph, then it is a connected graph; otherwise it is disconnected.

Often graph models are used in representing the schema of an XML document rather

than representing the XML document itself due to the presence of a cyclic relationship in

a schema. The labelled graph representation of the schema shown in Figure 2.2 is given in

Figure 2.9, where ovals represent nodes with labels and circles represent nodes without labels. It can be noted from the graph representation that there is a cyclic reference

to the element ‘paper’ from the element ‘reference’.

2.2.3 Trees

Often XML documents occur naturally as trees or can easily be converted into trees using

node-splitting methodology [9]. Trees are a form of graph that is acyclic (contains no cycles) and


[Figure: labelled graph of conf.dtd with element nodes conf, id, title, year, editor, paper, author, person, name, email and reference, unlabelled cardinality nodes (?, *, +), and a cyclic edge from reference back to paper]

Figure 2.9: Graph representation of conf.dtd

connected.

Tree: A tree is denoted as T = (V, v0, E, f), where (1) V is the set of nodes; (2) v0 is the root node, which does not have any edges entering it; (3) E is the set of edges in the tree T; and (4) f is a mapping function f: E → V × V.

Given two trees T = (V, v0, E, f) and T ′ = (V′, v1, E′, f′), T and T ′ are isomorphic, written T ∼= T ′, if there exists a bijective node mapping g: V → V′ such that (u, v) ∈ E ⇔ (g(u), g(v)) ∈ E′. Such a map g is called an isomorphism [34]. Hence, two labelled trees T and T ′ are isomorphic to each other if a one-to-one mapping from T to T ′ exists that preserves the root, the node labels, and both the adjacency and the non-adjacency of the vertices.

XML parsers are used to extract the structure features, the content features, or both from XML documents. Among the XML parsers, Document Object Model (DOM) parsers

can be used to derive the tree-structure (a node tree) with the elements, attributes, and

text defined as nodes for the given XML documents. XML documents can be modelled


[Figure: tree rooted at conf with children title, year, editor and paper; tags are drawn as ovals and text values such as “SIAM Data Mining Conference”, “2010”, “Bing Liu” and the paper titles as rectangular leaf nodes]

Figure 2.10: Tree representation of the XML document given in Figure 2.4

as unranked or ordered labelled trees where labels correspond to XML tags which may

or may not carry semantic information. More precisely, when considering only the tree

structure, it should be noted that each node can have an arbitrary number of children,

the children of a given node are ordered and each node has a label in the vocabulary of

the tags [19].

For example, using the example XML document shown in Figure 2.4, a tree-based

model for this XML document can be derived as shown in Figure 2.10. In this figure,

the oval shapes indicate the tags and the terms are represented using rectangular boxes.

This tree representation contains both structure and content of XML documents with the

leaf node representing the terms. Therefore, to represent only the structure of the XML

documents, the terms are removed in which case the leaf node represents a tag.
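The conversion from a parsed document into a labelled tree can be sketched as below, assuming Python's `xml.etree.ElementTree` as the DOM-style parser; the nested-tuple encoding of nodes is an illustrative choice, not the representation used in this thesis.

```python
import xml.etree.ElementTree as ET

def to_tree(elem, keep_text=False):
    """Turn a parsed element into a nested (label, children) tuple.

    With keep_text=True the stripped leaf text is attached as an extra child
    node, modelling structure and content; otherwise only the tags are kept."""
    children = [to_tree(child, keep_text) for child in elem]
    if keep_text and not children and elem.text and elem.text.strip():
        children = [(elem.text.strip(), [])]
    return (elem.tag, children)

doc = ET.fromstring("<conf><title>SIAM</title><year>2010</year></conf>")
```

Calling `to_tree(doc)` keeps only the tag structure, while `to_tree(doc, keep_text=True)` attaches the text values as leaf nodes, mirroring the two variants of the tree model described above.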


2.2.4 Paths

The XML elements can also be modelled as paths, which maintain the hierarchical relationship among their nodes.

Path: Given a tree T = (V, v0, E, f), a path P of length j in T is a sequence of nodes (v0, v1, . . . , vj) where (vi−1, vi) ∈ E for all 1 ≤ i ≤ j.

A path could be either a partial path or a complete path. A partial path contains the

edges in sequential order from the root node to a node in the document; a complete path

(or unique path) contains the edges in sequential order from the root node to a leaf node.

A leaf node is a node that encloses the content or text. In Figure 2.11, (a) and (b) show

an example of a complete and a partial path model respectively for the structure of the

XML document using its tags. A complete path can have more than one partial path with

varying lengths.

[Figure: (a) the complete path conf→editor→person→name; (b) the partial path conf→editor→person; (c) the complete path conf→editor→person→name ending in the text node “Bing Liu”]

Figure 2.11: Paths derived from XML document model (in Figure 2.4): (a) A completepath; (b) a partial path; (c) a complete path with text node

As shown in Figure 2.11(c), paths can be used to model both the structure and the

content by considering the leaf node as the text node. However, this kind of model results

in repeated paths for different text nodes. For instance, if there is another editor “Malcom

Turn” for this conference proceedings, then the path with the text node for this editor


will have the same path as that of the editor “Bing Liu” and the only difference will be

in the text node. Hence, to reduce the redundancy in the structure and to capture the

sibling relationship, the structure of the XML documents can be modelled as “Trees” or

“Graphs” as discussed above.
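Complete and partial paths as defined above can be enumerated as in the following sketch (Python with `xml.etree.ElementTree`; the helper names are illustrative):

```python
import xml.etree.ElementTree as ET

def complete_paths(elem, prefix=()):
    """Yield every complete (root-to-leaf) tag path in the element tree."""
    path = prefix + (elem.tag,)
    children = list(elem)
    if not children:
        yield path
    for child in children:
        yield from complete_paths(child, path)

def partial_paths(path):
    """All prefixes of a path, starting from the root node."""
    return [path[:i] for i in range(1, len(path) + 1)]

doc = ET.fromstring("<conf><editor><person><name/></person></editor></conf>")
```

For this fragment the only complete path is conf→editor→person→name, and `partial_paths` produces its three shorter prefixes, illustrating that a complete path subsumes several partial paths of varying lengths.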

This discussion of the various data models which represent the XML data for mining

leads into the following sections, which will look into the two major mining focus areas in

this research: clustering and frequent pattern mining.

2.3 XML clustering

The increased use of XML documents for data representation and exchange has attracted

a great deal of interest among researchers for efficient data management and retrieval [67].

Clustering has been perceived by the research community as a task for offering an efficient

data management solution [104]. The clustering process of XML data plays a dominant

role in many data applications such as information retrieval, data integration, document

classification, web mining and query processing [111].

Clustering is used to explore interrelationships among a collection of documents which

results in homogeneous clusters [64]. In these homogeneous clusters, the documents within

one cluster are more similar to each other than the documents belonging to a different

cluster. Clustering on XML documents can be performed by exploring the interrela-

tionships using the features inherent in these documents. This could be based on their

structure features or content features or a combination of structure and content features.

Most of the previous works on XML clustering focus on utilising either the structure

[4, 27, 50, 110, 60, 69, 114] or the content of XML documents [129, 36, 111]. Clustering

using both structure and content features has received significant attention recently in an


attempt to improve the accuracy of the clustering solution. In this subsection, the related

works for clustering, based on their features, will be presented. Furthermore, the related

works on each of these features are organised according to the data models (as described

in Section 2.2) used for representing them for clustering.

2.3.1 Based on structure

Due to the semi-structured nature of XML documents, an extensive number of methods have been proposed for clustering XML documents based on their structure. These methods consider that documents having a similar structure belong to the same group, and treat structure as the only feature required for clustering. They can be divided according to the data model used to represent the documents for clustering.

2.3.1.1 Vector Space Model (VSM)

Though VSM is a popular model for representing XML documents based on their content

especially for text-centric documents, it has also been used to model the structure of

these documents. In the work by Doucet and Lehtonen [36], vectors of tag names have been used to represent the XML documents for clustering.

However, as discussed before, this model ignores the hierarchical relationship between the

tag names. The work by Vercoustre et al. [114] models the XML documents in VSM

by using the frequency of the paths that are present in it. In contrast, Leung et al. [73] identify the common paths in the document collection and use them to represent the XML documents as boolean vectors over these common paths. A similar clustering

method, Closed Frequent Structures-based Progressive Clustering (CFSPC) [69], utilises

VSM to represent the common subtrees as a boolean vector. Though this model is simple

to represent, it could suffer from the typical disadvantages of boolean vectors such as the


absence of partial matching and difficulties in identifying the ranking between documents

which have similar substructures but varying lengths.

Once the XML dataset is represented in VSM, similarity between a pair of documents

can be measured using various distance measures such as Cosine, Euclidean, Manhattan, Jaccard, Dice, Simple matching and Overlap [33]. A comprehensive survey of these

measures can be found in [20]. The most common similarity measure for calculating the distance between vectors is the cosine measure. The cosine between two vectors di and dj, representing two XML documents, is given by:

\cos \theta = \frac{d_i \cdot d_j}{|d_i|\,|d_j|} \qquad (2.1)

The cosine similarity provides the measure of the angle between the two documents to

check whether the two documents point in the same direction or not. Another similarity

distance, which captures the magnitude of the documents, is the Euclidean distance. It measures the distance between two documents as the length of the difference between their term vectors.

The Euclidean distance between two documents is given by:

E_{d_i,d_j} = \sqrt{\sum_{k=1}^{m} \left(d_i^{t_k} - d_j^{t_k}\right)^2} \qquad (2.2)

The clusters produced using this distance tend to be spherical in nature. On the other

hand, Manhattan distance computes the distance between two data points in a grid-like

path. The Manhattan distance between two data points is the sum of the absolute differences of their corresponding components, as given by:

M_{d_i,d_j} = \sum_{k=1}^{m} \left|d_i^{t_k} - d_j^{t_k}\right| \qquad (2.3)


Other distance measures such as Simple matching, Overlap, Jaccard and Dice, which

are usually applied to sets, can also be used for vectors. These measures all compute

the intersection between the features of two vectors di and dj ; however, their denominators are

different. For instance, the Overlap measure is computed by the intersection between the

two vector features over the size of the vector that has the least features, whereas the

Jaccard measure is calculated as the intersection of the two vector features over the size

of the union set of the two vector features.
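These measures can be sketched concretely as follows, assuming each document is given as a term-frequency dictionary; the function and variable names are illustrative and are not taken from the cited works.

```python
# Illustrative implementations of the similarity/distance measures above,
# assuming two documents are given as term-frequency dictionaries.
import math

def cosine(di, dj):
    # cos(theta) = (di . dj) / (|di| |dj|)  -- Equation (2.1)
    terms = set(di) | set(dj)
    dot = sum(di.get(t, 0) * dj.get(t, 0) for t in terms)
    norm = math.sqrt(sum(v * v for v in di.values())) * \
           math.sqrt(sum(v * v for v in dj.values()))
    return dot / norm if norm else 0.0

def euclidean(di, dj):
    # Equation (2.2): square root of the summed squared term-weight differences
    terms = set(di) | set(dj)
    return math.sqrt(sum((di.get(t, 0) - dj.get(t, 0)) ** 2 for t in terms))

def manhattan(di, dj):
    # Equation (2.3): summed absolute term-weight differences
    terms = set(di) | set(dj)
    return sum(abs(di.get(t, 0) - dj.get(t, 0)) for t in terms)

def jaccard(di, dj):
    # intersection of the feature sets over their union
    a, b = set(di), set(dj)
    return len(a & b) / len(a | b) if a | b else 0.0

def overlap(di, dj):
    # intersection over the size of the smaller feature set
    a, b = set(di), set(dj)
    denom = min(len(a), len(b))
    return len(a & b) / denom if denom else 0.0
```

For instance, for d1 = {'a': 1, 'b': 1} and d2 = {'a': 1, 'c': 1}, the cosine similarity is 0.5, the Jaccard similarity 1/3 and the Overlap similarity 1/2, which illustrates how the shared numerator but differing denominators produce different rankings.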

2.3.1.2 Graphs

The XML clustering methods utilising graph structure can be grouped into two types:

node clustering and graph clustering. Flake et al. [38] and Aggarwal et al. [5] give a

good overview of XML graph clustering methods. The node clustering methods attempt

to group the underlying nodes with the use of a distance (or similarity) value based on

the edges. In this case, the edges of the graph are labelled with numerical distance

values. These numerical distance values are used to create clusters of nodes. On the other

hand, the graph clustering methods use the underlying structure as a whole and calculate

the similarity between two different graphs. This task is more challenging than the node

clustering tasks because of the need to match the structures of the underlying graphs,

and then to use these structures for clustering purposes. However, the graph clustering

methods using the underlying structure do not make any assumptions about the structure

and hence tend to provide better results [5].

A popular graph clustering method for XML documents is S-GRACE [117], which uses

a hierarchical clustering method based on the ROCK method [44] for similarity compu-

tation. S-GRACE computes the distance between two graphs by measuring the common

set of nodes and edges. Firstly, S-GRACE scans the XML documents and computes their

s-graphs. The s-graph of two documents is the set of common nodes and edges. The

s-graphs of all documents are then stored in a structure table called SG, which contains

two fields of information: a bit string representing the edges of an s-graph and the ids of

all the documents whose s-graphs are represented by this bit string. Once the SG is con-

structed, clustering can be performed on the bit strings. By exploiting the link (common

neighbours) between s-graphs, the best pair of clusters are selected and then merged in a

hierarchical manner.
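The bit-string idea behind S-GRACE can be sketched as follows. The encoding is a simplified illustration, and the distance shown is a plausible edge-overlap measure rather than necessarily the exact formula of [117].

```python
# A minimal sketch of the bit-string idea behind S-GRACE: encode each
# document's edge set over a global edge vocabulary and compare documents
# with cheap bit operations. The distance here is an illustrative
# edge-overlap measure, not necessarily the one defined in [117].

def edge_bits(doc_edges, vocab):
    """Encode a set of parent->child tag edges as an integer bit string."""
    bits = 0
    for e in doc_edges:
        bits |= 1 << vocab[e]
    return bits

def edge_distance(b1, b2):
    common = bin(b1 & b2).count("1")
    larger = max(bin(b1).count("1"), bin(b2).count("1"))
    return 1.0 - common / larger if larger else 0.0

# Usage: the edges observed across the collection form the vocabulary.
docs = [{("movie", "title"), ("movie", "actor")},
        {("movie", "title"), ("movie", "director")}]
vocab = {e: i for i, e in enumerate(sorted(set().union(*docs)))}
b0, b1 = (edge_bits(d, vocab) for d in docs)
# one shared edge out of two per document -> distance 0.5
```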

Another method that borrows techniques from artificial neural networks and uses

graphs for XML clustering is the Graph Self-Organizing Map (GraphSOM) [45] that allows

the encoding of the XML structure in the form of graphs. As the GraphSOMs require

training, they are trained to cluster XML formatted documents based on topological infor-

mation in the tags and on the type of XML tag embedded in the document. Though this

type of clustering can handle complex structures, it has been reported to have poor accuracy

[30, 137] on Wikipedia datasets.

2.3.1.3 Trees

Due to the complexity in graph clustering caused by the presence of cyclic relationships

between nodes, clustering XML documents using tree models has gained popularity over

the graph model. Clustering using tree models is one of the well-established fields of

XML clustering methods. Several clustering methods modelling the XML data as trees

have been developed to determine XML data similarity. Well-established tree edit

distance methods have been extended to compute the similarity between XML documents.

The tree edit distance is based on dynamic programming techniques for a string-to-

string correction problem [115]. The tree edit distance essentially involves three edit

operations, namely changing, deleting, and inserting a node, to transform one tree into

another tree. The tree edit distance between two trees is the minimum cost between the

costs of all possible tree edit sequences based on a cost model. The basic intuition behind

this technique is that the XML documents with the minimum distance are likely to be

similar and hence they can be clustered together.
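The recursion underlying the tree edit distance can be sketched as follows for ordered, labelled trees with unit edit costs. This naive memoised version is for illustration only; practical algorithms (e.g. Zhang-Shasha) are far more efficient.

```python
# Naive memoised tree edit distance for ordered labelled trees with unit
# costs for insert, delete and relabel. Trees are tuples: (label, children).
from functools import lru_cache

def size(tree):
    label, children = tree
    return 1 + sum(size(c) for c in children)

@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    if not f1 and not f2:
        return 0
    if not f1:
        return sum(size(t) for t in f2)   # insert everything in f2
    if not f2:
        return sum(size(t) for t in f1)   # delete everything in f1
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    return min(
        forest_dist(f1[:-1] + c1, f2) + 1,   # delete the root of the last tree
        forest_dist(f1, f2[:-1] + c2) + 1,   # insert the root of the last tree
        forest_dist(c1, c2) + forest_dist(f1[:-1], f2[:-1])
        + (0 if l1 == l2 else 1),            # match or relabel the two roots
    )

def tree_edit_distance(t1, t2):
    return forest_dist((t1,), (t2,))

# e.g. <a><b/><c/></a> vs <a><b/></a>: one node deletion
t1 = ("a", (("b", ()), ("c", ())))
t2 = ("a", (("b", ()),))
```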

Some of the clustering techniques that use the tree-edit distance are those of Nierman and

Jagadish [87] and Dalamagas et al. [26]. In [87, 26], the tree-edit distance is used to

compute the structural similarity between each pair of documents. XML documents with

a minimum distance are considered to be similar. A study showed that XML document

clustering using tree summaries provides high accuracy [26]. The structural

summaries of the XML documents were extracted and used to compute the tree-edit

distance. However, this type of similarity computation requires a quadratic number of

comparisons between the elements in the documents, resulting in prohibitive computational

complexity of the algorithm. Also, this similarity computation may lead to incorrect

results, as the calculated tree-edit distance can be large for trees of different sizes that

are very similar and conform to the same schema [124]. To resolve this issue, an efficient element

similarity measure was introduced in [86] based on the level-wise similarity of the nodes.

It utilises a novel global criterion function, the LevelSim, that measures the similarity at

a clustering level utilising the hierarchical relationships between elements of documents.

The elements in different level positions are allocated different weights. By counting the

common elements that share common ancestors, the hierarchical relationships of elements

are also considered in this measure. An improvement to XCLS is XEdge [10], which uses

the same level-wise similarity but applies it to edges instead of nodes to capture more

hierarchical relationships.

There are other clustering methods which avoid modifying the tree structure as in

tree edit distance methods by breaking the paths of tree-structured data into a collection

of macro-path sequences, where each macro-path contains a tag name, its attributes, data

types and content. A similarity matrix of XML documents is then generated based on the

macro-path similarity technique. Clustering of XML documents is then performed based

on the similarity matrix with the support of approximate tree inclusion and isomorphic

tree similarity [97]. Many other approaches have also utilised the idea of tree similarity

for XML document change detection [119] and for extracting the schema information from

an XML document, such as those proposed in Garofalakis et al. [43] and Moh et al. [79].

Besides mining the structural similarity of the whole tree, other techniques have also

been developed to mine the frequent pattern in subtrees from a collection of trees [107].

The method proposed by Termier et al. [107] consists of two steps: first it clusters the

trees based on the occurrence of the same pairs of labels in the ancestor relation using

the apriori heuristic; after the trees are clustered, a maximal common tree is computed to

measure the commonality of each cluster to all the trees.

2.3.1.4 Paths

There have been several XML clustering methods determining structural similarity based

on the paths shared between documents [73, 84]. The paths model represents the structure

of the document as a collection of paths. A clustering method measures the similarity

between XML documents by finding the common paths [83].
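A minimal sketch of such a path-based similarity, assuming documents are compared by their sets of root-to-leaf tag paths; the function names are illustrative and not taken from a specific cited method.

```python
# Extract the set of root-to-leaf tag paths from each XML document and
# compare documents by the overlap of their path sets (illustrative).
import xml.etree.ElementTree as ET

def leaf_paths(xml_string):
    """Return the set of root-to-leaf tag paths of an XML document."""
    def walk(node, prefix):
        path = prefix + (node.tag,)
        children = list(node)
        if not children:
            yield "/".join(path)
        for child in children:
            yield from walk(child, path)
    return set(walk(ET.fromstring(xml_string), ()))

def common_path_similarity(xml1, xml2):
    p1, p2 = leaf_paths(xml1), leaf_paths(xml2)
    return len(p1 & p2) / len(p1 | p2) if p1 | p2 else 0.0

doc1 = "<movie><title/><cast><actor/></cast></movie>"
doc2 = "<movie><title/><director/></movie>"
# shared path: movie/title; union: movie/title, movie/cast/actor,
# movie/director -> similarity 1/3
```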

One of the common techniques for identifying the common paths is to apply frequent

pattern mining on the collection of paths to extract the frequent paths of a constrained

length and to use these frequent paths as representatives for the cluster. This technique

has been utilised by Hwang and Ryu [50] and XProj [4]. XProj [4] clusters the XML

documents by extracting the frequent substructures in each of the clusters. XProj con-

verts a tree structure into a sequence (or path) of node labels and extracts the frequent

subsequences or subpaths.

Another simple method of finding XML data similarity according to common paths

is by treating the paths as a feature for the VSM model [36]. Other methods such as

XSDCluster [85], PCXSS [84] and XClust [72] adopt the concept of schema matching for

finding the similarity between paths. The path similarity is obtained by considering the

semantic and structural similarity of a pair of elements appearing in two paths. These path

measures are computationally expensive, since they consider

many attributes of the elements such as data type and their constraints.

As paths do not capture the sibling relations between the nodes in a tree, this

type of model may result in information loss when used for clustering XML documents. Also, path-

based frequent pattern mining methods may fail to provide concise substructures for the

structure of XML documents with a high branching factor.

Table 2.3 summarises the different XML document clustering methods based on the

various models. This comparison will help to understand the different types of models and

the similarity measures that have been used in the literature on clustering XML documents

using their structure.

2.3.2 Based on content

There have also been several clustering methods developed that use only the content

features of XML documents. These are especially suitable for text-centric XML documents

that have less structure information and more content. Most of these clustering methods

focus on representing the content in VSM with very little focus on other data models due

to the simplicity of the VSM model.

Table 2.3: Comparison of different types of clustering methods using structure of XML documents

Models | Methods                   | Similarity approach
-------|---------------------------|-----------------------------------------
VSM    | Vercoustre et al. [114]   | Euclidean distance on paths
VSM    | Leung et al. [73]         | Euclidean distance on paths
VSM    | CFSPC [69]                | Cosine similarity on frequent patterns
Graph  | S-GRACE [117]             | distance measure based on s-graphs
Graph  | GraphSOM [45]             | Euclidean distance
Tree   | Nierman and Jagadish [87] | tree-edit distance
Tree   | Dalamagas [26]            | tree-edit distance
Path   | XProj [4]                 | frequent paths
Path   | Doucet and Ahonen [36]    | Euclidean distance
Path   | PCXSS [84]                | path similarity
Path   | XEdge [10]                | edge similarity
Path   | XMine [82]                | path similarity
Path   | XML C [50]                | path similarity
Path   | XClust [72]               | path similarity

2.3.2.1 Vector Space Model (VSM)

VSM is commonly used by XML clustering methods that focus on content features [95].

The techniques using this model utilise the content of the XML documents by treating

them as a bag of words, similar to text documents, and then clustering them.

Doucet et al. [36] represent the content of the documents in VSM and apply the k-means

algorithm to group them. For large datasets, clustering XML data using all the content

can be expensive due to the presence of a large number of terms.

Recently, a content-based clustering method called Cover-Coefficient Based Clustering

Methodology (C3M) [7] for clustering XML documents based on their content was proposed.

It is a single-pass partitioning type clustering method which measures the probability of

selecting a document given a term that has been selected from another document. It proposes

two approaches – term-centric and document-centric index pruning – which generate

compact representations of the term or document indices respectively and then represent

them in VSM. Finally, the documents represented with these reduced representations in

the VSM are clustered.

Using only the content of the XML documents is suitable for documents which have

similar structure and hence the structure could be ignored. However, due to the prevalence

of heterogeneous XML documents in real-life datasets, the use of only the content in the

XML documents for clustering is not sufficient to provide an effective clustering solution.

In order to improve the effectiveness, researchers have resorted to utilising the semantic

dictionary WordNet to measure the synonym similarity of the keywords existing between

two documents [104]. However, this approach is not suitable for documents that contain

the same words but are unrelated because they belong to different themes.

The majority of these methods focus on clustering the XML documents by identifying

structure or content similarity between them. However, as pointed out earlier in the

chapter, for some datasets it becomes essential to include both the structure and the

content similarity in order to identify clusters.

2.3.3 Based on structure and content

Methods with a single-feature focus, using either structure or content, tend to falsely group

documents that are similar in only one of these features. To correctly identify similarity among

documents, the clustering process should use both their structure and content information.

However, most of the clustering methods using both the structure and content features

of the XML documents adopt naïve methods of combining them linearly, due to the

complexity inherent in the process of combining them. The following subsections detail the

different methods based on the models used.

2.3.3.1 Vector Space Model (VSM)

VSM due to its simplicity can also be used to model both the structure and the con-

tent of the XML documents. A representation which links the structure and the con-

tent features together is the Structured Link Vector Model (SLVM) [128] that represents

both these features of XML documents using vector linking. For instance in the SLVM

model, given an XML document $D_i$, it is defined as a matrix $D_i \in \mathbb{R}^{n \times m}$, such that

$D_i = \langle D_i(1), D_i(2), \ldots, D_i(m) \rangle$, where $m$ is the number of elements and $D_i(l) \in \mathbb{R}^n$ is the

TF-IDF feature vector representing the element $e_l$, given as $D_i(l)_j = TF(t_j, D_i, e_l) \cdot IDF(t_j)$

for all $j = 1$ to $n$, where $TF(t_j, D_i, e_l)$ is the frequency of the term $t_j$ in the element $e_l$ of

$D_i$. An improvement of this model using the concepts of LSI is the SLVM-LSI [56].
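The SLVM representation described above can be sketched numerically as follows, assuming a toy collection in which each document maps element names to the terms they contain; all names here are illustrative.

```python
# A numerical sketch of SLVM: each document becomes an n x m matrix with one
# row per term and one column per XML element, holding TF * IDF weights.
import math
import numpy as np

# toy collection: each document maps element name -> list of terms it holds
docs = [
    {"title": ["data", "mining"], "abstract": ["mining", "xml"]},
    {"title": ["xml", "clustering"], "abstract": ["clustering"]},
]
terms = sorted({t for d in docs for ts in d.values() for t in ts})
elements = sorted({e for d in docs for e in d})
n_docs = len(docs)

def idf(term):
    df = sum(any(term in ts for ts in d.values()) for d in docs)
    return math.log(n_docs / df)

def slvm_matrix(doc):
    D = np.zeros((len(terms), len(elements)))   # n terms x m elements
    for j, t in enumerate(terms):
        for l, e in enumerate(elements):
            tf = doc.get(e, []).count(t)        # TF(t_j, D_i, e_l)
            D[j, l] = tf * idf(t)
    return D

D0 = slvm_matrix(docs[0])
```

Note how a term such as "xml", which occurs in every document, gets an IDF of zero and so contributes nothing, while element-specific terms keep their position in the element dimension.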

Common Rare Pattern (CRP) and 4-length Rare Pattern (4RP) clustering methods

proposed by Yao and Zerida [131] use the VSM model for representing the paths. Each

path contains the edges in sequential order from a node i to a term in the data content.

These methods create a large number of features; for instance, if there are 5 distinct terms

in a document, and 5 of these distinct terms are in two different paths, then there will be

10 different features altogether.

The clustering method by Tran et al. [109] also models the structure and the content of

the XML documents in VSM. In this model, structure and content similarity are calculated

independently for each pair of documents and then these two similarity components are

combined using a weighted linear combination approach to determine a single similarity

score between documents. This approach is based on latent semantic analysis to develop

a kernel to incorporate content as well as the structure of XML documents. However,

using this method for clustering does have three key limitations. First, the approach

combines the structure and the content linearly, which may result in poor accuracy as the

relationship between the structure and the content is not used. It also causes difficulties in

tuning the parameters for combining the two features. These methods therefore require

extensive experiments to identify the best parameter settings to achieve a good clustering

solution. Second, the computational requirements for building the kernel are high since this

method has to compute the pair-wise similarity between the objects that are required to be

clustered. As a result, this method can only be applied to relatively small data sets (a few

thousand documents) and cannot be used to effectively cluster real-life XML document

corpora. Third, by relying only on the pair-wise similarities between documents, this method

tends to produce suboptimal clustering solutions especially when the similarities are low

relative to the cluster sizes. The key reason for this is that these clustering methods

can determine the overall similarity of a collection of objects (i.e., a cluster) only by

using measures derived from the pair-wise similarities (e.g., average, median, or minimum

pair-wise similarities). However, such measures, unless the overall similarity between the

members of different clusters is high, are quite unreliable because they cannot capture

what is common between the different objects in the collection [139].
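The linear combination criticised above reduces to a single weighted sum, which makes the parameter-tuning problem and the loss of the structure-content mapping easy to see; the name alpha is illustrative for the tuning parameter the text refers to.

```python
# The weighted linear combination of two independently computed similarity
# scores. The mapping between a path and its own content is already lost
# by the time the scores are fused.

def combined_similarity(sim_structure, sim_content, alpha=0.5):
    """Linear fusion of structure and content similarity scores."""
    return alpha * sim_structure + (1 - alpha) * sim_content

# e.g. two documents with identical tags but unrelated text still score 0.6
# when alpha = 0.6, regardless of which content occurred under which tag
score = combined_similarity(1.0, 0.0, alpha=0.6)
```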

As proposed in this thesis, using the common substructure of the XML documents,

their content is extracted and represented in VSM. This method helps not only to capture

the relationship between the structure and the content but also to remove the uncommon

content, which forms outliers affecting the accuracy of the clustering solution.

2.3.3.2 Trees

The Semantic XML Clustering (SemXClust) method [103] is the seminal work to cluster

semantically related XML documents by utilising both the structural information and the

content. This method represents XML documents in a set of tree tuples with the structure

and the content features enriched by the support of an ontology knowledge base to create a

set of semantically cohesive and smaller-sized documents. These tree tuples are modelled

as transactions and transactional clustering methods are then applied.

In contrast, based on the node relationships in the tree structure of the XML

documents, different types of subtrees were identified. These subtrees are then used in

[67, 68] to extract the content of the XML documents. These methods do not make any

assumption on the structure among the elements of the XML documents and they also

generate structural summaries of them. Since these subtrees were not only frequent but

were also concise, the content extracted using them is precise and therefore improves the

accuracy of the clustering solution.

2.3.3.3 Paths

To capture the hierarchical structure and the content of XML documents, some researchers

[114, 131] have made attempts to include the text along with the path representation

in order to cluster XML documents using both structure and content features. Results

from their studies show that for both these methods, clustering performance (F1-measure)

degrades when the structure is included with the content on the Wikipedia dataset, in comparison

to representing only the content. However, this is a data-specific characteristic; moreover, the

authors [131] extract all the paths and their corresponding content, which results in

an explosion in dimensionality. Nevertheless, it is essential to utilise the relationship between

the structure and the content in clustering.

Approaches using the linear combination of structure and content in VSM or using

paths often fail to scale for even small collections of a few hundred documents, and in some

situations this has resulted in poor accuracy [114]. A linear combination of structure and

content features of XML documents cannot perform effectively, since the mapping between

the structure and its corresponding content is lost. The content and structure features

inherent in an XML document should be modelled in a way that the mapping between

the content and its path or tree can be preserved and used in further analysis. One such

model is a multi-dimensional model combining these two features.

There has been limited research on clustering using multi-dimensional aspects of the

documents due to the complexity and the explosion of data produced as a result of com-

bining these multi-dimensional features. The complexity becomes worse when dealing

with large-sized datasets, although a BitCube representation [133] was used to cluster and

query the XML documents with paths, words and documents as the three dimensions of

the BitCube. Each entry in the BitCube represents the presence or absence of a given

word in the path of a document. The XML document collection is first partitioned,

using a top-down approach, into small bitcubes based on the paths, and then the smaller

bitcubes are clustered using the bit-wise distance and their popularity measures.

However, this method used all the paths in the XML documents and hence might incur a

heavy computational complexity for a very large collection of documents.
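A minimal sketch of a BitCube-style representation, assuming a toy collection of (path, word) occurrences per document; the dimensions follow the description above, while the names and the bit-wise distance shown are illustrative.

```python
# A 3-dimensional binary array over (documents, paths, words), where a 1
# marks that a word occurs under a path in a document (illustrative).
import numpy as np

docs = [
    {("movie/title", "matrix"), ("movie/actor", "reeves")},
    {("movie/title", "matrix"), ("movie/director", "wachowski")},
]
paths = sorted({p for d in docs for p, _ in d})
words = sorted({w for d in docs for _, w in d})

cube = np.zeros((len(docs), len(paths), len(words)), dtype=np.uint8)
for i, d in enumerate(docs):
    for p, w in d:
        cube[i, paths.index(p), words.index(w)] = 1

# bit-wise distance between two documents: number of differing entries
def bit_distance(i, j):
    return int(np.count_nonzero(cube[i] != cube[j]))
```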

2.3.3.4 Tensor Space Model (TSM)

The Tensor Space Model (TSM) helps to alleviate the disadvantages inherent in the VSM

by directly preserving the relationship between the structure and the content of the XML

documents. In the TSM, the content corresponding to its structure is stored and hence

the model could help to analyse the relationship between the structure and the content.

Traditionally, tensors have been widely used in physics for stress and strain analysis.

TSMs have been successfully used in representing and analysing multi-dimensional data

in signal processing [80], web mining [78] and many other fields [112]. Tensor clustering is

a multi-way data analysis task which is currently gaining importance in the data mining

community. The simplest tensor clustering scenario, co-clustering or bi-clustering, where

two dimensions are simultaneously clustered, is well established [8, 104]. Authors in [14]

proposed a method for multi-way clustering on tensors by extending the co-clustering

technique from matrices to tensors using relational graphs. Recently, the approximation-based

Combination Tensor Clustering method [55] was also proposed, which clusters

along each of the dimensions and then the cluster centres are represented in the tensor.

These co-clustering techniques capture only the 2-way relationships among the features

and ignore the dependence of multiple dimensions in clustering, which may result in a loss

of information while grouping the objects.

Several decomposition algorithms have been developed [37] to analyse the TSM and

to derive correlations and relationships from different features represented in TSM. There

are two broad families of tensor decompositions, namely CANDECOMP/PARAFAC (CP)

[61] and TUCKER [113]. CP is a higher-order analogue of Singular Value Decomposition

(SVD) or Principal Component Analysis (PCA). The CP solutions are not unique due to

the heavy dependence of CP on the initial guess, whereas HOSVD and Tucker tend to provide

unique solutions. Table 2.4 presents a summary of other tensor decomposition algorithms

based on CP and Tucker. Acar and Yener [37] and Kolda and Bader [63] present a detailed

survey of all these decomposition algorithms.

Table 2.4: Popular tensor decomposition algorithms based on CP and Tucker

Based on | Other tensor decompositions
---------|----------------------------------------------------------------
Tucker   | Memory Efficient Tucker (MET); MACH; High-Order Singular Value Decomposition (HOSVD); High-Order Orthogonal Iteration (HOOI)
CP       | Individual Differences in Scaling (INDSCAL); Implicit Slice Canonical Decomposition (IMSCAND)
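As an illustration of the matricization idea behind the Tucker family, the following is a compact HOSVD sketch in NumPy: each factor matrix holds the leading left singular vectors of the corresponding mode-n unfolding. It is a didactic sketch, not a production implementation.

```python
# HOSVD via mode-n matricization: unfold the tensor along each mode, take
# the leading left singular vectors, and project to obtain the core tensor.
import numpy as np

def unfold(T, mode):
    """Mode-n matricization: the chosen mode becomes the rows."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_dot(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def hosvd(T, ranks):
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = T
    for m, U in enumerate(factors):
        core = mode_dot(core, U.T, m)   # project onto the factor basis
    return core, factors

def reconstruct(core, factors):
    X = core
    for m, U in enumerate(factors):
        X = mode_dot(X, U, m)
    return X

# with full ranks the decomposition is exact; truncating the ranks gives
# the compressed Tucker-style representation
T = np.random.default_rng(0).random((3, 4, 5))
core, factors = hosvd(T, T.shape)
```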

Recently, the Incremental Tensor Analysis (ITA) methods [102] such as STA (Stream-

ing Tensor Analysis), DTA (Dynamic Tensor Analysis) and WTA (Window-based Tensor

Analysis) were also proposed to deal with large datasets. These methods are efficient in

decomposing sparse tensors (density ≤ 0.001%). However, when large dimensions exist

or the tensor is dense, these decomposition techniques fail. Real-life

XML documents when represented in TSM are dense with about 127M (where “M” denotes

Million) non-zero entries with over 1M terms. Hence, these decomposition algorithms can-

not be applied. A memory-efficient implementation of Tucker, MET [58], was proposed to

avoid the intermediate blow-out in tensor factorisation. Recently, a random decomposition

technique, MACH [112] was proposed to be suitable for large dense datasets. The number

of entries in the tensor is randomly reduced using Achlioptas-McSherry’s technique [3] to

decrease the density of the dataset. When MACH is used in clustering dense datasets in

an attempt to reduce the density, it might ignore documents of smaller length

because these documents could be grouped into a single cluster, in spite of differences in

their structure and content.

It can clearly be seen that there are several contributions to the research on TSM but

only limited research exists on using TSM for clustering the XML documents. To the best

of the author’s knowledge, only the authors in [96] have applied tensor clustering to a

semi-structured document dataset using IMSCAND (Implicit Slice Canonical Decomposition).

They utilised six pre-defined similarity values in a tensor model to group bibliographic

data such as similarity among words in abstracts, between names of authors, keywords,

words in the title, co-citations and co-reference information. This type of assumption

prevents IMSCAND from being applied to different types of XML document collections. In

contrast, this research conducts tensor clustering on XML documents without any prior

assumption, which is appropriate in the case of clustering a large number of documents

with a diverse nature.

2.3.4 Research gaps in XML clustering

A review of the literature has revealed that the following research gaps in XML Clustering

exist:

� Most of the existing methods [110, 4, 114, 73, 65] use either the structure or the

content of the XML documents for clustering but not both.

� Techniques which attempt to cluster the XML documents using both of these features

fail to scale for very large datasets [114] or result in poor accuracy due to the linear

combination [109, 36, 132].

� Most of the clustering methods [36, 109, 104] have relied on the two-dimensional

VSM for grouping the XML documents, with limited research on using TSM [96] for

clustering.

In order to address these research gaps in the clustering of XML documents using

both the structure and the content of the XML documents, a feature selection method is

required to reduce the dimensionality of the combination. Also, utilising all the structure

and the content features of the XML documents is infeasible for a very large number of

documents. Hence, it is essential to identify not all but only common patterns among

these features in XML documents. One of the popular techniques for identifying common

patterns is frequent pattern mining. Frequent pattern mining has already been used as a

kernel function in classification [136], clustering [4], association rules and sequential rules.

The following section reviews the research works on frequent pattern mining.

2.4 Frequent pattern mining

Frequent pattern mining was first introduced along with association rules mining [6] to

analyse customer-buying behaviour from retail transaction databases. Frequent pattern

mining in these databases involves identifying patterns that occur quite often, and hence

these patterns are called frequent. In general, frequent patterns are a set of items or item-

sets in transactional databases. The frequent patterns are called subsequences, subtrees

and subgraphs, when extracted from sequential databases, trees and graphs respectively.

This section briefly introduces the frequent pattern mining on XML documents, presents

the various data models that have been used for frequent pattern mining and the methods

that have been used to generate frequent patterns that will be useful for clustering.

2.4.1 An overview

Frequent pattern mining on XML documents involves identifying the common patterns

based on a user-defined support threshold, often referred to as minimum support, denoted

by (min supp). The frequent pattern mining can be defined as follows:

Given an XML dataset (a collection of XML documents) D = {D1,D2, ...,Dn}, find

frequent patterns P = {p1, p2, . . . , pr}, such that for every pi ∈ P, freq(pi) ≥ min supp

where freq(pi) is the percentage of documents in D that contain pi.
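The definition above can be sketched directly in code; this brute-force version enumerates candidate itemsets without the apriori pruning that practical miners use, and all names are illustrative.

```python
# A pattern (here, an itemset) is frequent when the fraction of documents
# containing it reaches min_supp. Practical apriori-style miners prune
# candidates using downward closure instead of enumerating all of them.
from itertools import combinations

def frequent_patterns(dataset, min_supp, max_size=2):
    """Return itemsets whose document frequency >= min_supp (a fraction)."""
    n = len(dataset)
    items = sorted({x for doc in dataset for x in doc})
    frequent = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            freq = sum(set(candidate) <= doc for doc in dataset) / n
            if freq >= min_supp:
                frequent[candidate] = freq
    return frequent

docs = [{"title", "author"}, {"title", "year"}, {"title", "author"}]
pats = frequent_patterns(docs, min_supp=2 / 3)
# "title" occurs in 3/3 documents, ("author", "title") in 2/3
```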

Due to the simplicity of this mining task, there have been several works conducted

on frequent pattern mining. Figure 2.12 presents a taxonomy of frequent pattern mining

based on the features of the XML documents.

Figure 2.12: Hierarchy of XML frequent pattern mining. The hierarchy divides XML frequent pattern mining into structure mining (intra-structure and inter-structure mining), content mining (content analysis and structural clarification), and combined structure and content mining.

XML Frequent Structure Mining

Mining of the frequent structures can be broadly divided into two categories, namely

intra-structure mining and inter-structure mining (refer to Figure 2.12). Intra-structure

mining deals with finding structural information within an XML document. Knowledge

is gained concerning the internal structure of XML documents, that is, their schema

definitions. In contrast, inter-structure mining is concerned with mining a set

of documents and identifying frequent substructures among them. As XML documents are

often viewed as trees or graphs due to their hierarchical structure, the result of frequent

pattern mining on their structure will be a set of subtrees or subgraphs; for example, an <employee>

tag often contains a <salary> tag. This information could be used in the information

retrieval domain to quickly locate the salary details when queried.

XML Frequent Content Mining

The content of XML documents basically refers to the text between the start and

the end tag. For instance, in <Title> Data Mining </Title>, content refers to the text

“Data Mining” between the start tag <Title> and the end tag </Title>. Based on the

purpose, XML frequent content mining is classified into content analysis and structural

clarification. Content analysis is similar to the relational database mining for identifying

a frequently occurring instance of a relation. For instance, it can be used to identify

frequently occurring items in a transaction. On the other hand, the structural clarification

helps to distinguish two structurally similar documents based on their contents such as

homographs.

XML Frequent Structure and Content Mining

Apart from mining structure and content separately, a new category of mining the

structure and content together was introduced in [66]. To apply frequent pattern mining

on both these features, either the partial or full structure of XML documents along with

its content could be used. Hence, in contrast to content mining, the structure and content

mining methods do not enforce the restriction that the structure of the document should

be fixed or constant. Similar to content mining, the combined structure and content

frequent pattern mining helps to provide structural clarification and content analysis.


However, a number of challenges exist when mining the combined structure and content

information. The major challenge is that the mining method must be highly scalable,

as storing both the content and the structure information results in very large files.

Based on the previous discussions, it is clear that applying frequent pattern mining

on both the features is not useful, as this will also result in a huge explosion of data. As

the structure has been used to store the content, the dimensionality reduction could be

achieved by applying frequent pattern mining on the structure of the XML documents.

Hence, the following subsections will review frequent pattern mining literature based on

the structure of the XML documents using the various models discussed in Section 2.2.

2.4.2 Frequent pattern mining methods

The model in which XML documents are represented for frequent pattern mining forms the

basis for how they are mined for frequent patterns. For instance, if the XML document is

viewed as a vector of tags, then the application of frequent pattern mining results in a

list of frequently occurring tags. Another view of XML documents is as graphs or trees,

where each XML document, represented in a string format, corresponds to a transaction.

The application of frequent pattern mining on trees or graphs results in frequent subtrees

or subgraphs. This subsection will look into frequent pattern mining methods based on

the structure of the XML documents.

2.4.2.1 Vector Space Model (VSM)

Inspired by frequent itemset mining, XML data is represented as a vector of tags in

a sparse VSM, as this mimics the characteristics of a transaction database, which records

items and their occurrences. However, applying frequent pattern mining on the sparse


VSM representation of XML documents uses the standard techniques for frequent itemset

mining and is not specific to XML data, and it therefore loses the relationship between

the tags.

In the sparse VSM, each XML document in the dataset is represented based on either

its tags or its content. The frequent pattern mining method begins with a complete scan

of the tags in the VSM to identify the 1-length frequent tags by testing the

support of each tag. The 1-length frequent tags are combined together to form 2-length

candidate tags. These candidate tags are tested to verify whether they are frequent or not.

The process is repeated with the 2-length frequent tags until there are no more frequent

tags that could be found. This approach of frequent pattern generation is referred to

as generate-and-test, as we generate the (k+1)-length candidates by joining the frequent

k-length patterns and test whether the generated candidates are actually frequent. In

order to reduce the number of candidates generated, the apriori heuristic has been applied.

The basic intuition of this heuristic is that any subpattern of a frequent pattern must also

be frequent. Hence, while generating the candidates, only the frequent subpatterns are

used, ignoring the infrequent ones.
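The generate-and-test procedure with apriori pruning described above can be sketched as follows for sets of tags. This is an illustrative sketch only, not the thesis's implementation; the function name and the use of an absolute document count as support are assumptions:

```python
from itertools import combinations

def apriori_tags(docs, min_supp):
    """docs: list of sets of tags; min_supp: absolute document count."""
    def support(itemset):
        return sum(itemset <= d for d in docs)

    # 1-length frequent tags: a full scan of the tags
    items = sorted({t for d in docs for t in d})
    current = [frozenset([t]) for t in items if support(frozenset([t])) >= min_supp]
    frequent = list(current)
    k = 1
    while current:
        # join k-length frequent sets to form (k+1)-length candidates,
        # keeping only those whose k-subsets are all frequent (apriori heuristic)
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k + 1}
        seen = set(current)
        candidates = {c for c in candidates
                      if all(frozenset(s) in seen for s in combinations(c, k))}
        # test: verify each surviving candidate against the dataset
        current = [c for c in candidates if support(c) >= min_supp]
        frequent += current
        k += 1
    return frequent
```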

As the apriori heuristic generates candidates, it could become expensive when there are

a very large number of documents, each containing a large number of tags. In order

to overcome the problem of candidate generation, a novel approach called pattern-growth

was proposed which adopts the “divide-and-conquer” heuristic to recursively partition the

dataset based on the frequent patterns generated, and then mines each of the partitions

for frequent patterns. This technique is essentially suitable for datasets which

have large numbers of documents and are dense in nature.

Gu, Hwang, and Ryu [50] used the VSM model to identify frequent patterns from

XML documents. This model has also been used by Braga, Campi, Ceri, Klemettinen, and Lanzi [17]


and Wan and Dobbie [116] to generate frequent patterns which were then used to create

association rules from these XML documents.

2.4.2.2 Graphs

Frequent graph mining is popular for mining XML schemas. This type of graph mining can

also be applied to XML datasets in which various documents are linked to each other.

Frequent graph mining can be defined as follows.

Given a graph dataset D, find a subgraph g such that freq(g) ≥ min supp where

freq(g) is the percentage of graphs in D that contain g.

Similar to frequent itemset mining, the frequent graph mining methods count the

support of the 1-length subgraphs by scanning the dataset and identifying the 1-length

frequent subgraph. The next step involved is candidate generation in which it uses either

joins or extensions to the 1-length frequent subgraph to generate candidates. If a candidate

is generated by joining two frequent subgraphs then the technique is referred to as join.

If a candidate is generated by extending a node with any of the 1-length frequent nodes,

then it is referred to as extension. As there could be many possibilities to join or extend a

node, the rightmost node extension technique was introduced [12] to limit the number of

candidates generated and to avoid redundancy. In this popular technique, only

the rightmost node is extended, hence avoiding the repeated generation of the same candidates.

After generating the candidates their support is determined by scanning through the

dataset. This process of generate-and-test is repeated until there are no more candidates

that could be generated.

Based on the graph traversal, the graph mining methods can be classified into breadth-

first and depth-first methods. Some of the apriori-based graph miners which belong to


the breadth-first category are AGM (Apriori-based Graph Mining method) [51], AcGM

(Apriori-based connected Graph Mining method) [53] and FSG [32]. These methods use

the apriori heuristic to reduce the search space. On the other hand, gSpan [128] and FFSM

[48] support depth-first traversal to find frequent subgraphs.

One of the well-known problems with graph mining methods is subgraph isomorphism,

which is NP-complete [42]. Subgraph isomorphism is the problem of identifying whether

a given graph G contains a subgraph that is isomorphic to another graph H.

Given a graph G = (Vg, Eg) and a graph H = (Vh, Eh), G is said to contain

H iff there is an injective mapping f : Vh → Vg such that (x, y) ∈ Eh implies

(f(x), f(y)) ∈ Eg. As the candidate subgraphs in frequent graph mining methods

are tested against the graph dataset, the support counting step is an instance of the

subgraph isomorphism problem. Due to the huge number of candidate checks required, it

can take an exponential amount of time to identify frequent subgraphs.

However, some of the recent techniques such as Biased Apriori-based Graph Mining

(B-AGM) [52] provide results in an acceptable time period. In spite of the advancements

in graph mining, these methods have often been criticized for producing more generic

results than accurate ones, and they incur an expensive canonisation step for transforming

the data into a uniform representation suitable for mining [135]. As

a result, recent researchers have shifted their attention to tree mining methods by using

trees instead of graphs and adapting graph mining techniques for frequent pattern mining

of XML data, as discussed in the following subsection.


2.4.2.3 Trees

Similar to frequent graph mining, the objective of applying frequent pattern mining on

XML documents represented in the tree data model is to identify frequent subtrees present

in the data. Frequent tree mining methods can be classified based on several factors such

as tree representation, subtree representation, traversal strategy, canonical representation,

the tree mining approach adopted and the type of candidate generation technique used for

apriori-based methods. Table 2.5 provides an outline of them.

The initial work on frequent tree mining was undertaken by Zaki [134], who proposed a

method to discover all subtrees in a forest (i.e., a large collection of ordered trees). There

are two steps involved in this method. Firstly, it enumerates candidate k-length subtrees

and counts their frequency by performing a depth-first search. Secondly,

(k+1)-length subtrees are generated from the k-length subtrees whose frequency is greater

than the threshold. With a small alteration, this method can be applied to discover

frequent subtrees in unordered labelled trees.

In frequent tree mining of XML data, it has been noted from the sample XML dataset

(in Figure 1.2) that often the entire tree will not be frequent as in fragments (d) and

(e) due to the nodes “ConfLoc” and “ConfYear” respectively. Rather there is a good

possibility that parts of these trees could be frequent. The parts of such trees are referred

to as subtrees. A subtree from the sample XML dataset is shown in Figure 2.13.

There are different types of subtrees based on:

• Node relationship – induced and embedded

• Conciseness – closed and maximal

Based on the node relationship, the subtrees could have either parent-child or ancestor-


Table 2.5: Classifications of frequent tree mining methods

Based on                   Types                          Methods
Tree representation        Free tree                      FreeTreeMiner [23]
                           Rooted unordered tree          uFreqt [88]
                           Rooted ordered tree            Unot [13]
Subtree representation     Induced subtree                FREQT [12], uFreqt [88],
                                                          HybridTreeMiner [21], Unot [13],
                                                          PrefixTreeISpan [141]
                           Embedded subtree               TreeMinerV [136], TMp [35],
                                                          X3-Miner [106]
Traversal strategy         Depth-first                    FREQT [12], HBMFP [77],
                                                          TreeMinerV [136], uFreqt [88]
                           Breadth-first                  FreeTreeMiner [23], X3-Miner [106]
                           Depth-first & breadth-first    TreeMiner [134], Gaston [88],
                                                          HybridTreeMiner [21]
Canonical representation   Pre-order string encoding      TreeMiner [134]
                           Level-wise encoding            HybridTreeMiner [21]
Candidate generation       Extension                      FREQT [12], Unot [13]
                           Join                           TreeMiner [134], PathJoin [123]
                           Extension and join             HybridTreeMiner [21]

[Figure content: a subtree rooted at “Conference” with children “ConfTitle”, “ConfAuthor” and “ConfName”.]

Figure 2.13: Example of a subtree from the sample XML dataset in Figure 1.2

descendant relationships among their nodes resulting in induced and embedded subtrees

respectively. On the other hand, based on conciseness, frequent subtrees can be closed or

maximal: a frequent subtree is closed if none of its frequent supertrees has the same

support, and maximal if it has no frequent supertree at all.


Node relationship

The two types of subtrees based on the node relationship are induced subtrees and

embedded subtrees.

Induced subtree

For a tree T with node set V and edge set E, we say that a tree T′ with node set

V′ and edge set E′ is an induced subtree of T if and only if (1) V′ ⊂ V; (2) E′ ⊂ E; and (3) the

labelling of the nodes of V′ and E′ in T′ is preserved in T. In simpler terms, an induced

subtree T′ is a subtree which preserves the parent-child relationships among the vertices of

the tree T.

Embedded subtree

For a tree T with node set V and edge set E, we say that a tree T′ with node set

V′ and edge set E′ is an embedded subtree of T if and only if (1) V′ ⊂ V; (2) the

labelling of the nodes of V′ in T′ is preserved in T; (3) for every edge (v1, v2) ∈ E′,

where v1 is the parent of v2 in T′, v1 is an ancestor of v2 in T; and (4) for v1, v2 ∈ V′,

preorder(v1) < preorder(v2) in T′ iff preorder(v1) < preorder(v2) in T. In other words, an embedded

subtree T′ preserves the ancestor-descendant relationships among the vertices of the tree T.
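A minimal sketch of these two containment conditions, representing each tree as a child-to-parent dictionary; the node names and helper functions are illustrative assumptions, and the pre-order condition of the embedded definition is omitted for brevity:

```python
def ancestors(parent, v):
    """All ancestors of v, following the child -> parent map to the root."""
    out = set()
    while v in parent:
        v = parent[v]
        out.add(v)
    return out

def is_induced(sub_parent, parent):
    """Every parent-child edge of the subtree is a parent-child edge of T."""
    return all(parent.get(c) == p for c, p in sub_parent.items())

def is_embedded(sub_parent, parent):
    """Every parent in the subtree is an ancestor of that node in T."""
    return all(p in ancestors(parent, c) for c, p in sub_parent.items())

# Tree T, as in Figure 2.14(a): Book -> Title, Author; Author -> Name
T = {"Title": "Book", "Author": "Book", "Name": "Author"}
print(is_induced({"Title": "Book", "Author": "Book"}, T))  # True
print(is_induced({"Name": "Book"}, T))   # False: Book is not Name's parent
print(is_embedded({"Name": "Book"}, T))  # True: Book is Name's ancestor
```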

Figure 2.14 shows an induced and an embedded subtree generated from the Tree, T .

It can be seen that Figure 2.14(b) preserves the parent-child relationship, and in Figure

2.14(c), though the node “Book” is not the parent of “Name”, it is its ancestor.

Based on the “traversal strategy”, the frequent-tree mining methods can be classified

into three categories, namely depth-first, breadth-first or a combination of both. In order

to traverse the trees, trees are often represented as a set of strings, with the representation

starting from the root node to the child nodes either in a breadth-first fashion (left-to-

right) or in a depth-first (top-to-bottom) fashion. ‘-1’ is used to mark the end of the nodes


[Figure content, reconstructed: (a) a tree rooted at “Book” with children “Title”, “Author” and “Publisher”, where “Author” and “Publisher” each have a child “Name”; (b) an induced subtree “Book” with children “Title” and “Author”, and “Name” as a child of “Author”; (c) an embedded subtree “Book” with children “Title” and “Name”.]

Figure 2.14: (a) A tree; (b) an induced subtree; (c) an embedded subtree

in that level or to indicate backtracking. The tree in Figure 2.15 (a simplified version of the

tree in Figure 2.14(a), using alphabet letters as node labels instead of tag names) can be represented

as < A B -1 C E -1 -1 D F -1 -1 -1 > in pre-order string encoding.

The signs < and > represent the start and end of the representations respectively.

[Figure content: root “A” with children “B”, “C” and “D”; “E” is a child of “C” and “F” is a child of “D”.]

Figure 2.15: A sample tree using node labels as alphabets instead of the tag names
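The encoding above can be generated recursively; a sketch, representing each node as a (label, children) pair (the function name is illustrative):

```python
def preorder_encode(label, children):
    """Pre-order string encoding: emit the label, recurse into the
    children left-to-right, then emit '-1' to mark backtracking."""
    parts = [label]
    for child in children:
        parts += preorder_encode(*child)
    parts.append("-1")
    return parts

# The tree of Figure 2.15: root A with children B, C and D;
# E is a child of C and F is a child of D
tree = ("A", [("B", []), ("C", [("E", [])]), ("D", [("F", [])])])
print("< " + " ".join(preorder_encode(*tree)) + " >")
# < A B -1 C E -1 -1 D F -1 -1 -1 >
```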

Similar to frequent itemset mining on transactional datasets, there are two popular

frequent tree mining approaches:

• generate-and-test

• divide-and-conquer

As the name implies, the generate-and-test approach generates (k+1)-length candidates from

frequent k-length patterns and tests whether the generated (k+1)-length candidates are


[Figure content, reconstructed: the dataset DT and the conditional (projected) databases for the frequent patterns A and B.]

DT:        Tid 1: A B C -1 D -1 -1 -1   Tid 2: A B C -1 -1 F -1 -1   Tid 3: A F D -1 -1 K -1 -1
CondDB_A:  Tid 1: B C -1 D -1 -1        Tid 2: B C -1 -1 F -1        Tid 3: F D -1 -1
CondDB_B:  Tid 1: C -1 D -1             Tid 2: C -1 F -1

Figure 2.16: (a) a document tree dataset; (b) frequent patterns and their projections in that dataset using the pattern-growth approach

frequent or not against the dataset. One of the disadvantages of this approach is the

enormous number of candidate checks required and hence the many scans of

the dataset.

To overcome this disadvantage, a pattern-growth approach [46] was proposed which

divides the search space based on the frequent patterns generated from the previous phase.

Consider a document tree dataset DT as shown in Figure 2.16 with the tree ids and their

corresponding trees. Let us mine DT for frequent subtrees with min supp ≥ 2. A

scan is conducted on DT to identify frequent patterns, say ai, aj , ..., ak, and based on the

discovered frequent patterns, the dataset is projected by extracting the patterns that follow

each frequent pattern. For instance, in Figure 2.16, the frequent patterns are A, B and

C. The conditional datasets are CondDBA, CondDBB and CondDBC for A, B and C

respectively. Further, the projected dataset is scanned to discover frequent items. Based

on the frequent substructures discovered, the dataset is projected and this process is

repeated until there are no more frequent substructures.
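The projection step can be sketched on plain label sequences, as a simplification of the tree case: each tree's label sequence stands in for its string encoding, and the conditional database holds the suffix after the first occurrence of each frequent label. The function and variable names are assumptions, and this sketch does not reproduce the tree semantics of the figure:

```python
from collections import Counter

def pattern_growth(seqs, min_supp, prefix=()):
    """Divide-and-conquer: find frequent labels, project each label's
    conditional (suffix) database, then recurse into each projection."""
    counts = Counter()
    for s in seqs:
        counts.update(set(s))  # document frequency, one count per sequence
    out = []
    for label, supp in sorted(counts.items()):
        if supp >= min_supp:
            pattern = prefix + (label,)
            out.append(pattern)
            # conditional database: suffixes after the first occurrence of label
            proj = [s[s.index(label) + 1:] for s in seqs if label in s]
            out += pattern_growth([p for p in proj if p], min_supp, pattern)
    return out

# label sequences of the three trees in Figure 2.16, min_supp = 2
print(pattern_growth(["ABCD", "ABCF", "AFDK"], 2))
```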


The TreeMiner method proposed by [135] adopts the generate-and-test approach to

generate embedded subtrees. However, it uses a vertical format to ease the support

counting for subtrees. Some of the other frequent pattern mining methods, which use

a generate-and-test strategy, are FREQT [12], HBMFP [77], TreeMinerV [136], uFreqt

[88] to mine induced frequent subtrees. On the other hand, XSpanner and Chopper [117]

utilise pattern-growth approaches to generate embedded subtrees.

Conciseness

General frequent pattern mining methods have often been criticized for producing

a very large number of frequent subtrees, which causes difficulties in understanding and

interpreting the results. In order to reduce the number of frequent subtrees and to derive

meaningful information from these frequent patterns, two popular concise representations

were proposed: closed and maximal.

Closed subtree

In a given document tree dataset, DT = {DT1,DT2,DT3, ...,DTn}, a frequent

subtree DT ′′ is closed iff there exists no frequent subtree DT ′ such that

DT ′ ⊃ DT ′′ and supp(DT ′) = supp(DT ′′). This property is called closure.

Maximal subtree

In a given tree dataset, DT = {DT1,DT2,DT3, ...,DTn}, a frequent subtree

DT ′′ is maximal iff there exists no frequent subtree DT ′ such that DT ′ ⊃ DT ′′.

This property is called maximality.

Maximality is a stronger condition than closure, since every maximal frequent subtree

is also closed; hence M ≤ C ≤ F where:

M = Number of maximal frequent subtrees


C = Number of Closed frequent subtrees

F = Number of Frequent subtrees
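The closure and maximality checks can be sketched as a post-filter over a table of frequent patterns and their supports; here itemsets stand in for subtrees, with the proper-superset relation playing the role of subtree containment, and the names are illustrative:

```python
def closed_and_maximal(frequent):
    """frequent: dict mapping a frozenset pattern to its support.
    Closed:  no proper superpattern has the same support.
    Maximal: no proper superpattern is frequent at all."""
    closed, maximal = [], []
    for p, s in frequent.items():
        supers = [q for q in frequent if p < q]  # proper superpatterns
        if not any(frequent[q] == s for q in supers):
            closed.append(p)
        if not supers:
            maximal.append(p)
    return closed, maximal

freq = {frozenset("a"): 3, frozenset("ab"): 2, frozenset("abc"): 2}
closed, maximal = closed_and_maximal(freq)
# |maximal| = 1 <= |closed| = 2 <= |frequent| = 3, as the inequality M <= C <= F states
```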

There are some methods that generate concise representations using only the generate-

and-test approach. Among them the popular ones are PathJoin [123], CMTreeMiner [24]

and DryadeParent [11]. PathJoin [123] uses a compact data structure called FST-forest to

generate only maximal frequent subtrees. On the other hand, CMTreeMiner was the first

method that was proposed to discover all closed and maximal frequent labelled induced

subtrees without first discovering all frequent subtrees. CMTreeMiner uses two pruning

techniques: left-blanket and right-blanket pruning. The blanket of a tree is defined as the

set of immediate supertrees that are frequent, where an immediate supertree of a tree T

is a tree that has one more node than T . The left-blanket of a tree T is the blanket where

the node added is not in the rightmost path of T (the path from the root to the rightmost

node of T ). The right-blanket of a tree T is the blanket where the node added is in the

rightmost path of T . CMTreeMiner computes, for each candidate tree, the set of trees

that are occurrence-matched with its blanket’s trees. If this set is not empty, two pruning

techniques using the left-blanket and right-blanket are applied. If it is empty, then they

check if the set of trees that are transaction-matched but not occurrence-matched with

its blanket’s trees is also empty. If this is the case, there is no supertree with the same

support and then the tree is closed. CMTreeMiner is a labelled tree method and it was

not designed for unlabelled trees. As pointed out by the authors of CMTreeMiner, if

the number of distinct node labels is very large, then the memory usage of CMTreeMiner is

expected to increase and hence its performance is expected to deteriorate.

Arimura and Uno proposed Cloatt [11], which applies closed mining to attribute trees,

a subclass of labelled ordered trees that can also be regarded as a fragment

of description logic with functional roles only. Additionally, in these attribute trees

no two sibling nodes can have the same label, and they are defined using a relaxed tree


inclusion. In the literature, closed frequent path mining methods also exist [127, 118].

However, due to the presence of sibling relationships in trees, directly extending these path

mining methods to tree mining is not suitable.

Termier et al. proposed DryadeParent [108] as a closed frequent attribute tree mining

method to achieve performance comparable to CMTreeMiner. The DryadeParent method

is based on the computation of tiles (closed frequent attribute trees of depth 1) in the data

and on an efficient hooking strategy that reconstructs the closed frequent trees from these

tiles. Whereas CMTreeMiner [24] uses a classical generate-and-test approach to build

candidate trees edge by edge, the hooking strategy of DryadeParent finds a complete

depth level at each iteration and does not need tree mapping tests. The authors in [108]

claim that their experiments have shown that DryadeParent is faster than CMTreeMiner

in most settings and that the performances of DryadeParent are robust with respect to the

structure of the closed frequent trees to find, whereas the performances of CMTreeMiner

are biased toward trees having most of their edges on their rightmost branch. As attribute

trees are trees such that two sibling nodes cannot have the same label, DryadeParent is not

a method appropriate for dealing with real-life datasets where the sibling nodes have same

labels. To the best of our knowledge, no approach exists that utilises the pattern-growth

technique that are particularly suited for dense datasets to both induced and embedded

subtrees.

2.4.2.4 Paths

The ability to capture the hierarchical relationship between the tags has facilitated the use

of paths for frequent pattern mining. The path model can also be used to represent the

structure of the XML document as discussed earlier. Similar to the frequent tree mining,

frequent paths are discovered. Firstly, a scan of the dataset is conducted to identify the


frequent 1-length paths, each of which is just a node. These frequent nodes are combined to form

2-length paths. Testing is then carried out to verify how often they occur in the dataset.

If they occur more often than min supp, then the paths are considered frequent. This

technique is much more suitable for partial paths than for complete paths as the frequency

of complete paths could often be very low and hence there might not be sufficient frequent

paths to output. Often a large set of subpaths is generated especially for lower support

thresholds or on dense datasets. To reduce the number of common subpaths, a new

threshold called maximum support threshold (max supp) was introduced to avoid the

generation of very common subpaths, as these very common subpaths do not provide any

interesting or new knowledge [4]. In spite of the advancements in frequent path mining,

the frequent paths generated do not capture the sibling relationships between nodes and hence

may incur information loss.
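The min/max support filter on paths can be sketched as follows, where each document is reduced to its set of root-to-node paths; the tag-tuple representation and the names are assumptions:

```python
from collections import Counter

def frequent_paths(docs, min_supp, max_supp):
    """docs: list of sets of root-to-node paths (tuples of tags).
    Keep paths whose document frequency lies in [min_supp, max_supp];
    the upper bound drops very common, uninteresting subpaths."""
    counts = Counter()
    for d in docs:
        counts.update(d)
    return {p for p, c in counts.items() if min_supp <= c <= max_supp}

docs = [
    {("article",), ("article", "title"), ("article", "body")},
    {("article",), ("article", "title")},
    {("article",), ("article", "refs")},
]
# the ubiquitous ("article",) path is filtered out by max_supp
print(frequent_paths(docs, min_supp=2, max_supp=2))  # {('article', 'title')}
```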

2.4.3 Research gaps in frequent pattern mining

The following lists the research gaps in frequent pattern mining:

• Lack of efficient methods to generate concise frequent subtrees using a prefix-based

pattern growth approach.

• Lack of methods that could provide concise frequent subtrees with a parent-child

relationship (induced) or an ancestor-descendant (embedded) relationship.

• Lack of testing on real-life large-sized datasets.

• Lack of testing on the effectiveness of the concise frequent subtrees.


2.5 Summary and discussion

This chapter has reviewed the literature relevant to the two focus areas of this research

namely XML clustering and frequent pattern mining. It has provided details of the various

models of representing the XML documents, their features and the techniques for clustering

and frequent pattern mining. From the literature review, the following limitations are

ascertained based on the current XML clustering and frequent pattern mining methods:

• Lack of effective clustering approaches that could combine the structure and the

content of XML documents non-linearly.

– Lack of effective measures to control the increase in dimensionality due to the

combination of structure and content features.

– Lack of higher-dimension models for clustering XML documents.

– Limited clustering methods using the outcome of frequent pattern mining

methods in clustering the XML documents. Even these clustering methods have used

only one feature and not both features of XML documents. Moreover, these

methods are limited to very small datasets [4].

• Lack of efficient frequent pattern mining methods that could scale to datasets with

a large number of documents and/or a very high branching factor, which is the nature

of dense datasets.

– To the best of the author’s knowledge, no frequent subtree mining approach

exists that could provide concise frequent subtrees using prefix-based pattern

growth methods for mining real-life datasets.

– Lack of concise frequent pattern mining methods that could generate concise

frequent subtrees with different types of node relationships and conciseness.


In order to get around the above-mentioned limitations, frequent subtrees should be

used to identify the common substructures and to utilise them to extract the content from

the XML documents. By doing so, the structure is utilised and its corresponding content is

used for clustering. There could be two ways of representing the combination: implicitly

using the VSM or explicitly using the TSM. These types of combinations can help to

reduce the complexity and improve the scalability of applying clustering to XML documents by

restricting attention to content constrained by the structure features.

2.6 Chapter summary

This chapter has reviewed the current state-of-the-art research in the problem areas ad-

dressed within this thesis. As mentioned in Chapter 1, the two main problem areas are

clustering and frequent pattern mining on semi-structured data, XML. This chapter has

provided an overview of XML, its features, its benefits over other semi-structured data

and the current research on XML mining.

This chapter also examined the state of the research into the specific clustering tasks

addressed in this thesis. They are the data models of XML for the purpose of clustering

and calculating their similarity measures. A survey of the main clustering methods for

XML was presented which helped to identify that the existing XML clustering methods

rely either on the structure or on its content and hence could result in poor accuracy. A

general remark is that due to the complexity and scalability issues, both of these features

are not included in the clustering process. A common challenge is to identify an effective

approach to combine these two features, structure and content, and combat the complexity

without sacrificing the accuracy of the clustering solution.

The remainder of the chapter examined the state of the research into a frequent pat-


tern mining problem, with the main focus on tree-structured data sources and on the

sub-problem of concise frequent subtree generation. It was noted that XML document

structures can be represented as tree structures essentially in the form of rooted, ordered

and labelled trees. However, different types of frequent subtrees exist such as induced

and embedded subtrees based on the type of relationship between the nodes. A general

overview of the theoretical foundations of topics related to frequent subtree mining was

first provided. This allows us to distinguish the important aspects of the tree mining

problem and to understand the importance of tree-structured data over other representations.

Two different frequent pattern generation approaches, apriori and pattern growth, were

discussed. Strengths and weaknesses of these approaches were indicated, followed by an

overview of the existing frequent subtree mining methods with respect to the types of

subtrees mined. Finally, the research gaps in both clustering and frequent pattern mining

were presented. This has guided the research in this thesis, and the following chapters will

describe how this research addresses these limitations effectively.


Chapter 3

Research Design

3.1 Introduction

This chapter describes the research design used in developing the proposed frequent pat-

tern mining and clustering methods. The frequent pattern mining methods for generating

different types of concise frequent patterns are discussed in Chapter 4. Chapter 5 pro-

vides the clustering methods for non-linearly combining the structure and content of XML

documents.

In this chapter, the datasets that have been used to benchmark the proposed frequent

pattern mining and clustering methods against the state-of-the-art methods are

presented. A wide range of both synthetic and real-life datasets exhibiting

variations in their characteristics was chosen to evaluate the impact of the various features

in a dataset on the proposed methods. Some of the characteristics are the nodes’ branching

factor, depth of the trees and the size of the dataset. This chapter also covers a wide

range of evaluation metrics that were used to measure the effectiveness of the approaches

proposed in this thesis.


Finally, several state-of-the-art methods in both frequent pattern mining and clustering

that have been used for evaluating the effectiveness of the proposed techniques have been

provided.

3.2 Research Design

The aim of this research from a clustering perspective is to develop an efficient clustering

method for providing meaningful and accurate clusters. From the frequent pattern mining

perspective, the aim of this research is to develop efficient and effective frequent patterns

that are useful in capturing the structural similarity and that aid in clustering of XML

documents. As illustrated in Figure 3.1, there are three major phases in the proposed

research. Each of them is described as follows.

3.2.1 Phase-One: Pre-processing

The first phase is pre-processing, which includes extracting the structure and content

of the XML documents and representing them in the form of trees. This is one of the

most important and time-consuming tasks in any data mining project, as it is an iterative step that typically requires several passes over the data. The main purpose of this phase is to

provide a suitable representation of XML documents for the mining techniques such as

frequent pattern mining and clustering in the subsequent phases. The first step in this

phase involves extracting the structure and content of XML documents. The structure

and content of XML documents require several pre-processing steps. For instance, the

structure of XML documents needs to be parsed and converted into a tree-like structure

and then represented in a suitable form for mining tasks. An analysis of many XML

documents revealed that they contain redundant information in their structure. To reduce


[Figure 3.1 is a flow diagram of the research design. XML documents enter Phase 1 (pre-processing), in which the structure of the XML documents is modelled as document trees and the document trees are pre-processed. Phase 2 (frequent pattern mining) applies concise frequent subtree mining using prefix-based pattern growth techniques to produce concise frequent subtrees. In Phase 3a (clustering using VSM), content is extracted using the concise frequent subtrees, the content constrained by those subtrees is represented in a Vector Space Model, and a clustering algorithm is applied. In Phase 3b (clustering using TSM), the concise frequent subtrees are clustered, the content extracted using them is represented in a Tensor Space Model (TSM), a tensor decomposition algorithm yields decomposed values, and a clustering algorithm applied to those values produces clusters 1 to N.]

Figure 3.1: Research design


the information overload, these redundant structures should be identified and removed.

The content of XML documents also requires pre-processing such as stemming, stopword

removal and shorter words removal. The output of this phase is the document trees and

the processed content. The next phase of this research is to apply frequent pattern mining

techniques on the generated document trees, based on only the structure features of the

XML documents.
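The content pre-processing steps described above (stopword removal, removal of shorter words, and stemming) can be sketched as follows. This is an illustrative pipeline only: the stopword list is a toy assumption and a crude suffix-stripping stemmer stands in for a full Porter stemmer; it is not the thesis implementation.

```python
import re

# Hypothetical minimal content pre-processing pipeline: tokenise, drop
# stopwords, drop short words, then apply a crude suffix stripper.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "are"}

def crude_stem(word):
    # Illustrative suffix stripping only; a real system would use a
    # proper stemmer such as Porter's algorithm.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text, min_len=3):
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS and len(t) >= min_len]
    return [crude_stem(t) for t in kept]

print(preprocess("Clustering of the indexed XML documents"))
```

The resulting token list is the kind of processed content that, together with the document trees, forms the output of this phase.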

3.2.2 Phase-Two: Frequent Pattern Mining

The main aim of this research is to combine the structure and the content features of XML

documents for clustering. However, representing the structure and the content features

of XML documents in VSM is an expensive task and could cause information overload

for a clustering algorithm. Hence, to reduce this overload, this research proposes to utilise

frequent pattern mining techniques to identify the prominent subtrees and use them to

extract the relevant content.

This phase generates the concise frequent structure of the XML documents to create

a useful form for the clustering phase. Two types of frequent subtrees can be used to extract the content of the XML documents with the structure-based frequent patterns: induced and embedded subtrees, which preserve the parent-child and the ancestor-descendant relationships respectively. This research makes use of both types of subtrees, assuming that embedded subtrees may expose some of the hidden relationships

that are not identified using induced subtrees. The number of generated subtrees after

applying the frequent tree mining algorithms to document trees is usually very large. It

is essential to reduce the number of frequent subtrees generated by identifying only the

concise representations such as closed and maximal.
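The distinction between closed and maximal patterns can be illustrated with a small sketch. For brevity it uses itemsets, with proper set inclusion standing in for the subtree relation; the support values are toy assumptions, not data from the thesis.

```python
# Closed vs. maximal frequent patterns, illustrated with itemsets as a
# stand-in for subtrees (proper set inclusion `<` plays the role of the
# subtree relation). A frequent pattern is CLOSED if no frequent
# superpattern has the same support, and MAXIMAL if it has no frequent
# superpattern at all. The support values below are toy assumptions.
frequent = {
    frozenset("a"): 5,
    frozenset("b"): 4,
    frozenset("ab"): 4,
    frozenset("abc"): 2,
}

def closed(patterns):
    return {p for p, s in patterns.items()
            if not any(p < q and s == patterns[q] for q in patterns)}

def maximal(patterns):
    return {p for p in patterns if not any(p < q for q in patterns)}

# {"b"} is not closed: its superset {"a","b"} has the same support (4);
# only {"a","b","c"} is maximal, since nothing frequent contains it.
```

Because every maximal pattern is also closed, the maximal set is the smaller (lossier) of the two concise representations.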

This thesis proposes four frequent pattern mining algorithms for generating the closed


and maximal frequent induced and embedded subtrees. Utilising the concise frequent

subtrees generated from the frequent pattern mining algorithms, the content from the XML

documents is extracted and represented in a form suitable for clustering. By doing so, the dimensionality of the input data matrix for clustering is reduced.

3.2.3 Phase-Three (a): Clustering using VSM

This phase involves implicitly representing the structure and the content of XML docu-

ments using the frequent subtrees. It performs a non-linear combination of structure and

content features by utilizing the concise frequent subtrees generated from the previous

phase to extract the content and represent it in a Vector Space Model (VSM). Finally, a clustering algorithm is applied on the VSM to create the required number of clusters.
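Phase 3a can be sketched as follows, under simplifying assumptions: the tokens per document are assumed to be those already extracted using the concise frequent subtrees, and tf-idf weighting stands in for the actual term weighting; all data below are toy values.

```python
import math
from collections import Counter

# Hypothetical Phase 3a sketch: `docs` maps a document id to the content
# tokens already extracted using the concise frequent subtrees (terms
# outside those subtrees are assumed filtered out earlier). Each row of
# the VSM is a tf-idf vector over the shared vocabulary.
docs = {
    "d1": ["xml", "cluster", "tree"],
    "d2": ["xml", "tensor", "cluster"],
    "d3": ["graph", "mining"],
}

vocab = sorted({t for tokens in docs.values() for t in tokens})
N = len(docs)
df = Counter(t for tokens in docs.values() for t in set(tokens))

def tfidf_vector(tokens):
    # One row of the VSM: tf * idf for every vocabulary term.
    tf = Counter(tokens)
    return [tf[t] * math.log(N / df[t]) if t in tf else 0.0 for t in vocab]

vsm = {d: tfidf_vector(tokens) for d, tokens in docs.items()}
```

A clustering algorithm would then be applied to the rows of `vsm` to produce the required number of clusters.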

3.2.4 Phase-Three (b): Clustering using TSM

This phase proposes a novel way of combining the structure and the

content using a multi-dimensional model. It begins with clustering the concise frequent

subtrees and then extracting the content corresponding to them. The extracted content,

along with the structure and the document id, is then represented in a higher-order Tensor

Space Model (TSM). Unlike the VSM, the TSM involves representing both the features

– structure and the content – in an explicit manner for a given document. A tensor

decomposition algorithm is then applied on the tensor; the resulting decomposed values

provide the structure and content similarity between the documents. Clustering is then

applied on the decomposed values to generate the required number of clusters.
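The idea of Phase 3b can be sketched as follows. The tensor below is random toy data, and a truncated SVD of the document-mode matricization stands in for the thesis's tensor decomposition algorithm; the point is only how per-document decomposed values arise for clustering.

```python
import numpy as np

# Sketch under simplifying assumptions: a 3rd-order tensor indexed by
# (document, subtree-cluster, term) holds term frequencies. Unfolding
# (matricization) along the document mode followed by a truncated SVD is
# a generic stand-in for the proposed decomposition; the left factors
# give per-document coordinates on which clustering can be applied.
rng = np.random.default_rng(0)
tensor = rng.integers(0, 3, size=(6, 4, 5)).astype(float)  # 6 toy documents

docs_unfolded = tensor.reshape(tensor.shape[0], -1)  # document-mode unfolding
U, S, Vt = np.linalg.svd(docs_unfolded, full_matrices=False)
doc_coords = U[:, :2] * S[:2]  # one row of "decomposed values" per document
```

In the thesis, clustering is then applied to the decomposed values; here that would mean clustering the rows of `doc_coords`.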

In order to evaluate the proposed methods developed for both frequent pattern mining

and clustering, the following section describes the experiment set-up.


3.3 Experiment Set-Up

Experiments were conducted on the QUT High Performance Computing system, with a

RedHat Linux operating system, 16GB of RAM and a 3.4GHz 64bit Intel Xeon processor

core. C++, C# and Matlab were used for the implementation of the proposed algorithms.

To be consistent with the other frequent tree mining algorithms for the purpose of bench-

marking, C++ was used to implement the proposed frequent tree mining algorithms. C#

was used for parsing the XML documents and extracting their structure and content. For creating and manipulating the tensors, a tensor toolbox was available in Matlab, and hence this programming language was used. Matlab was also used to develop the proposed tensor decomposition algorithm; Python was used as the scripting language to evaluate the results.

3.4 Datasets

Both the synthetic and real-life datasets were used in the evaluation of the mining tech-

niques. The synthetic datasets were primarily used to benchmark some of the existing

frequent tree mining algorithms against the proposed methods on their runtime and scalability performance. The real-life datasets range in size from small to large.

3.4.1 Synthetic Datasets

The Zaki’s tree generator1 has often been used to generate the synthetic datasets for

benchmarking the tree mining algorithms. Using the Zaki’s tree generator, two synthetic

datasets were generated, namely the F5 and D10 datasets, with the parameters as indicated

1 http://www.cs.rpi.edu/~zaki/software


in Table 3.1, where “f” represents the fan out factor, “d” the depth of the tree, “n” the

number of unique labels for the trees, “m” the total number of nodes in a parent tree and

“t” indicates the number of trees.

Table 3.1: Synthetic datasets and their parameters

  Name   Description
  F5     -f 5 -d 10 -n 100 -m 100 -t 100000
  D10    -f 10 -d 10 -n 100 -m 100 -t 100000

Studies have indicated that the performance of some of the existing frequent subtree

mining methods degrades for datasets having a high branching factor [108]. To evaluate

the performance of the proposed frequent pattern mining algorithms against the current

state-of-the-art algorithms, the two datasets F5 and D10, with fan-out factors of 5 and 10 respectively, were generated.

3.4.2 Real-life Datasets

Five real-life datasets have been used for benchmarking both frequent subtree mining and

clustering methods. These are classified based on the size of the dataset into the following

groups:

� Small-sized real-life dataset;

� Medium-sized real-life dataset; and

� Large-sized real-life datasets.

3.4.2.1 Small-sized real-life dataset

The ACM dataset is a small-sized real-life dataset that contains 140 XML documents

corresponding to two DTDs, IndexTermsPage.dtd and OrdinaryIssuePage.dtd (with about

70 XML documents for each DTD), similar to the setup in XProj [4]. It does not contain


Table 3.2: Details of categories in the ACM dataset

  Category types      Categories                            # Documents
  Structure-Only      IndexTermsPages                       70
                      OrdinaryIssuePages                    70
  Structure-Content   DTD-based                             70
                      General                               7
                      Mobile computing                      3
                      Database Management Systems (DBMS)    42
                      Others                                18

any schema definitions such as XSD or DTD. Also, this dataset contains both semantic

tags and formatting tags. Table 3.2 provides the two sets of categories for this dataset.

Previous researchers [4] have used the ACM dataset to cluster the documents into two

groups according to their structural similarity. To compare the proposed work with theirs,

experiments were conducted with two cluster categories according to structural similarity.

It is comparatively straightforward to group this dataset according to structural similarity, as the documents conform to two different schema definitions.

More complexity has been added in the second set of experiments by conducting the

structure-and-content-based clustering. This experimental design utilises expert knowl-

edge and is based on 5 groups considering both the structural and the content features of

XML documents. The first category is based on the document structure and the remaining

four categories are based on the document content, namely General, Mobile computing,

Database Management Systems (DBMS) and Others.

3.4.2.2 Medium-sized real-life dataset

The dataset used in this thesis is a subset of journal articles and conference papers from

the original XML DBLP archive. Table 3.3 shows the details of the DBLP and Table 3.4

shows the categories that have been used to split the documents in this dataset.

This dataset is a subset of the DBLP archive, which is a digital bibliography on com-

puter science containing journal articles, conference papers, books, book chapters and

theses. DBLP exhibits a structural variety different from that of the other datasets. It is


Table 3.3: Details of ACM and DBLP datasets

  Attributes                     ACM     DBLP
  No. of Docs                    140     3882
  No. of tags                    38      32
  No. of internal nodes          2070    28674
  Max length of a document       45      25
  Average length of a document   14      7
  No. of distinct terms          7135    10766
  Total No. of words             38141   75742
  Size of the collection         1 MB    4.36 MB
  Presence of formatting tags    No      No
  Presence of Schema             Yes     No
  Number of Categories           5       8

Table 3.4: Details of categories in the DBLP dataset

  Category Name       # Documents
  Books               1282
  Conference          1664
  Journals            783
  Miscellaneous       2
  Persons             13
  Phd                 74
  Technical report    29
  World wide web      35

characterized by a small average depth and offers quite short text descriptions (e.g., author names, paper titles, conference names). Table 3.4 lists the 8 categories in the DBLP dataset; most of the articles come from books, conferences and journals.

3.4.2.3 Large-sized real-life datasets

The three datasets that belong to this group are :

� INEX IEEE dataset;

� INEX 2007 dataset; and

� INEX 2009 dataset.

These datasets, each of more than 5000 documents, were obtained from the clustering

task in the INitiative for the Evaluation of XML Retrieval (INEX)2. INEX is a collaborative forum that brings together researchers from many fields to evaluate their methods

2http://www.inex.otago.ac.nz/


in XML Mining and IR, using real-life datasets such as Wikipedia and IEEE proceedings.

The clustering task in this forum began in 2002 with the IEEE proceedings. In 2005 this

collection of IEEE proceedings was expanded with more IEEE proceedings and in 2006

the IEEE collection was complemented with an XML dump of the Wikipedia, which was

later updated in 2009. The Wikipedia dataset used in 2008 was considered highly unstable [137], and it contained a very small number of labels.

INEX IEEE dataset

The IEEE collection version 2.2, which has been used in the INEX document mining

track 2006, consists of 6054 articles originally published in 23 different IEEE journals

from 2002 to 2004. The articles follow a complex schema that includes front matter, back

matter, section headings, text formatting tags, and mathematical formulae [104].

Table 3.5 provides the details of the categories in this dataset with 6 thematic labels

and 2 structural labels. The thematic or content labels are Computer, Graphics, Hard-

ware, Artificial Intelligence (AI), Internet and Parallel Computing. The structural labels

are IEEE Transactions and IEEE Journals. For instance, the “tc” category belongs to the “Transactions” structural label and the “Computer” content/thematic label; in other words, “tc” denotes the IEEE Transactions on Computers.

Table 3.5: Details of categories in the INEX IEEE dataset

  Content / Structure   Computer             Graphics   Hardware   AI       Internet   Parallel
  Transactions          tc, ts               tg         tp, tk     -        -          td
  Journals              an, co, cs, it, so   cg         dt, mi     ex, mu   lc         pd

INEX 2007 dataset

The INEX 2007 Wikipedia clustering task corpus contains 48,305 documents. These

documents have deep structures and a high branching factor. The document set does not contain any schema definitions such as XSD or DTD. Also, this dataset contains both


semantic tags and formatting tags.

Table 3.6 lists the categories in the INEX 2007 dataset. There are 21 categories, which are not well balanced: some categories are large (Portal:Law comprises about 25% of the documents), while others are small (Portal:Music is very small, with only 0.5% of the documents) [41]. These 21 categories help to identify how the proposed models behave in the presence of small categories, and also to identify ambiguous categories such as Portal:Pornography and Portal:Sexuality, or Portal:Christianity and Portal:Spirituality.

Table 3.6: Details of categories in the INEX 2007 dataset

  Id        Category                  # Documents
  2112299   Portal:Law                12105
  1597184   Portal:Literature         8418
  1484914   Portal:Sports and games   7267
  1480358   Portal:Art                3884
  1886386   Portal:Physics            2659
  3091788   Portal:Christianity       2234
  2773006   Portal:Chemistry          2314
  1685758   Portal:History            1588
  3091127   Portal:Spirituality       1329
  2914908   Portal:Sexuality          1219
  2879927   Portal:War                1130
  1620218   Portal:Archaeology        660
  1507239   Portal:Aviation           617
  2328885   Portal:Formula One        591
  1486363   Portal:Astronomy          555
  1895383   Portal:Trains             484
  2635947   Portal:Comics             304
  2314377   Portal:University         275
  2257163   Portal:Pornography        241
  2263642   Portal:Writing            230
  474166    Portal:Music              201

INEX 2009 dataset

The INEX 2009 clustering task corpus, containing 54,575 documents, was used in this research, as the clustering task attracted a number of submissions against which the proposed clustering methods can be evaluated. The subset contained 5,243 unique

entity tags and 1,900,072 unique terms. Table 3.7 shows the INEX 2009 dataset used in

this research.

As shown in Table 3.7 there are two sets of categories in this dataset. The first set

of categories is derived from Wikipedia categories and the top-20 of the categories are


listed in Table 3.8. A complete list of all the categories in this set is provided in Appendix

A.1. In this category set, a document may belong to multiple categories, and most documents belong to more than one. Hence, the sum of the number of documents over all categories exceeds the total number of documents in the collection.

Table 3.7: Details of large-sized datasets

  Attributes                     INEX IEEE   INEX 2007    INEX 2009
  No. of Docs                    6054        48,305       54,575
  No. of tags                    165         5814         34,686
  No. of internal nodes          472,351     4,487,819    15,128,407
  Max length of a document       691         659          10347
  Average length of a document   78          19           277
  No. of distinct terms          114,976     535,351      1,900,072
  Total No. of words             3,695,550   16,682,466   21,480,198
  Size of the collection         272 MB      360 MB       2.94 GB
  Presence of formatting tags    Yes         Yes          Yes
  Presence of Schema             Yes         Yes          Yes
  Number of Categories           18          15           4052

Table 3.8: Details of the top-20 categories in the INEX 2009 dataset using Wikipedia categories

  Id   Category           # Documents
  1    People             15359
  2    Society            12663
  3    Geography          9065
  4    Culture            9033
  5    Politics           8589
  6    History            8035
  7    Nature             5788
  8    Countries          5724
  9    Applied sciences   5568
  10   Humanities         5205
  11   Business           3734
  12   Technology         3584
  13   Science            3378
  14   Arts               2837
  15   Historical eras    2780
  16   Health             2760
  17   Entertainment      2521
  18   Belief             2417
  19   Life               2301
  20   Language           2140

The second set of categories in the INEX 2009 dataset is used to evaluate the collection selection problem (discussed in Section 3.5). It is based on the 52 topics in the ad hoc

queries posed by the volunteers in the INEX forum. Table 3.9 lists the top-20 queries that

were used and the number of documents that were found to be relevant to the query. The

full list of all the categories in this set is provided in Appendix A.2. Among the 52 queries, about 22 had fewer than 5 relevant documents. Using this set of categories helps to identify an accurate clustering solution that could be useful for


Table 3.9: Details of the top-20 categories in the INEX 2009 dataset using ad hoc queries

  Id        Query Title                                                   # Documents
  2009043   NASA missions                                                 135
  2009005   Chemists physicists scientists alchemists periodic
            table elements                                                82
  2009093   French revolution                                             40
  2009013   Native American Indian wars against colonial Americans        33
  2009039   Roman architecture                                            27
  2009063   D-Day normandy invasion                                       27
  2009040   Steam engine                                                  25
  2009055   European union expansion                                      24
  2009036   Notting Hill Film actors                                      22
  2009051   Rabindranath Tagore Bengali literature                        18
  2009023   “Plays of Shakespeare”+Macbeth                                16
  2009076   Sociology and social issues and aspects in science fiction    14
  2009105   Musicians Jazz                                                10
  2009035   Bermuda Triangle                                              9
  2009061   France second world war normandy                              9
  2009064   Stock exchange insider trading crime                          9
  2009096   Eiffel                                                        9
  2009001   Nobel prize                                                   8
  2009011   Olive oil health benefit                                      8
  2009033   Al-Andalus taifa kingdoms                                     8

information retrieval.

3.5 Evaluation measures

Distinct evaluation measures were used for both the frequent pattern mining phase and

clustering phase in this research. For evaluating frequent pattern mining methods, the

commonly used metrics such as the runtime and the number of frequent patterns generated for various support thresholds (min supp) are utilised. On the other hand, purity, the F1 measure and NMI were used for evaluating the quality of the clustering solutions produced by the proposed clustering methods and the clustering benchmarks. The

execution times for the decomposition algorithms are also used to evaluate the clustering

methods.

Existing evaluation metrics are not suitable for evaluating the clustering methods from an information retrieval perspective. Hence, exploiting the manual relevance assessments available for the INEX 2009 dataset, a new measure called Normalized Cumulative Cluster Gain (NCCG) (introduced in [81] as part of the INEX 2009 clustering task) was


utilised for evaluating the effectiveness of the clustering solution for the problem of collec-

tion selection.

3.5.1 Frequent pattern mining

There are two evaluation measures that are used to evaluate the performance of the fre-

quent pattern mining methods. They are the runtime of these methods and the number

of frequent patterns generated from the frequent pattern mining methods.

Runtime (λ) in seconds

This is the time taken to complete the generation of the frequent patterns from the

given dataset for a given support threshold (min supp). It is measured in seconds.

Number of Frequent Patterns (ρ)

This is the total number of frequent patterns generated from a given dataset for a

given support threshold (min supp).

3.5.2 Clustering

This research focuses on using purity, F1 and NMI measures to evaluate the clustering

methods.

Purity

The standard criterion of purity is used to determine the quality of clusters by mea-

suring the extent to which each cluster contains documents primarily from one category.

The simplicity and popularity of this measure mean that it was used as the only evaluation measure for the clustering tasks in INEX 2006 and INEX 2009. In general,

the larger the value of purity, the better the clustering solution.


Let ω = {w_1, w_2, . . . , w_K} denote the set of clusters for the dataset D and ξ = {c_1, c_2, . . . , c_J} represent the set of categories. The purity of a cluster w_k is defined as:

P(w_k) = \frac{\max_j |w_k \cap c_j|}{|w_k|}    (3.1)

where w_k is the set of documents in cluster w_k and c_j is the set of documents in category c_j. The numerator is the size of the majority category in cluster w_k and the denominator is the number of documents in cluster w_k.

The purity of the clustering solution ω can be calculated as micro-purity or macro-purity. Micro-purity of the clustering solution ω is the weighted sum of the individual cluster purities, with weights proportional to the cluster sizes. Macro-purity is the unweighted arithmetic mean over the total number of categories J [29].

Micro-Purity(ω, ξ) = \frac{\sum_{k=1}^{K} P(w_k) \cdot |w_k|}{\sum_{k=1}^{K} |w_k|}    (3.2)

Macro-Purity(ω, ξ) = \frac{\sum_{k=1}^{K} P(w_k)}{J}    (3.3)
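Equations (3.1)–(3.3) can be computed directly from cluster assignments; the following sketch uses a hypothetical five-document example (the documents, clusters and categories are assumptions for illustration).

```python
from collections import Counter

# Purity per Equations (3.1)-(3.3) on a toy example: micro-purity
# weights each cluster purity by cluster size; macro-purity averages
# the cluster purities over the J categories, as in the text.
clusters = {"w1": ["d1", "d2", "d3"], "w2": ["d4", "d5"]}
categories = {"d1": "c1", "d2": "c1", "d3": "c2", "d4": "c2", "d5": "c2"}

def cluster_purity(members):
    counts = Counter(categories[d] for d in members)
    return max(counts.values()) / len(members)   # majority share, Eq. (3.1)

n = sum(len(m) for m in clusters.values())
micro = sum(cluster_purity(m) * len(m) for m in clusters.values()) / n
J = len(set(categories.values()))
macro = sum(cluster_purity(m) for m in clusters.values()) / J

print(micro, macro)
```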

F1-measure

Another standard measure that is used to evaluate the clustering solution is the F1-

measure. It helps to calculate not only the number of documents that are correctly clas-

sified together in a cluster but also the number of documents that are misclassified from

the cluster.

In order to calculate the F1-measure, three of the four possible pairwise decisions are used. A True Positive (TP) decision assigns two similar documents to the same cluster, while a False Positive (FP) is an error decision that assigns two dissimilar documents to the same cluster [76]. A False Negative (FN) is an error decision that assigns two similar documents to different clusters. The fourth decision, a True Negative (TN), which correctly assigns two dissimilar documents to different clusters, is not used in calculating the F1-measure.

Using the TP, FP and FN decisions, the precision and the recall for the micro-F1 are defined as:

precision_{micro-F1} = \frac{\sum_{j=1}^{J} TP_j}{\sum_{j=1}^{J} (TP_j + FP_j)}    (3.4)

recall_{micro-F1} = \frac{\sum_{j=1}^{J} TP_j}{\sum_{j=1}^{J} (TP_j + FN_j)}    (3.5)

The precision and the recall for the macro-F1 are defined as:

precision_{macro-F1} = \frac{1}{J} \sum_{j=1}^{J} \frac{TP_j}{TP_j + FP_j}    (3.6)

recall_{macro-F1} = \frac{1}{J} \sum_{j=1}^{J} \frac{TP_j}{TP_j + FN_j}    (3.7)

where TP_j is the number of documents of category c_j that appear in its matching cluster w_k, FP_j is the number of documents that are not in category c_j but appear in cluster w_k, and FN_j is the number of documents that are in category c_j but do not appear in cluster w_k.

F1 can now be defined as:


F1 = \frac{2 \times precision \times recall}{precision + recall}    (3.8)

Micro-F1 = \frac{2 \times precision_{micro-F1} \times recall_{micro-F1}}{precision_{micro-F1} + recall_{micro-F1}}    (3.9)

Macro-F1 = \frac{2 \times precision_{macro-F1} \times recall_{macro-F1}}{precision_{macro-F1} + recall_{macro-F1}}    (3.10)
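A sketch of the micro- and macro-F1 computation on a toy example; matching each category to the cluster containing most of its documents is an assumption made for illustration, as are the documents and clusters themselves.

```python
# Micro/macro F1 sketch: each category cj is matched to the cluster that
# contains most of its documents; TP, FP, FN are counted per category and
# combined either before (micro) or after (macro) forming the F1 ratio.
clusters = {"w1": {"d1", "d2", "d3"}, "w2": {"d4", "d5"}}
categories = {"c1": {"d1", "d2"}, "c2": {"d3", "d4", "d5"}}

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

stats = []
for cj in categories.values():
    wk = max(clusters.values(), key=lambda w: len(w & cj))  # matching cluster
    stats.append((len(wk & cj), len(wk - cj), len(cj - wk)))  # TP, FP, FN

micro = f1(*(sum(s[i] for s in stats) for i in range(3)))
macro = sum(f1(*s) for s in stats) / len(stats)
```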

Normalized Mutual Information (NMI)

Another evaluation measure is the Normalized Mutual Information (NMI), which quantifies the trade-off between the quality of the clusters and the number of clusters [76].

NMI [76] is defined as:

NMI(ω, ξ) = \frac{I(ω; ξ)}{[H(ω) + H(ξ)]/2}    (3.11)

I(ω; ξ) = \sum_k \sum_j P(w_k \cap c_j) \log \frac{P(w_k \cap c_j)}{P(w_k)\,P(c_j)} = \sum_k \sum_j \frac{|w_k \cap c_j|}{N} \log \frac{N\,|w_k \cap c_j|}{|w_k|\,|c_j|}    (3.12)

where P(w_k), P(c_j) and P(w_k ∩ c_j) indicate the probabilities of a document being in cluster w_k, in category c_j, and in both w_k and c_j, respectively, and N is the total number of documents.

H(ω) is the entropy, a measure of uncertainty, given by:

H(ω) = -\sum_k P(w_k) \log P(w_k) = -\sum_k \frac{|w_k|}{N} \log \frac{|w_k|}{N}    (3.13)
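Equations (3.11)–(3.13) can be computed from a contingency table of cluster-by-category counts, as in this toy sketch. Natural logarithms are used; since the log base cancels in the ratio, any fixed base gives the same NMI.

```python
import math

# NMI per Equations (3.11)-(3.13) from a toy contingency table mapping
# (cluster, category) pairs to document counts.
counts = {("w1", "c1"): 2, ("w1", "c2"): 1, ("w2", "c2"): 2}
N = sum(counts.values())

def marginal(idx):
    # Sum the table over one coordinate to get |w_k| or |c_j|.
    m = {}
    for key, n in counts.items():
        m[key[idx]] = m.get(key[idx], 0) + n
    return m

def entropy(m):
    return -sum(n / N * math.log(n / N) for n in m.values())

wk, cj = marginal(0), marginal(1)
mi = sum(n / N * math.log(N * n / (wk[w] * cj[c]))
         for (w, c), n in counts.items())
nmi = mi / ((entropy(wk) + entropy(cj)) / 2)
```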


3.5.3 Collection selection evaluation using NCCG measure

This evaluation measure was used in evaluating the INEX 2009 dataset [81] and is based on

Van Rijsbergen’s clustering hypothesis. Van Rijsbergen and his co-workers [92] studied the clustering hypothesis in information retrieval intensively; the hypothesis states that documents which are similar to each other may be expected to be relevant

to the same requests; dissimilar documents, conversely, are unlikely to be relevant to the

same requests. If the hypothesis holds true, then relevant documents will appear in a small

number of clusters and the document clustering solution can be evaluated by measuring

the spread of relevant documents for the given set of queries.

To test this hypothesis on a real-life dataset, the INEX 2009 dataset, the clustering

task was evaluated by determining the quality of clusters relative to the optimal collection

selection [81]. Collection selection involves splitting a collection into subsets and recommending which subset needs to be searched for a given query. This allows a search engine to search fewer documents, resulting in improved runtime performance over

searching the entire collection.

The evaluation of collection selection was conducted using the manual query assess-

ments for a given set of queries from the INEX 2009 Ad Hoc track [81]. The manual

query assessment is called the relevance judgment in Information Retrieval (IR) and has

been used to evaluate ad hoc retrieval of documents. It involves defining a query based on

the information need, a search engine returning results for the query and humans judging

whether the results returned by the search engine are relevant to the information need.

Better clustering solutions in this context will tend to (on average) group together

relevant results for (previously unseen) ad hoc queries. Real ad hoc retrieval queries and

their manual assessment results are utilised in this evaluation. This approach evaluates


the clustering solutions relative to a very specific objective – clustering a large document

collection in an optimal manner in order to satisfy queries while minimising the search

space. The metric used for evaluating the collection selection is called the Normalized

Cumulative Cluster gain (NCCG) [81].

The NCCG is used to calculate the score of the best possible collection selection according to a given clustering solution of n clusters. The score is better when the query result set is concentrated in a few cohesive clusters. The Cumulative Gain of a Cluster (CCG) is calculated by counting the number of documents of the cluster that appear in the relevant set returned for a topic by the manual assessors.

CCG(c, t) = \sum_{i=1}^{n} Rel_i    (3.14)

where Rel_i is 1 if the i-th document of cluster c is in the relevant set for topic t and 0 otherwise.

For a clustering solution for a given topic, a (sorted) vector CG is created representing

each cluster by its CCG value. Clusters containing no relevant documents are represented

by a value of zero. The cumulated gain for the vector CG is calculated, which is then

normalized on the ideal gain vector. Each clustering solution c is scored for how well it

has split the relevant set into clusters using CCG for the topic t.

SplitScore(t, c) = \frac{\sum_{i=1}^{|CG|} cumsum(CG)_i}{n_r^2}    (3.15)

where n_r is the number of relevant documents in the returned result set for the topic t and cumsum indicates the cumulative sum over the vector CG.

The worst possible split is assumed to place each relevant document in a distinct cluster. Let CG1 be the vector that contains the cumulative gain of every cluster


under this worst-case split.

MinSplitScore(t, c) = (Σ_{i=1}^{|CG1|} cumsum(CG1)_i) / nr²    (3.16)

The normalized cluster cumulative gain (nCCG) for a given topic t and a clustering

solution c is given by,

nCCG(t, c) = (SplitScore(t, c) − MinSplitScore(t, c)) / (1 − MinSplitScore(t, c))    (3.17)

The mean and the standard deviation of the nCCG score over all the topics for a

clustering solution are then calculated.

Mean(nCCG(c)) = (Σ_{t=0}^{n} nCCG(t, c)) / Total Number of topics    (3.18)

Std Dev(nCCG(c)) = √( Σ_{t=0}^{n} (nCCG(t, c) − Mean(nCCG(c)))² / Total Number of topics )    (3.19)

The NCCG value ranges from 0 to 1. A larger NCCG value indicates a better clustering solution, since it means that more of the relevant documents are grouped into the same clusters. Further details of this metric can be found in [81].
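Putting Eqs. 3.14-3.17 together, the per-topic score can be sketched as follows. This is one illustrative reading, not the reference implementation of [81]: `clusters` is a list of sets of document ids, `relevant` is the manually assessed relevant set for one topic, and the worst-case vector CG1 is assumed to contain one singleton entry per relevant document. All names are illustrative.

```python
# A minimal sketch of the NCCG metric (Eqs. 3.14-3.17); the exact lengths of
# the CG/CG1 vectors are an assumption here, and the authoritative
# definition is in [81].

def split_score(ccg, nr):
    """Eq. 3.15/3.16: sum of the cumulative sums of a CCG vector over nr^2."""
    total, running = 0, 0
    for gain in ccg:
        running += gain
        total += running
    return total / nr ** 2

def nccg(clusters, relevant):
    # Eq. 3.14: per-cluster count of relevant documents, sorted descending.
    ccg = sorted((len(c & relevant) for c in clusters), reverse=True)
    nr = sum(ccg)
    if nr == 0:
        return 0.0
    split = split_score(ccg, nr)
    # Eq. 3.16: the worst split puts each relevant document in its own cluster.
    min_split = split_score([1] * nr, nr)
    if min_split == 1.0:          # degenerate case: a single relevant document
        return 1.0
    return (split - min_split) / (1 - min_split)   # Eq. 3.17
```

With three clusters and all relevant documents gathered into one of them, the score is 1; scattering one relevant document per cluster gives 0.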

Decomposition Time (λd)

This is the time, in seconds, taken to decompose the tensor model that has


been built using the structure and the content of XML documents.

3.6 Benchmarks

This section details the benchmarks used for evaluating both the frequent pattern min-

ing and clustering methods. The aim of evaluating the proposed methods against these

benchmarks is to understand the strengths and weaknesses of both the proposed methods

and these benchmarks for the chosen datasets (detailed in the previous section). Not only were the state-of-the-art methods in frequent pattern mining and clustering used as benchmarks, but additional methods were also created to serve as benchmarks and to assess the effectiveness of the proposed methods.

3.6.1 Frequent pattern mining

The proposed methods were evaluated against other state-of-the-art methods. In frequent pattern mining, methods such as MB3-Miner [105], TreeMinerV [134], PrefixTreeISpan [141]

and PrefixTreeESpan [140] are used to benchmark the proposed frequent pattern min-

ing methods. Among them, MB3-Miner and TreeMinerV are representatives of the generate-and-test approach. The PrefixTreeISpan and

PrefixTreeESpan methods adopt a prefix-based pattern growth algorithm to generate fre-

quent induced and embedded subtrees respectively. Table 3.10 details the benchmarks for

frequent pattern mining in terms of the type of subtrees, the generation approach and the distinct advantage of each benchmark.


Table 3.10: Benchmarks for frequent pattern mining methods

Name                  | Type of Subtrees | Generation approach         | Distinct advantage
MB3-Miner [105]       | Induced          | Generate-and-test           | Tree Model Guided candidate generation to reduce candidate enumeration
TreeMinerV [134]      | Embedded         | Generate-and-test           | Simplicity
PrefixTreeISpan [141] | Induced          | Prefix-based pattern growth | Suitable for dense datasets
PrefixTreeESpan [140] | Embedded         | Prefix-based pattern growth | Suitable for dense datasets

3.6.2 Clustering

To clearly understand the strengths and weaknesses of the proposed hybrid clustering

methods, various clustering representations, other clustering methods from INEX and

clustering methods using different decomposition techniques were used as benchmarks in

this research. This subsection details each of them. Table 3.11 lists the benchmarks for

clustering methods.

Table 3.11: Benchmarks for clustering methods

Based on                                        | Name
Representations                                 | SO, CO, S+C
Clustering methods from INEX                    | BilWeb-CO [81], PCXSS [84], CRP and 4RP [131], Doucet et al. [36]
Clustering using different tensor decompositions | CP, Tucker, MACH

3.6.2.1 Based on representations

The objective of this set of comparisons was to evaluate whether the combination of

structure and content used by the proposed clustering methods is better than the structure-only and content-only representations. Furthermore, it is also used to understand whether the proposed

type of combination is better than the linear or naive way of combining the structure and

the content of XML documents for clustering.

The following are the representations that are used for comparing the outputs of the

proposed clustering method on various real-life datasets.


Structure Only (SO) Representation

An input matrix D×CF is generated where D represents the documents and CF rep-

resents the list of concise frequent induced subtrees that have been used. Each document

is represented by the CF subtrees that are present in it.

Content Only (CO) Representation

The content of the XML documents is represented in a matrix D×Terms, where the Terms are obtained after pre-processing techniques such as stop-word removal, stemming and integer removal. Each entry in the matrix contains the frequency of a term in a document.

Structure and Content (S+C) Representation

In this representation, the structure and content features for the documents are rep-

resented in a matrix by concatenating the occurrences of the structure features (CF sub-

trees) and the content features (terms) side by side for a document. It is represented as

[D × CF ;D × Terms].
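The three representations can be illustrated with toy data as follows; the feature values below are invented for the example (in practice sparse matrices over the real datasets would be used):

```python
# Illustrative construction of the SO, CO and S+C representations as plain
# row lists. The values are toy data, not from the thesis datasets.

so = [[1, 0, 1],          # D x CF: occurrences of concise frequent subtrees
      [1, 1, 0]]
co = [[2, 0, 0, 3],       # D x Terms: term frequencies after pre-processing
      [0, 1, 4, 0]]

# S+C: concatenate the structure and content features side by side for each
# document, i.e. [D x CF ; D x Terms].
s_plus_c = [s_row + c_row for s_row, c_row in zip(so, co)]
```

Each document row in `s_plus_c` now carries CF + Terms features, which is the linear combination that the proposed methods are compared against.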

3.6.2.2 Based on other clustering methods from INEX

This research aims to evaluate the proposed methods against the available clustering results of other participants in the INEX forum. These methods are: BilWeb-CO [81], PCXSS [84], CRP, 4RP and Word descriptor [131], Doucet et al. [36], and the Self-Organising Map for Structured Data (SOM-SD), Contextual SOM-SD (CSOM-SD) and GraphSOM (Graph-based SOM) approaches. Each of these clustering methods is discussed in detail as follows:

BilWeb-CO [81]

This methodology was proposed by a research team from the Bilkent Web Databases

Research Group, which used Cover-Coefficient based Clustering Methodology (C3M) for


clustering the XML documents on the INEX 2009 dataset detailed in Section 3.4. C3M

is a single-pass partitioning type clustering method based on the probability of selecting

a document given a term that has been selected from another document. To scale for the

large collection of documents in the INEX 2009 dataset, a compact representation of the

documents was generated by adapting term-centric and document-centric index pruning

techniques. They cluster the documents with these compact representations for various

pruning levels, again using the C3M method.

Progressively Clustering XML by Semantic and Structural Similarity (PCXSS)

[84]

This clustering method uses only the structural similarity between XML documents. Firstly, it defines a similarity measure called CPSim (Common Path Similarity), which is computed between two XML documents based on the following criteria:

• The number of common nodes between the document and the documents of the cluster;

• The number of common nodes in a given path of the XML document; and

• The ordering of the nodes of the XML document.

CPSim computed from the aforementioned criteria is then used by this incremental

clustering method.
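As an illustration of these criteria, the sketch below scores a document against a cluster by ordered node overlap between root paths. It is a simplification for exposition only, not the exact CPSim measure of [84]; all function names are hypothetical.

```python
# Toy common-path similarity in the spirit of CPSim: common nodes
# (criterion 1), counted per path (criterion 2), respecting node order
# (criterion 3). Not the measure defined in [84].

def path_overlap(p1, p2):
    """Number of nodes of p1 matched in p2, in order."""
    matches, j = 0, 0
    for node in p1:
        if j < len(p2) and node == p2[j]:
            matches += 1
            j += 1
    return matches

def cpsim_like(doc_paths, cluster_paths):
    """Best per-path overlap, normalised by the document's total path length."""
    if not doc_paths:
        return 0.0
    total = sum(max((path_overlap(p, q) for q in cluster_paths), default=0)
                for p in doc_paths)
    return total / sum(len(p) for p in doc_paths)
```

A document whose every root path also occurs in the cluster scores 1.0; partially shared paths score proportionally less.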

Common Rare Path (CRP), 4RP (4-length Rare Path) and Word descriptor

[131]

The authors, Yao and Zarida, utilised the paths to cluster XML documents using both

structure and content features. Their experimental methods were applied on the INEX


2007 dataset. Three methods were proposed, namely CRP, 4RP and Word descriptor,

based on the paths. In the CRP clustering method, the complete root path which is the

full path starting from the root node to the text node is used to measure the similarity

between two XML documents. In the 4RP clustering method, a partial path of length 4

(containing 4 nodes) which contains the content in the text node is used to measure the

similarity between XML documents. Finally, the word descriptor utilises only the words

present in the document. All of these methods used a combination of partitional and

agglomerative clustering approaches to cluster the documents.

Doucet et al. [36]

This method represents the structure and the content features of the XML documents

in a VSM and then directly applies the K-means algorithm for clustering. It assigns weights to both the structure and the content features in order to integrate these two features linearly in the VSM.

SOM-based approaches [60, 45]

Three approaches based on the neural network model SOM are used to benchmark the proposed methods, namely SOM-SD, CSOM-SD and GraphSOM. SOM-SD focuses on clustering using only the structural properties, whereas CSOM-SD also uses the contextual information. These two approaches were evaluated on the INEX

IEEE dataset. GraphSOM is an extension of CSOM-SD that utilises a graph structure to

model the XML documents. By modelling the documents as graphs, the authors in [45]

claim that they could avoid information loss inherent in vector-based representation.


3.6.2.3 Clustering using different tensor decompositions

In order to understand the impact of the progressive tensor decomposition algorithm proposed in this thesis (detailed in Chapter 5), three clustering methods were used. These clustering methods use the same methodology as the proposed tensor-based clustering method, HCX-T, but replace the proposed progressive decomposition algorithm with the state-of-the-art decomposition algorithms CANDECOMP/PARAFAC (CP), Tucker and MACH.

CP

This method uses the CANDECOMP/PARAFAC decomposition algorithm for decom-

posing the tensor model created using the structure and the content of the XML docu-

ments. In this method, the left singular matrix resulting from applying CP decomposition

on the tensor is used as an input for K-means clustering.
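The decompose-then-cluster pipeline can be sketched as below. A real CP decomposition would come from a tensor library (e.g. a PARAFAC routine); to keep the sketch dependency-free, mode-1 matricization followed by a truncated SVD stands in for extracting the document factor matrix whose rows would then be fed to K-means. Shapes and data are illustrative.

```python
# Sketch of the decompose-then-cluster pipeline used by the CP benchmark.
# NOTE: matricization + truncated SVD is a stand-in for a real CP
# decomposition, which a tensor library would provide. Toy random data.
import numpy as np

rng = np.random.default_rng(0)
tensor = rng.random((6, 4, 5))       # documents x structure x content (toy)

# Mode-1 matricization: one row per document, other modes flattened.
unfolded = tensor.reshape(tensor.shape[0], -1)          # shape (6, 20)

# The left singular vectors play the role of the document factor matrix.
u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
doc_factors = u[:, :2]               # rank-2 document representation
# doc_factors would now be the input to K-means clustering.
```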

Tucker

This method uses the same tensor model as the Clustering using CP method but uses

the Tucker decomposition algorithm for decomposing the tensor.

MACH

This method uses the recent scalable and randomized decomposition technique, MACH (discussed in the previous chapter), to decompose the tensor. MACH randomly projects the original tensor to a reduced tensor with a smaller percentage of entries (10% of the original tensor, as specified in [112]) and then uses the Tucker decomposition to decompose the reduced tensor.
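The random sparsification step can be sketched as follows, assuming the tensor is stored as a dictionary of non-zero entries; keeping each entry with probability p and rescaling the kept values by 1/p keeps the reduced tensor unbiased in expectation. Names are illustrative and the exact procedure follows [112].

```python
# Toy sketch of MACH-style random sparsification: keep roughly p of the
# tensor entries (p = 0.1 as in [112]) and rescale them by 1/p, before the
# reduced tensor is handed to a Tucker decomposition.
import random

def sparsify(tensor_entries, p=0.1, seed=0):
    """tensor_entries: dict mapping index tuples to values."""
    rng = random.Random(seed)
    return {idx: val / p              # rescale kept entries to stay unbiased
            for idx, val in tensor_entries.items() if rng.random() < p}
```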

Using these representations and clustering methods, the proposed clustering methods

will be benchmarked using the evaluation metrics detailed in this chapter.


3.7 Chapter summary

This chapter has presented the experimental design for the experiments that will be con-

ducted in this research. It has also analysed the various datasets – synthetic and real-life

datasets – based on their attributes, which will provide a good understanding of the exper-

imental results for both frequent pattern mining and clustering. The details of the choice

of the datasets were also covered. The evaluation metrics that will be used for benchmark-

ing the proposed methods were also presented, along with the benchmarks that will be

used for comparing the proposed frequent pattern mining and clustering methods detailed

in Chapter 4 and 5 respectively.

An extensive empirical study will be carried out by varying the pre-processing steps, the support threshold for frequent pattern mining and the clustering technique in order to achieve the best clustering results for the datasets used. In addition, the clustering results from both phases will be compared to understand the impact of clustering techniques using VSM and TSM.


Chapter 4

Frequent Pattern Mining of XML

Documents

4.1 Introduction

This chapter introduces the frequent pattern mining methods that have been developed in this thesis to generate frequent subtrees from a set of XML documents. It proposes a suite of

frequent subtree mining methods which generate concise representations of induced and

embedded subtrees using the prefix-based pattern growth approach. An in-depth empirical

analysis is conducted to evaluate the efficiency of the proposed methods on both synthetic

and real-life datasets over the state-of-the-art frequent pattern mining methods.

Discovering frequent subtrees has practical significance, such as improving users' understanding of a data source, assisting database indexing and access method design, and serving as the first step in classifying and clustering tree-structured data [67, 135]. The

main aim of applying frequent pattern mining on the structure of XML documents in this

research is to get the concise representation of the structure of the document collection.


This permits a reduction in dimensionality for further applications; in this research, the application is clustering. Hence, in this phase the frequent subtrees are generated for a

given user-defined support threshold. For the purpose of clustering, the content of the

XML documents is extracted using these frequent subtrees.

However, the number of frequent subtrees usually grows exponentially with the size of the document trees, especially when a tree contains a very large number of nodes. There are two consequences of this exponential growth. Firstly, it

causes difficulties in analysing the results: the sheer number of generated patterns makes a comprehensive interpretation difficult. In a way, it defeats the purpose of applying

frequent pattern mining, that is, getting frequent or common patterns that can explain the

dataset. Secondly, the frequent subtree mining algorithm could become intractable. An

attempt to overcome this problem by increasing the support threshold could result in the

loss of important and interesting patterns. Hence, to alleviate the explosion in the number

of frequent subtrees, this research aims to generate concise representations by restricting

the number of frequent subtrees.

This chapter begins with the basic pre-processing steps required to convert an XML

document into a tree structure suitable for frequent subtree mining. It then discusses the

different types of subtrees based on their conciseness, relationships between the nodes and

the constraints that could be applied on them. The details of frequent subtree mining

methods using the pattern growth techniques are then provided to understand the basics

of the prefix-based pattern growth approach and the benefits of it over the “generate-

and-test” approach. The remainder of the chapter provides the details about the various

methods that are developed in this thesis to generate concise frequent subtrees using the

prefix-based pattern growth approach. It further details the individual methods using

these techniques to efficiently generate the different types of concise frequent subtrees.


The experimental section evaluates the proposed methods and compares them with the

state-of-the-art methods. Finally, the analysis of the experimental results is presented in

the discussion section.

4.2 Pre-Processing of the structure in XML documents

Most of the tree mining methods cannot be applied directly to XML documents. They

often need a pre-processing step, which uses the XML documents as input and outputs a

rooted, ordered, labelled tree for each document. The rooted, ordered and labelled tree

reflects the tree structure of the original XML document, where the node labels are the

tags of the XML document. Then the documents will be further transformed depending

on the document model used by the frequent pattern mining task.

The pre-processing of the structure of XML documents involves three sub-phases as

shown in Figure 4.1. They are:

• Parsing;

• Representation; and

• Duplicate branches removal.

Figure 4.1: The pre-processing phase for the structure of XML documents (XML documents → Parsing → document trees → Representation (depth-first string encoding) → convert document trees to paths → remove duplicate paths by string matching → convert paths back to document trees)


Parsing

Parsing of XML documents can be done using a Simple API for XML (SAX) or a Document Object Model (DOM) parser. These two parsers take very different approaches: the SAX parser provides a sequence of tags or terms in the order of occurrence in the XML

document; DOM parser provides a hierarchical object model in the form of a tree of nodes

where nodes are the tags in the XML document. Since the DOM parser preserves the

hierarchical relationship among the nodes, it is used in this research to extract the tree

structure of the XML documents.

Each XML document in the dataset is parsed and modelled as a rooted labelled ordered

document tree. The document tree is rooted and labelled since a root node always exists

in the document tree and all the nodes are labelled using the tag names. The left-to-right

ordering is preserved among the child nodes of a given parent in the document tree, and therefore the tree is ordered.
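As a small illustration, the standard-library minidom parser can be used to build such a rooted, labelled, ordered tree; the sample document and the helper are illustrative, not part of the thesis implementation:

```python
# Sketch of the parsing sub-phase: a DOM parser turns an XML document into
# a rooted, labelled, ordered tree of tag names.
from xml.dom import minidom

xml_text = "<article><title/><body><sec/><sec/></body></article>"
dom = minidom.parseString(xml_text)

def to_tree(node):
    """Recursively keep tag names and left-to-right child order."""
    return (node.tagName,
            [to_tree(c) for c in node.childNodes
             if c.nodeType == c.ELEMENT_NODE])

tree = to_tree(dom.documentElement)
```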

Representation

The document trees need to be represented in a way that is suitable for mining in the

subsequent phase. A popular representation for trees, the depth-first string format [22],

is used to represent the document trees. The depth-first string encoding traverses a tree in depth-first order and represents the traversal of a given document tree in a string-like format, in which a “-1” is emitted whenever the traversal backtracks from a node.

In a document tree dataset DT, for a document tree DTi with only one node having a label X, the depth-first string of DTi is S(DTi) = < X >. For a document tree DTi with multiple nodes having labels X, Y, Z and K, the depth-first string of DTi is S(DTi) = < X^a Y^b Z^c -1 ... K^n -1 -1 -1 >, where the superscripts a, b, c, ..., n are the increasing positions of the nodes in the pre-order traversal of the tree.
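A sketch of this encoding on tree 1 of Table 4.1 is given below. The position superscripts are omitted for readability, and a “-1” is emitted uniformly after every node, which matches the multi-node examples (the single-node case in the text drops the trailing backtrack):

```python
# Depth-first string encoding sketch: pre-order traversal, with "-1"
# appended after each node's subtree to mark backtracking. Trees are
# (label, children) tuples as produced by the parsing step.

def depth_first_string(node):
    label, children = node
    parts = [label]
    for child in children:
        parts.extend(depth_first_string(child))
    parts.append("-1")
    return parts

tree1 = ("A", [("B", [("C", [])]), ("E", [])])   # tree 1 of Table 4.1
# depth_first_string(tree1) yields the token sequence of < A B C -1 -1 E -1 -1 >
```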


Duplicate branches removal

Many real-life datasets contain a large number of document trees having duplicate branches. These duplicate branches carry repeated information and cause additional overhead in the mining process due to their redundancy; hence, they need to be removed. To remove the duplicate branches, each document tree is converted into a series of paths. The duplicate paths are then identified by string matching and removed, and the remaining paths are combined to recreate the document tree.
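These steps can be sketched as follows; the helpers are illustrative, and the containment test is plain string matching on whole root-to-leaf paths:

```python
# Duplicate-branch removal sketch: decompose a (label, children) tree into
# its root-to-leaf paths, drop duplicate paths by string matching, and keep
# the distinct paths from which the tree can be rebuilt.

def root_paths(node, prefix=()):
    label, children = node
    path = prefix + (label,)
    if not children:
        return [path]
    paths = []
    for child in children:
        paths.extend(root_paths(child, path))
    return paths

def distinct_paths(tree):
    seen, result = set(), []
    for path in root_paths(tree):
        key = "/".join(path)          # string matching on the whole path
        if key not in seen:
            seen.add(key)
            result.append(path)
    return result

tree = ("A", [("B", [("C", [])]), ("B", [("C", [])])])   # duplicate branch
# distinct_paths(tree) keeps a single A/B/C path
```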

4.3 Types of subtrees

In general, frequent pattern mining methods are applied to generate two types of subtrees,

induced and embedded, which preserve the parent-child and the ancestor-descendant re-

lationships among their nodes respectively. This section defines various types of concise

representations based on induced and embedded subtrees. The different types of subtrees

and the concepts of the proposed frequent subtree mining approach will be explained using the example document tree dataset DT shown in Table 4.1.

Table 4.1: Document tree dataset example (DT)

Tree Id | Pre-order string
1       | < A1B2C3 -1 -1 E4 -1 -1 >
2       | < A1B2C3 -1 -1 E4F5 -1 -1 -1 >
3       | < A1E2F3 -1 G4 -1 -1 >

4.3.1 Concise Frequent Induced (CFI) subtrees

There are two types of concise frequent induced subtrees, closed and maximal, as defined

in Chapter 2. This research defines these concise representations of induced subtrees as

follows.


Definition: Closed Frequent Induced (CFI) subtree

In DT, let there be two frequent induced subtrees DT′ and DT′′. The frequent induced subtree DT′ is closed of DT′′ iff (1) DT′ ⊃t DT′′, where ⊃t denotes the supertree relationship, (2) supp(DT′) = supp(DT′′), (3) there exists no supertree of DT′ having the same support as DT′, and (4) DT′ is the induced supertree of DT′′. This property is called the induced closure and DT′ is a CFI subtree in DT.

Definition: Maximal Frequent Induced (MFI) subtree

In DT, let two frequent induced subtrees DT′ and DT′′ exist. The frequent induced subtree DT′ is said to be maximal of DT′′ iff (1) DT′ ⊃t DT′′ and supp(DT′) ≥ supp(DT′′), (2) there exists no supertree of DT′ having a support greater than that of DT′, and (3) DT′ is the induced supertree of DT′′. This property is called the induced maximality and DT′ is the MFI subtree in DT.

These two types of subtrees and their benefits will be explained using the running

example dataset DT in Table 4.1. Consider Table 4.2, which lists the frequent induced subtrees generated using the prefix-based pattern growth approach with a support threshold of min supp = 2.

It can be seen that subtrees such as (< B1 -1 >: 2), (< C1 -1 >: 2), (< A1B2 -1 -1 >: 2), (< B1C2 -1 -1 >: 2) are subtrees of (< A1B2C3 -1 -1 -1 >: 2) with the same support.

On applying the closure property, it can be seen from Table 4.3 that there are 3 CFI

subtrees for the example DT in comparison to 13 frequent induced subtrees (as shown

in Table 4.2). It is interesting to note that the closure property has reduced the number of frequent induced subtrees from 13 to 3 in this example. As only the subtrees with the same support are checked for closure, this results in no loss of information.

On the other hand, by applying the maximality, it can be seen from Table 4.4 that

there are 2 MFI subtrees in comparison to 13 frequent induced subtrees (as shown in Table


Table 4.2: Frequent induced subtrees generated from DT (in Table 4.1) using the prefix-based pattern growth approach

No. of Nodes | Frequent Induced Subtrees
1 | (< A1 -1 >: 3), (< B1 -1 >: 2), (< C1 -1 >: 2), (< E1 -1 >: 3), (< F1 -1 >: 2)
2 | (< A1B2 -1 -1 >: 2), (< B1C2 -1 -1 >: 2), (< A1E2 -1 -1 >: 3), (< E1F2 -1 -1 >: 2)
3 | (< A1B2C3 -1 -1 -1 >: 2), (< A1B2 -1 E3 -1 -1 >: 2), (< A1E2F3 -1 -1 -1 >: 2)
4 | (< A1B2C3 -1 -1 E4 -1 -1 >: 2)

Table 4.3: Closed Frequent Induced subtrees generated from DT (in Table 4.1)

No. of Nodes | Closed Frequent Induced Subtrees
2 | (< A1E2 -1 -1 >: 3)
3 | (< A1E2F3 -1 -1 -1 >: 2)
4 | (< A1B2C3 -1 -1 E4 -1 -1 >: 2)

4.2). The subtree with two nodes, (< A1E2 − 1 − 1 >: 3), is an induced subtree of the

subtree with three nodes, (< A1E2F 3 − 1− 1− 1 >: 2). Hence, by applying maximality,

(< A1E2 − 1− 1 >: 3) could be eliminated as its supertree, (< A1E2F 3 − 1− 1− 1 >: 2),

is present.

Table 4.4: Maximal Frequent Induced subtrees generated from DT (in Table 4.1)

No. of Nodes | Maximal Frequent Induced Subtrees
3 | (< A1E2F3 -1 -1 -1 >: 2)
4 | (< A1B2C3 -1 -1 E4 -1 -1 >: 2)

The research now proposes a new type of concise frequent subtrees based on their

length, the Length Constrained Concise Frequent Induced subtree. The length constrained

concise frequent induced subtrees are used in this method for the following reasons:

• Extracting all the concise frequent induced subtrees is computationally expensive for datasets with a high branching factor;

• Not all concise frequent induced subtrees are required when utilising them to retrieve the content for clustering; and

• For some datasets, the longer concise frequent induced subtrees could become too specific and hence could reduce the quality of the clustering solutions that use these patterns.


Now the length constrained frequent closed and maximal induced subtrees will be

discussed.

Definition: Length Constrained Closed Frequent Induced (CFIConst) subtree

In DT, for a given support threshold min supp and a length constraint const, let there be two frequent subtrees DT′ and DT′′. The frequent subtree DT′ is closed of DT′′ iff (1) DT′ ⊃t DT′′ and supp(DT′) = supp(DT′′), (2) there exists no DT* ⊃t DT′ such that supp(DT*) = supp(DT′), (3) DT′ is the induced supertree of DT′′, and (4) len(DT′) ≤ const. This property is called the length constrained induced closure and DT′ is a length constrained CFI subtree in DT, denoted by CFIConst.

Definition: Length Constrained Maximal Frequent Induced (MFIConst) sub-

tree

In DT, for a given min supp and const, let two frequent subtrees DT′ and DT′′ exist. The frequent subtree DT′ is maximal of DT′′ iff (1) DT′ ⊃t DT′′ and supp(DT′) ≥ supp(DT′′), (2) there exists no DT* ⊃t DT′ such that supp(DT*) ≥ min supp and supp(DT*) ≠ supp(DT′), (3) DT′ is the induced supertree of DT′′, and (4) len(DT′) ≤ const. This property is called the length constrained induced maximality and DT′ is a length constrained MFI subtree in DT, denoted by MFIConst.

Tables 4.5 and 4.6 use the running example in Table 4.1 to show the length constrained closed and maximal frequent induced subtrees with const = 3, respectively.

Table 4.5: Length Constrained Closed Frequent Induced subtrees generated from DT (in Table 4.1)

#Nodes | Length Constrained Closed Frequent Induced Subtrees
2 | (< A1E2 -1 -1 >: 3)
3 | (< A1B2C3 -1 -1 -1 >: 2), (< A1E2F3 -1 -1 -1 >: 2), (< A1B2 -1 E3 -1 -1 >: 2)

It should be noted that in this example the length constrained concise frequent in-

duced subtrees produce a greater number of concise frequent induced subtrees but the


Table 4.6: Length Constrained Maximal Frequent Induced subtrees generated from DT (in Table 4.1)

#Nodes | Length Constrained Maximal Frequent Induced Subtrees
3 | (< A1B2C3 -1 -1 -1 >: 2), (< A1E2F3 -1 -1 -1 >: 2), (< A1B2 -1 E3 -1 -1 >: 2)

length of each generated subtree is controlled. This is because the threshold condition on the constraint length, const = 3, avoids the generation of the subtree

(< A1B2C3−1−1E4−1−1 >: 2) which is the supertree for (< A1B2C3−1−1−1 >: 2)

and (< A1B2 − 1E3 − 1 − 1 >: 2). However, if the constraint length (const) is set to 4

in this example, all the concise frequent subtrees could be discovered. Hence, determining the correct constraint length helps to avoid information loss and to improve the computational efficiency of the length constrained CFI frequent pattern mining methods.

4.3.2 Concise Frequent Embedded (CFE) subtrees

The concise frequent induced subtrees discussed in the previous subsection present a strict

relationship among their nodes by using only a parent-child relationship. This reduces the

possibility of discovering some hidden similarities between the trees. In order to increase

the prospect of identifying the hidden similarity, the embedded subtrees are utilised to

impose a less strict relationship by allowing the ancestor-descendant relationship among

their nodes [18, 135]. As the embedded subtrees identify hidden relationships, the number

of embedded subtrees generated is larger than the number of induced subtrees and this

could result in an information explosion when the average depth of the tree is large. In

order to control the number of embedded subtrees, it is essential to generate only the

concise representations called the Concise Frequent Embedded subtrees.

As was the case with the concise frequent induced subtrees, this research defines four types of concise frequent embedded subtrees using the ancestor-descendant relationship among their nodes.


Definition: Closed Frequent Embedded (CFE) subtree

In DT, let there be two frequent embedded subtrees DT′ and DT′′. The frequent embedded subtree DT′ is closed of DT′′ iff (1) DT′ ⊃t DT′′ and supp(DT′) = supp(DT′′), (2) there exists no DT* ⊃t DT′ such that supp(DT*) = supp(DT′), and (3) DT′ is the embedded supertree of DT′′. This property is called the embedded closure and DT′ is the CFE subtree in DT.

Definition: Maximal Frequent Embedded (MFE) subtree

In DT, let two frequent embedded subtrees DT′ and DT′′ exist. The frequent embedded subtree DT′ is said to be maximal of DT′′ iff (1) DT′ ⊃t DT′′, (2) there exists no DT* ⊃t DT′ such that supp(DT*) ≥ min supp and supp(DT*) ≠ supp(DT′), and (3) DT′ is the embedded supertree of DT′′. This property is called the embedded maximality and DT′ is the MFE subtree in DT.

Definition: Length Constrained Closed Frequent Embedded (CFEConst) sub-

tree

In DT, for a given min supp and const, let two frequent subtrees DT′ and DT′′ exist. The frequent embedded subtree DT′ is closed of DT′′ iff (1) DT′ ⊃t DT′′ and supp(DT′) = supp(DT′′), (2) there exists no DT* ⊃t DT′ such that supp(DT*) = supp(DT′), (3) DT′ is the embedded supertree of DT′′, and (4) len(DT′) ≤ const. This property is called the length constrained embedded closure and DT′ is the length constrained CFE subtree in DT.

Definition: Length Constrained Maximal Frequent Embedded (MFEConst) subtree

In DT, for a given min supp and const, there exist two frequent subtrees DT′ and DT′′. The frequent subtree DT′ is maximal with respect to DT′′ iff (1) DT′ ⊃t DT′′ and supp(DT′) = supp(DT′′), (2) ∄ DT∗ ⊃t DT′ such that supp(DT∗) ≥ min supp and supp(DT∗) ≠ supp(DT′), (3) DT′ is the embedded supertree of DT′′, and (4) len(DT′) ≤ const.


This property is called embedded maximality and DT′ is the length constrained MFE subtree in DT.

Considering the running example dataset DT in Table 4.1, it can be seen that there are

15 frequent embedded subtrees in DT, as listed in Table 4.7, as compared to the 13 frequent

induced subtrees listed in Table 4.2. It can be noted that (< A1B2C3−1−1E4−1−1 >: 2)

is the supertree of the 8 subtrees that have the same support value of 2 and can replace

them. Similarly, (< A1E2 − 1− 1 >: 3) and (< A1F 2 − 1− 1 >: 2) can also replace their

respective subtrees having the same support.

Table 4.7: Frequent embedded subtrees generated from DT (in Table 4.1) using prefix-pattern growth methods

#Nodes Frequent Embedded Subtrees

1 (< A1 − 1 >: 3), (< B1 − 1 >: 2), (< C1 − 1 >: 2),(< E1 − 1 >: 3), (< F 1 − 1 >: 2)

2 (< A1B2 − 1− 1 >: 2), (< B1C2 − 1− 1 >: 2),(< A1C2 − 1− 1 >: 2), (< A1E2 − 1− 1 >: 3), (< A1F 2 − 1− 1 >: 2)

3 (< A1B2C3 − 1− 1− 1 >: 2), (< A1B2 − 1E3 − 1− 1 >: 2),(< A1C2 − 1E3 − 1− 1 >: 2), (< A1E2F 3 − 1− 1− 1 >: 2)

4 (< A1B2C3 − 1− 1E4 − 1− 1 >: 2)

The CFE subtrees result set will be freqT(DT): (< A1E2 − 1 − 1 >: 3), (< A1F2 − 1 − 1 >: 2), (< A1B2C3 − 1 − 1E4 − 1 − 1 >: 2), as shown in Table 4.8. The MFE subtrees result set will be freqT(DT): (< A1F2 − 1 − 1 >: 2), (< A1B2C3 − 1 − 1E4 − 1 − 1 >: 2), as shown in Table 4.9; the subtree (< A1E2 − 1 − 1 >: 3) is not maximal as it has a frequent supertree, (< A1B2C3 − 1 − 1E4 − 1 − 1 >: 2). Tables 4.10 and 4.11 show the length constrained closed and maximal embedded subtrees generated for the sample XML dataset with the length constraint const=3.

Table 4.8: Closed Frequent Embedded (CFE) subtrees generated from DT (in Table 4.1)

#Nodes Closed Frequent Embedded Subtrees

2 (< A1E2 − 1− 1 >: 3), (< A1F 2 − 1− 1 >: 2)

4 (< A1B2C3 − 1− 1E4 − 1− 1 >: 2)

Table 4.9: Maximal Frequent Embedded (MFE) subtrees generated from DT (in Table 4.1)

#Nodes Maximal Frequent Embedded Subtrees

2 (< A1F 2 − 1− 1 >: 2)

4 (< A1B2C3 − 1− 1E4 − 1− 1 >: 2)


Table 4.10: Length Constrained Closed Frequent Embedded (CFEConst) subtrees generated from DT (in Table 4.1)

#Nodes Length Constrained Closed Frequent Embedded Subtrees

2 (< A1E2 − 1− 1 >: 3)

3 (< A1B2C3 − 1 − 1 − 1 >: 2), (< A1B2 − 1E3 − 1 − 1 >: 2), (< A1C2 − 1E3 − 1 − 1 >: 2), (< A1E2F3 − 1 − 1 − 1 >: 2)

Table 4.11: Length Constrained Maximal Frequent Embedded (MFEConst) subtrees generated from DT (in Table 4.1)

#Nodes Length Constrained Maximal Frequent Embedded subtrees

3 (< A1B2C3 − 1 − 1 − 1 >: 2), (< A1B2 − 1E3 − 1 − 1 >: 2), (< A1C2 − 1E3 − 1 − 1 >: 2), (< A1E2F3 − 1 − 1 − 1 >: 2)
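The closed and maximal filtering illustrated by Tables 4.8 and 4.9 can be sketched in a few lines. This is a naive post-filter, not the thesis's prefix-based miners: tree inclusion is approximated by a label-subsequence test, and the pattern strings and supports below are abbreviations of Table 4.7's entries, so the helper names and encoding are assumptions for illustration only.

```python
def is_subpattern(p, q):
    """Crude stand-in for embedded subtree inclusion: is p's label
    sequence a subsequence of q's? (Real tree inclusion also checks
    the hierarchical relationships between nodes.)"""
    it = iter(q)
    return all(ch in it for ch in p)

def closed_patterns(freq):
    """A pattern is closed iff no proper superpattern has equal support."""
    return {p: s for p, s in freq.items()
            if not any(p != q and is_subpattern(p, q) and s == t
                       for q, t in freq.items())}

def maximal_patterns(freq):
    """A pattern is maximal iff no proper superpattern is frequent."""
    return {p: s for p, s in freq.items()
            if not any(p != q and is_subpattern(p, q)
                       for q in freq)}

# Abbreviated versions of Table 4.7's patterns with their supports:
freq = {"A": 3, "E": 3, "AE": 3, "AF": 2, "ABCE": 2, "ABC": 2}
```

With this input, the closed set keeps only "AE", "AF" and "ABCE" (mirroring Table 4.8) and the maximal set keeps only "AF" and "ABCE" (mirroring Table 4.9), since "AE" has the frequent supertree "ABCE".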

The next section provides the background of prefix-based pattern growth for frequent subtree mining, explaining the process of the technique and its benefits over the "generate-and-test" approach.

4.4 Frequent subtree mining: Background

The prefix-based pattern growth for mining frequent subtrees involves the following three

phases:

• The 1-Length frequent subtree generation;

• Projecting the dataset using the prefix trees; and

• Mining the prefix-tree projected dataset.

4.4.1 The 1-Length frequent subtree generation

For a given user-defined minimum support (min supp), the prefix-based subtree growth

technique starts with a scan of the document tree dataset DT to determine the 1-Length

frequent subtrees that have a support greater than the min supp. A 1-Length frequent

subtree containing a single node is represented using the following subtree pre-order string

format as SubX = (< Xa − 1 >: Supp). The subtree pre-order representation includes an


element called Supp to indicate the support value of the subtree.

With the running example in Table 4.1, this step, when applied with a user-defined

support threshold, min supp=2, results in the following 1-Length frequent subtrees: (<

A1 − 1 >: 3), (< B1 − 1 >: 2), (< C1 − 1 >: 2), (< E1 − 1 >: 3), (< F1 − 1 >: 2). The

subtree (< G1−1 >:1) is infrequent as it occurs only once, which is less than the min supp

value, and hence it is not included in the output.
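The scan above can be sketched as follows, assuming document trees encoded as whitespace-separated pre-order strings where "-1" is a backtrack token. The exact shape of the third tree is an assumption (Table 4.1 is not reproduced in this chunk); its labels A, E, F, G match the running example's support counts.

```python
from collections import Counter

def one_length_frequent(trees, min_supp):
    # support = number of document trees in which a label occurs
    counts = Counter()
    for t in trees:
        counts.update({tok[0] for tok in t.split() if tok != "-1"})
    return {lab: s for lab, s in counts.items() if s >= min_supp}

DT = [
    "A1 B2 C3 -1 -1 E4 -1 -1",        # tree 1: A(B(C), E)
    "A1 B2 C3 -1 -1 E4 F5 -1 -1 -1",  # tree 2: A(B(C), E(F))
    "A1 E2 F3 -1 -1 -1 G2 -1 -1",     # tree 3 (assumed shape): A(E(F), G)
]
# With min_supp = 2, 'G' (support 1) is pruned, matching the text.
```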

Definition: Prefix-Tree

Let there be a document tree DTp with m nodes and let Tp be a tree with n nodes where

n ≤ m. The pre-order scanning of the document tree DTp from its root until its n-th node

results in a tree Tj. If Tj is isomorphic (or structurally identical) to Tp, then Tp is called

the prefix-tree of DTp .

Figure 4.2 shows the prefix-trees for the document tree DTp illustrated in Figure 4.2(a). The

4 prefix-trees for A containing 1, 2, 3 and 4 nodes are identified for the document tree

DTp following the pre-order string representation. Since the node 'E' appears only after the prefix tree < A1B2C3 − 1 − 1 − 1 > in the pre-order traversal, < A1E2 − 1 − 1 > is not a prefix-tree of DTp.
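Prefix-tree enumeration from a pre-order string can be sketched as below: the first n node tokens, with enough "-1" backtracks appended to close every node still open. The function name and string encoding are assumptions; for the document tree of Figure 4.2(a) this yields exactly the four prefix-trees described above, and < A1 E2 -1 -1 > is not among them.

```python
def prefix_trees(tree):
    """Enumerate all prefix-trees of a pre-order string: one per node,
    closing every still-open node with appended backtracks."""
    out, depth, prefix = [], 0, []
    for tok in tree.split():
        depth = depth - 1 if tok == "-1" else depth + 1
        prefix.append(tok)
        if tok != "-1":
            out.append(" ".join(prefix + ["-1"] * depth))
    return out

DTp = "A1 B2 C3 -1 -1 E4 -1 -1"   # A(B(C), E) in pre-order form
```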


Figure 4.2: (a) a document tree DTp; (b) Prefix trees of (a)


4.4.2 Projecting the dataset using the prefix trees

The next step in this process involves projecting the dataset using the prefix trees. The

process is started by using the 1-Length frequent subtrees as prefix-trees. To build the

Tp-prefix-projected dataset, every document tree in DT is checked to establish whether it

contains the prefix-tree Tp. If a document tree DTp contains Tp then its projected instance

for Tp is constructed.

Definition: The Prefix-tree projected instance

Consider a prefix tree Tp with n nodes. If a document tree DTp ∈ DT with m nodes

exists, with m ≥ n and Tp ⊂t DTp, then the Tp-prefix projected instance of DTp is the pre-order scanning of DTp from the (n+1)-th node to the m-th node.

The Prefix-tree projected dataset can be defined as follows.

Definition: The Prefix-tree projected dataset

A Tp-prefix-tree projected dataset is obtained by constructing the Tp-prefix-tree projected

instances for all the document trees in DT.

In the running example with min supp=2, DT is projected using the generated 1-Length frequent subtrees as prefix-trees. The prefix-trees of the document tree (DT1) with a tree-id of 1 for the 1-Length frequent subtree (< A1 − 1 >: 3) are < A1 − 1 >, < A1B2 − 1 − 1 >, < A1B2C3 − 1 − 1 − 1 > and < A1B2C3 − 1 − 1E4 − 1 − 1 >.

Tables 4.12, 4.13 and 4.14 provide the projected instances dataset of the prefix-trees

< A1−1 >, < B1−1 > and < A1B2−1−1 > respectively. To improve the efficiency, the

projected instances from the infrequent 1-Length subtrees are eliminated. It can be noted

in Table 4.12 that the tree with Tree Id 3 does not contain the node ‘G’ as it is infrequent

and hence this node is eliminated in projection. The generated projected instances are


mined using the technique detailed in the following subsection.

Table 4.12: < A1 − 1 > projected instances dataset

Tree Id Pre-order strings of Trees

1 B2C3 − 1− 1E4 − 1

2 B2C3 − 1− 1E4F 5 − 1− 1

3 E2F 3 − 1− 1

Table 4.13: < B1 − 1 > projected instances dataset

Tree Id Pre-order strings of Trees

1 C3 − 1

2 C3 − 1

Table 4.14: < A1B2 − 1− 1 > projected instances dataset

Tree Id Pre-order strings of Trees

1 C3 − 1

2 C3 − 1
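A projected dataset such as Table 4.12 can be sketched as below for the 1-Length case: each projected instance is simply the pre-order token stream after the matching node. The tree strings are the assumed encodings used earlier, and trailing backtracks are kept here although the tables omit some of them for brevity; the helper names are illustrative, not the thesis's.

```python
def projected_instance(tree, prefix_label):
    """Project one pre-order string on a 1-node prefix-tree: return the
    tokens after the first node matching prefix_label (nodes n+1..m)."""
    toks = tree.split()
    for i, tok in enumerate(toks):
        if tok != "-1" and tok[0] == prefix_label:
            return " ".join(toks[i + 1:])
    return None   # tree does not contain the prefix-tree

def projected_dataset(trees, prefix_label):
    """Collect the projected instances of all trees containing the prefix."""
    return {tid: inst for tid, t in enumerate(trees, start=1)
            if (inst := projected_instance(t, prefix_label)) is not None}

DT = ["A1 B2 C3 -1 -1 E4 -1 -1",
      "A1 B2 C3 -1 -1 E4 F5 -1 -1 -1",
      "A1 E2 F3 -1 -1 -1"]
```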

4.4.3 Mining the prefix-tree projected dataset

As the next step in the prefix-pattern growth, each of the prefix-tree projected datasets is mined to identify the Growth Nodes (GNs).

Definition: Growth Node (GN)

Given two prefix-trees Tp and T′p with m and m+1 nodes respectively, where Tp is the prefix of T′p, if a node n occurs in T′p but not in Tp, then the node n is the Growth Node (GN) of Tp with respect to T′p.

If a GN is frequent then the prefix-tree with the GN forms the frequent subtree.

For each of the frequent GNs the corresponding projection is constructed and mined

recursively until there are no more frequent GNs to be projected.

Mining the < A1 − 1 > prefix-tree projected dataset in the running example yields two GNs, namely the nodes 'B' and 'E' in Table 4.12, as they are the children of < A1 − 1 >; this example uses only a parent-child relationship and not an ancestor-descendant relationship. If the latter relationship is considered then


'C' and 'F' would also be considered as GNs. They are frequent and have a support value

greater than min supp. These frequent GNs combine with the prefix-tree < A1 − 1 > to

form their corresponding new prefix-trees < A1B2 − 1 − 1 >, < A1E2 − 1 − 1 >. These

prefix-trees will be used to identify the GNs. For instance, for the partitioned dataset

< A1 − 1 > provided in Table 4.12, the GNs are nodes labelled ‘B’ and ‘E’ as they

occur as first nodes in the projected instance. The support of GNs 'B' and 'E' is 2 and 3

respectively in the < A1 − 1 > prefix-projected dataset. Hence ‘B’ and ‘E’ are frequent

GNs. Using these two frequent GNs, two separate projections are constructed and mined

for the frequent subtrees. Table 4.14 shows the projection for < A1B2 − 1 − 1 > but the

projection for < A1B2 − 1C3 − 1− 1 > is empty and hence this projection is terminated.

The projection for < A1B2 − 1 − 1 > is then mined recursively until all the frequent

subtrees are identified.
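The GN identification for a 1-node prefix under the parent-child restriction used above can be sketched as follows: the GNs are the nodes at depth 1 of each projected instance (the direct children of the prefix node), with support counted once per tree. The encoding and function name are the same assumptions as in the earlier sketches.

```python
from collections import Counter

def growth_nodes(projected, min_supp):
    """Frequent Growth Nodes of a 1-node prefix: depth-1 labels of each
    projected instance, supported in at least min_supp trees."""
    counts = Counter()
    for inst in projected.values():
        depth, seen = 0, set()
        for tok in inst.split():
            if tok == "-1":
                depth -= 1
            else:
                depth += 1
                if depth == 1:          # direct child of the prefix node
                    seen.add(tok[0])
        counts.update(seen)
    return {lab: s for lab, s in counts.items() if s >= min_supp}

# <A1-1> projected dataset, cf. Table 4.12:
proj_A = {1: "B2 C3 -1 -1 E4 -1 -1",
          2: "B2 C3 -1 -1 E4 F5 -1 -1 -1",
          3: "E2 F3 -1 -1 -1"}
```

With min supp=2 this yields 'B' with support 2 and 'E' with support 3, matching the running example.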

So far, the basic process of generating frequent subtrees using the prefix-pattern growth

technique has been discussed. The following section provides the details of the techniques

for generating concise frequent subtrees on both induced and embedded subtrees. It will

also present the algorithm using these techniques and the frequent subtree mining methods

for generating the different types of subtrees.

4.5 Concise frequent subtree mining: Proposed techniques

Unlike the situation in itemset mining, generating closed or maximal subtrees from trees

is a challenge, due to the presence of hierarchical relationships and the need to preserve

these relationships while generating concise frequent subtrees. A naïve approach to the

generation of closed or maximal frequent subtrees is first to generate all the frequent

subtrees and then to eliminate the subtrees based on their support by checking the closure

or maximality. This is an expensive task when there are a large number of frequent


subtrees generated or when the frequent subtree mining could not be completed. Also,

by identifying concise frequent subtrees using this naïve approach, the process becomes

an additional step and results in computational overhead to the frequent subtree mining

process. Hence, it is essential to identify an efficient method that can provide the concise

result set as well as improve the efficiency of the frequent subtree mining process. There

are a number of concise pattern mining approaches proposed in the frequent itemset and

sequential mining [117, 127]. Unlike itemset or sequential mining, trees can have multiple branches, and hence closure checking using the traditional techniques cannot be applied to tree-structured datasets.

This thesis proposes two techniques for effectively mining concise frequent subtrees

using the pattern-growth approach discussed below. These involve:

1. Search space reduction using the backward scan; and

2. Node extension concise checking.

4.5.1 Search space reduction using the backward scan

This technique is applied after the generation of 1-Length frequent subtrees. It conducts

a backward scan of the document tree dataset, DT , to reduce the search space using the

following conditions:

Condition 1: Backward scan

Let there be two 1-Length prefix trees Tp and T′p in DT, having nodes labelled v and v′ respectively. If v′ is the ancestor node or parent node of v in all trees in DT, then the projection of Tp is stopped, as the projection of the subtree T′p based on v′ will include all the subtrees generated using the prefix tree Tp.
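A minimal sketch of this condition, under the pre-order string encoding assumed earlier: a 1-Length frequent subtree is skipped for projection when its label has one identical parent label in every tree in which it occurs, since that parent's projection subsumes its own. The helpers `parents_of` and `prunable` are illustrative names, and only the direct-parent case of the condition is shown (the ancestor case would track the whole stack).

```python
def parents_of(tree):
    """Map each label to the set of its parent labels in one tree."""
    stack, out = [], {}
    for tok in tree.split():
        if tok == "-1":
            if stack:
                stack.pop()
        else:
            lab = tok[0]
            if stack:
                out.setdefault(lab, set()).add(stack[-1])
            stack.append(lab)
    return out

def prunable(label, trees):
    """True if `label` has one common parent label wherever it occurs."""
    seen = set()
    for t in trees:
        p = parents_of(t)
        if label in p:
            seen |= p[label]
    return len(seen) == 1

DT = ["A1 B2 C3 -1 -1 E4 -1 -1",
      "A1 B2 C3 -1 -1 E4 F5 -1 -1 -1",
      "A1 E2 F3 -1 -1 -1"]
```

In the running example, 'B' and 'E' always hang under 'A', so their projections need not be built, while the root 'A' itself is never prunable.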


As the projection of the parent or the ancestor node includes the projection of the

child node or the descendant node respectively, this condition aids in reducing the number

of projections and hence the search space is also reduced effectively. However, by applying

this condition only, complete concise frequent subtrees cannot be obtained. Due to the

nature of the repeated projections, there is a possibility that some of the subtrees generated

are not concise. In order to check whether a subtree is concise or not for the generated

frequent subtrees, node extension concise checking is performed.

4.5.2 Node extension concise checking

The concise checking tests for the existence of one of the two properties, maximality or closure, in the tree dataset. It is applied to determine whether a subtree extended by a node is concise or not. If the extended subtree is found to be concise, then the original subtree is not considered concise.

According to the definitions of the concise frequent subtrees, a prefix-tree Tp is not

concise if at least one prefix-tree T ′p exists with the same support as that of Tp or with

a support greater than or equal to min supp. With the use of the prefix-based subtree

growth technique to generate frequent subtrees, the prefix-tree T ′p can occur in two possible

ways:

1. In the same prefix-projected dataset of Tp; and

2. In a different prefix-projected dataset of Tp.

Considering the example tree dataset, the prefix-tree T ′p =< A1B2 − 1− 1 > occurs in

the same prefix-projected dataset as that of Tp =< A1 − 1 >. The conciseness of Tp with

respect to T ′p can be checked by using the Growth Node (defined in Section 4.3) extension


closure checking or Growth Node extension maximality checking condition, according to

the type of concise trees that are being generated.

Condition 2a: Growth Node extension closure checking

A prefix tree Tp can be extended to T ′p in the same prefix-projected dataset using its

GNs. If any of the GNs for a given prefix-projected instance have the same support as

Tp, then Tp is not closed.

Condition 2b: Growth Node extension maximality checking

A prefix tree Tp can be extended to T ′p in the same prefix-projected dataset using its

GNs. If any of the GNs for a given prefix-projected instance have support greater than

min supp, then Tp is not maximal.

The growth node extension checking is not a computationally expensive step as it

involves checking only the support of the GN , which is a 1-Length frequent subtree in the

projected dataset. This technique can be used to reduce the number of frequent subtrees

to generate concise frequent subtrees.
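Once the GN supports of a prefix-projected dataset are counted, Conditions 2a and 2b reduce to two cheap comparisons, sketched below. The function names are illustrative; the inputs mirror the running example, where the prefix < A1 − 1 > has support 3 and GNs 'B' (support 2) and 'E' (support 3).

```python
def fails_closure(prefix_supp, gn_supports):
    """Condition 2a: a GN with the same support means Tp is not closed."""
    return any(s == prefix_supp for s in gn_supports.values())

def fails_maximality(gn_supports, min_supp):
    """Condition 2b: any frequent GN at all means Tp is not maximal."""
    return any(s >= min_supp for s in gn_supports.values())

# GN supports in the <A1-1> projected dataset of the running example:
gns = {"B": 2, "E": 3}
```

Here 'E' shares the prefix's support of 3, so < A1 − 1 > is not closed; and 'B' is frequent at min supp=2, so < A1 − 1 > is not maximal either.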

Let us consider the next type of extension, where the extension of Tp occurs in a

different prefix-projected dataset from Tp. To check for the conciseness of the prefix-trees,

the following conditions are applied.

Condition 3a: Ancestor Node extension closure checking

If, for a prefix-tree Tp with m nodes, there exists a prefix-tree T′p with the same m nodes and an additional node b having the same support as that of Tp, in a different prefix-projected instance from that of Tp, then Tp is not closed and b is the ancestor node extension of Tp.


Condition 3b: Ancestor Node extension maximality checking

If, for a prefix-tree Tp with m nodes, there exists a prefix-tree T′p with the same m nodes and an additional node b having support ≥ min supp, in a different prefix-projected instance from that of Tp, then Tp is not maximal and b is the ancestor node extension of Tp.

In order to efficiently check for conciseness for ancestor node extensions, a technique called "maintain-and-test" is deployed. A naïve approach to checking for conciseness is to check all the ancestor node extension events based on their support. However, this is an expensive operation as it involves a large number of checks. To reduce the number of checks, a parameter, the sum of the tree IDs, is included to check for closure or maximality. To

apply this technique, first check whether for a given prefix-tree Tp there exists an ancestor

node extension of a node b in a different projected dataset, resulting in T ′p having the

same support and sum of tree IDs as Tp. If it exists then they are checked for closure or

maximality. The “maintain-and-test” approach reduces the number of checks as it avoids

checking all the prefix-trees with the same support and hence reduces the computational

overhead. Also, this concise checking technique is applied when the new frequent subtrees

are generated, which helps to reduce the number of concise checks. The algorithm and its

recursive function are presented in Figures 4.3 and 4.4, respectively.
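The "maintain-and-test" idea can be sketched as follows: candidate closed subtrees are indexed by (support, sum of tree IDs), and the expensive containment test runs only on key collisions. As in the earlier sketch, tree inclusion is approximated by a label-subsequence test, and the candidate triples are invented for illustration.

```python
from collections import defaultdict

def maintain_and_test(patterns):
    """patterns: (label string, support, set of tree IDs) triples.
    Returns the patterns surviving the ancestor-extension closure check."""
    def contains(big, small):           # crude stand-in for tree inclusion
        it = iter(big)
        return all(ch in it for ch in small)
    # Maintain: bucket candidates by (support, sum of tree IDs).
    index = defaultdict(list)
    for pat, supp, tids in patterns:
        index[(supp, sum(tids))].append(pat)
    # Test: run the containment check only within a bucket.
    survivors = []
    for pat, supp, tids in patterns:
        rivals = index[(supp, sum(tids))]
        if not any(r != pat and contains(r, pat) for r in rivals):
            survivors.append(pat)
    return survivors

# Hypothetical candidates: "AB" is absorbed by "ABC" (same key), "AF" is not.
cands = [("AB", 2, {1, 2}), ("ABC", 2, {1, 2}), ("AF", 2, {2, 3})]
```

The point of the key is that a supertree with the same support must occur in exactly the same trees, so candidates in different buckets can never close each other and need no containment test at all.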

4.6 Methods using the proposed techniques for generating concise frequent subtrees

This section discusses the use of the proposed techniques explained in the preceding sec-

tion in order to generate the different types of concise frequent subtrees. Figure 4.5 shows

the classification of the proposed methods for generating the two types of concise repre-

sentations and based on the node relationships of the subtrees. The following subsections


Input: Document Dataset: D, Document Tree Dataset: DT, Minimum Support: min supp, Length Constraint: const
Output: Concise Frequent Subtrees: CFS
begin
  1. Scan DT and find all 1-Length frequent subtrees f = {f1, f2, ..., fn};
  2. for the node b in every frequent subtree fi in f do
       if there exists the same ancestor node c for fi in all the document trees then
         Do not construct the prefix-tree projected dataset for fi;
       else
         i. Find all occurrences of fi in DT and construct the < fi − 1 >-projected dataset (i.e. ProDS(DT, < fi − 1 >)) by collecting all corresponding Projected-Instances in DT;
         ii. Apply Fre(< fi − 1 >, 1, ProDS(DT, < fi − 1 >), min supp, supp(fi)) to mine the projected dataset until no more GNs can be found;
         iii. Obtain the concise frequent subtrees CFS_ProDS from the Fre function on the projected dataset ProDS;
         iv. Insert CFS_ProDS into CFS;
       end
     end
end

Figure 4.3: Algorithm for generating concise frequent subtrees

discuss in detail how each of the proposed frequent pattern mining methods utilises the concise generation techniques described earlier.

4.6.1 Generating concise frequent induced subtrees

As induced subtrees preserve only the parent-child relationships, generating concise fre-

quent induced subtrees using the techniques explained in the previous section requires

imposing only the parent-child relationship and not the ancestor-descendant relationship.

For all the methods for concise frequent induced subtrees generation, the process begins

with the generation of 1-Length frequent subtrees and the identification of the prefix-tree

projected datasets based on the generated 1-Length frequent subtrees. The search space

reduction condition 1 is then applied using the generated 1-Length frequent subtrees.


Function Fre(Tp, n, ProDS(DT, Tp), min supp, prepat supp)

Input: prefix-tree: Tp, length of Tp: n, < Tp >-projected dataset: ProDS(DT, Tp), minimum support threshold: min supp, support of the previous subtree used to generate this projected dataset: prepat supp
Output: Concise Frequent Subtrees (CFS_ProDS)
begin
  1. Scan ProDS(DT, Tp) once to find all the 1-Length frequent GNs (GN0, ..., GNk) according to Condition 1;
  2. Set output = true;
  3. Count the support of all GNs;
  4. if supp(GN0 || GN1 || ... || GNk) == supp(Tp) then
       The subtree is not a CFS, output = false;
     end
  5. for each GNi in GN do
       if GNi is frequent then
         i. Extend Tp with GNi to form the prefix tree T′p;
         ii. if output then
               Insert T′p into CFS_ProDS;
             end
       else
         i. Check T′p for the occurrence of any of its subtrees with the same support and sum of tree IDs in the output;
         ii. if there exists any such subtree of T′p then
               Remove the subtree of T′p and insert T′p into CFS_ProDS;
             end
       end
     end
  6. Find all occurrences of GNi in ProDS(DT, Tp) and construct the < T′p >-projected dataset (i.e. ProDS(DT, T′p)) by collecting all corresponding Projected-Instances in ProDS(DT, Tp);
  7. Call Fre(T′p, n + 1, ProDS(DT, T′p), min supp, prepat supp) using the newly created T′p;
end

Figure 4.4: Function Fre for generating concise frequent subtrees


[Figure: classification tree of the proposed methods]

Figure 4.5: Classification of the proposed methods. By node relationship, the prefix-based pattern growth methods divide into methods for concise induced subtrees and methods for concise embedded subtrees; by conciseness, each group has closed, maximal and length constrained variants: PCITMiner (CFI), PMITMiner (MFI), PCITMinerConst (CFIConst), PMITMinerConst (MFIConst), PCETMiner (CFE), PMETMiner (MFE), PCETMinerConst (CFEConst) and PMETMinerConst (MFEConst).

Using these 1-Length frequent subtrees, the frequent node labels in the document trees

are identified. The frequent node labels in every document tree are checked for their

parent nodes. If the given 1-Length frequent subtrees contain the same parent node in

all the subtrees then the 1-Length frequent subtrees will not be used in projecting the

dataset as the projections created by the parent nodes of the 1-Length frequent subtrees

include the projections created by the 1-Length frequent induced subtrees. Hence, these

1-Length frequent subtrees can be removed from the set of 1-Length frequent subtrees,

thereby reducing the search space.

In the example dataset, the subtree having the node labelled ‘A’ is a root node in

all the trees in DT, so it is not checked for its parent node. However, the subtrees

containing the internal nodes ‘B’, ‘C’, ‘E’ and ‘F’ are checked for their parent nodes in

all the document trees in DT . This checking reveals that the parent node is ‘A’ in all

the trees; hence there is no need to project the subtrees (< B1 − 1 >: 2), (< C1 − 1 >:

2), (< E1 − 1 >: 3), (< F1 − 1 >: 2), as the projection of (< A1 − 1 >: 3) includes the

projections of all the other subtrees. By excluding the projections of the internal nodes,

the number of subtrees and the number of projections required are significantly reduced.

Due to the reduced search space, the efficiency of the method can be improved.


The concise checking techniques are applied using the generated 1-Length frequent

subtrees. The concise checking techniques depend on the type of subtree generated; hence they are described separately for each type of subtree.

4.6.1.1 Prefix-based Closed Induced Tree Miner (PCITMiner)

As the PCITMiner generates Closed Frequent Induced (CFI) subtrees, the next task fo-

cuses on utilising the Growth Node (GN) extension closure checking condition (condition

2a) to generate the CFI subtrees. The GN for induced subtrees is based on the parent-

child relationship, so these nodes are essentially the child nodes of the prefix tree Tp. To

check for closure, the support of each GN is stored. If any GN exists that has the

same support as that of Tp, then Tp is not output.

As can be seen from Table 4.12 for the < A1−1 > projected dataset, the growth nodes

are ‘B’ and ‘E’. The supp(‘B’)=2 and supp(‘E’)=3. It can be seen that the GN , ‘E’, has

the same support as that of the prefix tree < A1 − 1 >, hence this subtree is not output

as a CFI subtree. This is due to the fact that the prefix-tree < A1 − 1 > with the GN

‘E’ could result in a subtree < A1E2−1−1 > which will be the supertree for < A1−1 >,

having the same support as that of < A1 − 1 >.

Finally, the condition 3a for ancestor node extension is used in PCITMiner as a parent

node extension and the closure checking is applied when the prefix-tree occurs in a dif-

ferent projected dataset. To efficiently check for closure for parent node extensions using

the “maintain-and-test” technique, the sum of the tree IDs, is included. To apply this

technique, it must first be checked whether, for a given prefix-tree Tp , a parent node

extension of a node b exists in a different projected dataset, resulting in T ′p having the

same support and sum of tree IDs as Tp. If it exists then it is checked for closure. The

“maintain-and-test” approach reduces the number of checks as it avoids checking all the


prefix-trees with the same support, and thus reduces the computational overhead.

From Table 4.12, with Tp = < A1B2 − 1E3 − 1 − 1 > and prefix-tree T′p = < A1B2C3 − 1E4 − 1 − 1 − 1 >, it can be noted that T′p is generated from the prefix-tree < A1B2C3 − 1 − 1 − 1 > and not from Tp. Hence T′p occurs in a different projected dataset from that of Tp.

This shows that the condition 3a helps to identify this type of ancestor node closure.

4.6.1.2 Prefix-based Maximal Induced Tree Miner (PMITMiner)

PMITMiner adopts similar techniques to PCITMiner but it generates the MFI subtrees

that are maximal in their conciseness and maintain the parent-child relationship among their nodes.

To generate MFI subtrees using PMITMiner, the next task after applying search space

reduction technique is to apply a Growth Node (GN) extension maximality checking

condition (condition 2b). The GN for induced subtrees is based on the parent-child

relationship; these nodes are essentially the child nodes of the prefix tree Tp. To check for

maximality, if any GN having a support greater than the min supp is present, then Tp is not output as it is not maximal. This check differs from the closure check, where GNs with the same support as that of the prefix tree Tp are checked.

Table 4.12 shows that, for the < A1 − 1 > projected dataset, the growth nodes are 'B' and 'E', with supp('B')=2 and supp('E')=3. As the conciseness is based on

maximality, by applying condition 2b, the GNs are checked to determine whether they are frequent

(support greater than min supp) instead of checking them for the same support as in

closure. Hence, the GN , ‘B’, having a support greater than min supp, is frequent and

the prefix tree < A1 − 1 > is not maximal.

Finally, the ancestor node extension maximality checking condition (condition 3b) for

checking ancestor node extension is used in PMITMiner as a parent node extension (as


induced subtrees maintain the parent-child relationship among their nodes) and the maximality checking is applied when the prefix-tree occurs in a different projected dataset.

The “maintain-and-test” technique is utilised and the testing is based on the min supp

using only their support values as the sum of the tree-IDs is not required for checking this

condition. To apply this technique, it is first checked whether for a given prefix-tree Tp,

a parent node extension of a node b exists in a different projected dataset resulting in T ′p

having a support greater than min supp. If it exists, then only the prefix-tree is checked

for maximality.

The discussion on the concise frequent induced subtrees leads to the discussion of the

length constrained concise induced subtrees, PCITMinerConst and PMITMinerConst.

4.6.1.3 Length Constrained Prefix-based Closed Induced Tree Miner (PCITMinerConst)

The PCITMinerConst method adopts the same pruning and extension checking techniques

as PCITMiner; however, each generated subtree is checked for its length. This length is a user-defined parameter. Usually this parameter is set to a value greater than 2, as a

subtree with a length of 1 will be a node and a subtree with a length of 2 will be a path.

The PCITMinerConst method begins with the search space reduction technique, similar to

PCITMiner in which only the 1-Length frequent subtrees are generated. Hence, the con-

straint checking is not carried out in this technique. The constraint checking is performed

after applying both of the node extension checking techniques.

After applying condition 2a for growth node extension closure checking, if the Tp tree is frequent and the length of the Tp tree is within the user-defined length threshold, then the generated Tp is output as a CFIConst subtree; lastly, the projections for the


projected dataset are terminated when the generated Tp is equal to the length threshold.

Finally, condition 3a for ancestor node extension is used in PCITMinerConst as a parent node extension, and the closure checking, together with the length threshold, is applied when the prefix-tree occurs in a different projected dataset. The “maintain-and-test” technique, which uses the sum of the tree IDs and the support, is included to check for closure. To apply this technique, first check whether, for a given prefix-tree Tp, a parent node extension by a node b exists in a different projected dataset, resulting in a T′p having the same support and sum of tree IDs as Tp. If such a T′p exists, then Tp is checked for closure. This process is repeated until no more CFIConst subtrees can be generated.
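As a sketch of the decision just described, the following Python fragment combines the closure test with the length constraint. The names (emit_cfi_const, is_closed) and the exact threshold handling are illustrative assumptions, not the thesis implementation.

```python
# Hedged sketch: how a length constrained closed-subtree miner might decide
# whether to output a prefix-tree Tp and whether to keep projecting it.
# `tp` is a tuple of node labels; `is_closed` is a caller-supplied predicate
# standing in for the growth/ancestor extension closure checks (2a/3a).

def emit_cfi_const(tp, support, min_supp, length_threshold, is_closed, output):
    """Return True if the projected dataset of `tp` should be grown further."""
    if support < min_supp:              # infrequent prefix: prune the branch
        return False
    if is_closed(tp) and len(tp) >= length_threshold:
        output.append((tp, support))    # emit as a CFIConst subtree
    # terminate projection once the constraint length has been reached
    return len(tp) < length_threshold
```

Note that the length threshold gives an early-termination point that the unconstrained PCITMiner does not have: once a prefix reaches the constraint length, its projected dataset is never grown further.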

4.6.1.4 Length Constrained Prefix-based Maximal Induced Tree Miner (PMITMinerConst)

Similar to the PCITMinerConst method, PMITMinerConst adopts the same pruning and extension checking techniques for generating the length constrained maximal frequent induced subtrees (MFIConst subtrees). The search space reduction technique generates only the 1-Length frequent subtrees; hence, no constraint checking is carried out at this stage. The constraint checking is performed after both of the node extension checking techniques.

After applying conditions 2b and 3b, if the length of the Tp tree is greater than the user-defined length threshold, then the generated Tp is output as an MFIConst subtree and the projections for the projected dataset are terminated. Finally, condition 3b for ancestor node extension is used in PMITMinerConst as a parent node extension, and the maximality checking is applied when the prefix-tree occurs in a different projected dataset. The “maintain-and-test” technique is utilised and the testing is based on min_supp.


4.6.2 Generating concise frequent embedded subtrees

Generating concise frequent embedded subtrees with the proposed techniques involves utilising ancestor-descendant relationships. However, checking these relationships is much more difficult than checking parent-child relationships, as there can be different levels of descendants for a given ancestor.

Similar to induced subtrees, the generation of 1-Length frequent subtrees is conducted by scanning the document dataset, because no hierarchical relationships exist in these 1-Length frequent subtrees. Applying the search space reduction condition involves checking the ancestor nodes. However, checking all the ancestor nodes is an expensive task. Hence, a heuristic for searching the ancestor nodes is proposed: a node n is adopted as a candidate ancestor only if its support is equal to or greater than the support of the descendant node. This heuristic helps to reduce the number of ancestors that have to be searched.

Another simple heuristic checks whether a node n appears in every document tree in DT. If it does, then this node is not checked for its ancestors, since it is likely to be a root node. This heuristic is applicable only to specific datasets, which will be discussed further in Section 4.8.
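The two heuristics above can be sketched as follows. This is an illustrative Python fragment: `node_supports` (label → support) and `num_trees` are assumed inputs, and the `assume_shared_root` flag reflects the fact that the second heuristic applies only to specific datasets.

```python
# Hedged sketch of the ancestor-search heuristics.
# Heuristic 1: only labels at least as frequent as the descendant can be its
# ancestors, which shrinks the search space.
# Heuristic 2 (dataset-specific, enabled by a flag): a label that occurs in
# every tree of DT is treated as a probable root and not checked for ancestors.

def candidate_ancestors(node, node_supports, num_trees, assume_shared_root=False):
    if assume_shared_root and node_supports[node] == num_trees:
        return []                       # probable root: skip ancestor search
    threshold = node_supports[node]
    return sorted(label for label, sup in node_supports.items()
                  if label != node and sup >= threshold)
```

On the running example's supports (A:3, B:2, C:2, E:3, F:2 over three trees), ‘A’ is skipped as a probable root, while ‘B’ still has to test the four other labels and ‘E’ only has to test ‘A’.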

For example, among the 1-Length frequent embedded subtrees (< A1−1 >: 3), (< B1−1 >: 2), (< C1−1 >: 2), (< E1−1 >: 3), (< F1−1 >: 2), the subtree having the node labelled ‘A’ is a root node in all the trees in DT, so it is not checked for its ancestors. Nevertheless, the subtrees containing the internal nodes ‘B’, ‘C’, ‘E’ and ‘F’ are checked for their ancestor nodes in all the document trees in DT. This reveals that the ancestor node is ‘A’ in all the trees, so there is no need to project the subtrees. The projection of (< A1−1 >: 3) includes the projections of the following subtrees: (< B1−1 >: 2), (< C1−1 >: 2), (< E1−1 >: 3), (< F1−1 >: 2).
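For concreteness, a one-scan computation of the 1-Length frequent subtrees can be sketched as below. The three document trees are assumed toy contents, chosen only to reproduce the supports of the running example; they are not taken from the thesis.

```python
from collections import Counter

# Each document tree is flattened to its set of node labels: the support of a
# 1-Length subtree is the number of trees containing that label, obtainable
# with a single scan over DT.

def one_length_frequent(trees, min_supp):
    counts = Counter()
    for tree in trees:
        for label in set(tree):         # count a label once per tree
            counts[label] += 1
    return {label: sup for label, sup in counts.items() if sup >= min_supp}

# Assumed toy dataset reproducing the supports A:3, B:2, C:2, E:3, F:2.
DT = [["A", "B", "E"], ["A", "C", "E", "F"], ["A", "B", "C", "E", "F"]]
```

With min_supp = 2, this yields exactly the five 1-Length frequent subtrees listed above.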

4.6.2.1 Prefix-based Closed Embedded Tree Miner (PCETMiner)

The Growth Node (GN) extension closure checking condition (condition 2a) is applied to identify whether any GNs exist in the same projected dataset that have the same support as the prefix-tree Tp. This step involves checking whether any of the GNs have a support equal to that of Tp. If this is the case, then Tp is not output as a CFE subtree.

From the running example, when the prefix-tree projected dataset contains the prefix-tree dataset of < A1−1 >, it can be found that the node ‘E’ has the same support as that of its prefix-tree. Hence, < A1−1 > is not a CFE and is not output.

The ancestor node extension closure checking condition (condition 3a) is applied to verify the ancestor nodes in a different projected dataset. If a prefix-tree Tp with m nodes exists in one prefix-projected dataset, and a prefix-tree T′p exists in a different prefix-projected dataset with the common m nodes and an additional node b, having the same support as that of Tp, then Tp is not closed, as the additional node b in T′p is the ancestor node extension of Tp.

The “maintain-and-test” technique uses the sum of the tree IDs and the support as a hashing function to identify whether any T′p ⊃t Tp exists that has the same support as that of Tp. If one exists, then Tp is not output, by the closure property.
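A minimal sketch of the “maintain-and-test” bookkeeping is given below. The hash key is (support, sum of tree IDs) as described; representing subtrees as label sets and using proper-subset containment as the supertree test are simplifying assumptions for illustration only.

```python
# Maintained entries are grouped under the key (support, sum of tree IDs).
# A candidate Tp fails the closure test if a strict supertree T'p is already
# maintained under the same key (i.e. with identical support).

class MaintainAndTest:
    def __init__(self):
        self.buckets = {}

    def check_and_maintain(self, tp, support, tree_ids):
        key = (support, sum(tree_ids))
        closed = all(not set(tp) < set(other)
                     for other in self.buckets.get(key, []))
        self.buckets.setdefault(key, []).append(tp)
        return closed
```

The point of the composite key is that two subtrees with equal support but different occurrence lists hash to different buckets, so only genuinely comparable candidates are ever tested against each other.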

4.6.2.2 Prefix-based Maximal Embedded Tree Miner (PMETMiner)

The Growth Node (GN) extension maximality checking condition (condition 2b) is applied to identify whether any GNs exist in the same projected dataset that are frequent with the prefix-tree Tp. This step is computationally inexpensive, as it only involves checking whether the GN is frequent or not. If any frequent GN exists, then Tp is not output as an MFE subtree.

From the running example, it can be seen that the prefix-tree projected dataset contains the prefix-tree dataset of < A1−1 >. Several frequent GNs, namely the nodes ‘B’, ‘C’, ‘E’ and ‘F’, can be found; hence < A1−1 > is not an MFE and is not output.

The ancestor node extension maximality checking condition (condition 3b) is applied to verify the ancestor nodes in a different projected dataset. If a prefix-tree Tp with m nodes exists in one prefix-projected dataset, and a prefix-tree T′p exists in a different prefix-projected dataset with the common m nodes and an additional node b whose support is greater than min_supp, then Tp is not maximal, as the additional node b in T′p is the ancestor node extension of Tp. As with PCETMiner, the “maintain-and-test” technique is used; however, instead of using both the sum of tree IDs and the support value, only the support value is used as the hashing function to identify any frequent T′p, in which case Tp is not output.
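The contrast between the two checks can be sketched as follows: for maximality only frequency against min_supp matters, so Tp is discarded as soon as any extension is frequent. The function name and inputs are illustrative assumptions.

```python
# Maximality test: Tp can only be an MFE subtree if none of its growth-node
# extensions reaches min_supp (closure, by contrast, compares exact supports
# and tree-ID sums).

def is_maximal(growth_node_supports, min_supp):
    return all(sup < min_supp for sup in growth_node_supports.values())
```

On the running example, the prefix < A1−1 > has the frequent growth nodes ‘B’, ‘C’, ‘E’ and ‘F’, so it is immediately rejected as non-maximal.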

4.6.2.3 Length Constrained Prefix-based Closed Embedded Tree Miner (PCETMinerConst)

The PCETMinerConst method adopts the same techniques as PCETMiner; however, each generated subtree is checked against its length. As the search space reduction technique generates only the 1-Length frequent subtrees, no constraint checking is carried out at this stage. The constraint checking is performed after both of the node extension checking techniques.


After applying conditions 2a and 3a, if the length of the Tp tree is greater than the user-defined length threshold, then the generated Tp is output as a CFEConst subtree and the projections for the projected dataset are then terminated.

Finally, condition 3a for ancestor node extension is used in PCETMinerConst and the closure checking is applied when the prefix-tree occurs in a different projected dataset. The “maintain-and-test” technique is utilised and the testing is based on both the sum of tree IDs and the support of the subtrees. The pruning provided by the length threshold helps to terminate the frequent subtree mining process earlier than in PCETMiner.

4.6.2.4 Length Constrained Prefix-based Maximal Embedded Tree Miner (PMETMinerConst)

Similar to the PCETMinerConst method, PMETMinerConst adopts the same pruning and extension checking techniques as PMETMiner; however, each generated subtree is checked against its length. As the search space reduction technique generates only the 1-Length frequent subtrees, no constraint checking is carried out at this stage. The constraint checking is performed after both of the node extension checking techniques.

After applying conditions 2b and 3b, if the length of the Tp tree is greater than the user-defined length threshold, the generated Tp is output as an MFEConst subtree and the projections for the projected dataset are then terminated.

Finally, condition 3b for ancestor node extension is used in PMETMinerConst and the maximality checking is applied when the prefix-tree occurs in a different projected dataset. The “maintain-and-test” technique is utilised and the testing is based on min_supp.


To understand the effectiveness of the proposed methods for concise frequent subtree mining, they are benchmarked against other state-of-the-art frequent subtree mining methods on both synthetic and real-life datasets, using the evaluation measures discussed in Chapter 3.

4.7 Empirical evaluation

The experiments for frequent subtree mining were conducted to understand the effectiveness of the proposed concise frequent subtree mining methods over the prefix-based pattern growth methods for frequent subtree mining, PrefixTreeISpan [141] and PrefixTreeESpan [140], which generate induced and embedded subtrees respectively. In addition, experiments were conducted to understand the effectiveness of the proposed methods over another state-of-the-art apriori-based generate-and-test mining method, TreeMinerV, and over the enumeration-based method MB3-Miner [105] for generating embedded subtrees, as discussed in Chapter 3.

All the proposed methods were written in C++ with STL support and compiled using an Intel compiler with the -O3 optimisations. These methods were evaluated on both the synthetic datasets and the real-life datasets detailed in Chapter 3. Moreover, the experiments were conducted by constraining the length of the concise frequent subtrees generated using the length constrained concise frequent subtree mining methods. The range of the subtree length constraint (const) was set from 3 to 11. The lower bound of this range was set to 3 because a subtree of length 2 is a path and a subtree of length 1 is a node (or tag).
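The rationale for the constraint range can be captured in a small hypothetical helper; both function names are illustrative, not part of the thesis code.

```python
# Hypothetical helpers reflecting the constraint range used in the experiments:
# a length of 1 is a single node (tag) and a length of 2 is a path, so a
# tree-shaped length constraint is only meaningful from 3 upwards; the
# experiments sweep const from 3 to 11.

def classify_by_length(length):
    return {1: "node", 2: "path"}.get(length, "subtree")

def valid_const(const, lo=3, hi=11):
    return lo <= const <= hi
```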

This section is designed to perform a comparison of the following sets based on the

evaluation metrics and the datasets discussed in Chapter 3:


- Concise frequent subtrees vs frequent subtrees using prefix-based pattern mining methods;
- The prefix-based approach vs the generate-and-test approach;
- Closed vs maximal frequent pattern mining methods;
- Induced vs embedded frequent pattern mining methods; and
- Concise frequent pattern mining vs constrained concise frequent pattern mining methods.

The experimental study is conducted based on the divisions of the datasets discussed in the research design (Chapter 3). They are:

- Synthetic datasets
  - On the F5 dataset; and
  - On the D10 dataset.
- Real-life datasets
  - On the small-sized real-life dataset;
  - On the medium-sized real-life dataset; and
  - On the large-sized real-life datasets.

4.7.1 Evaluation of frequent pattern mining methods on synthetic datasets

In this subsection, the proposed concise frequent pattern mining methods are compared in terms of their runtime and the number of patterns on the two synthetic datasets, F5 and D10. The difference between these two datasets is their branching factor, with D10 having a higher fan-out (branching factor) than F5.


On F5 dataset

[Figure 4.6: Runtime and number of subtrees comparison on the F5 dataset. (a) Runtimes for induced subtree miners (PCITMiner, PMITMiner, PrefixTreeISpan); (b) runtimes for embedded subtree miners (PCETMiner, PMETMiner, PrefixTreeESpan, MB3-Miner, TreeMinerV); (c) number of frequent induced subtrees (CFI, MFI, FI); (d) number of frequent embedded subtrees (CFE, MFE, FE). X-axis: minimum support (min_supp in %); y-axes: runtime (in secs) and # frequent subtrees.]

Figure 4.6(a) shows that PCITMiner performs much faster than PrefixTreeISpan; PCETMiner performs faster than both TreeMinerV and PrefixTreeESpan, especially at lower support thresholds, as shown in Figure 4.6(b), and performs almost on a par with MB3-Miner. In spite of the very large number of subtrees in this dataset, both PCITMiner and PCETMiner perform better than the other methods, as shown in Figures 4.6(c) and (d).

A comparison of the length constrained mining methods for various constraint lengths is shown in Figure 4.7 (more results are presented in Appendices B.1 and B.2). It clearly reveals that the length constrained concise frequent induced subtree miners generate patterns in less time than the concise frequent embedded subtree miners. This is due to the time taken to identify the embedded relationships between the nodes and to apply the conciseness checks on them.

[Figure 4.7: Runtime and number of length constrained frequent concise subtrees comparison on the F5 dataset. (a), (b): runtimes of PCITMinerConst, PMITMinerConst, PCETMinerConst and PMETMinerConst against constraint length (3 to 11) at 2% and 10% min_supp; (c), (d): number of CFIConst, MFIConst, CFEConst and MFEConst subtrees at 2% and 10% min_supp.]

A comparison of the runtime of the length constrained concise

frequent subtree miners and their corresponding concise frequent subtree miners shows that fewer length constrained frequent induced subtrees are produced, due to the early termination based on the constraint length. On the other hand, the CFEConst subtrees are greater in number due to the various combinations of nodes and differences in support values, although there are fewer CFEConst subtrees than frequent embedded subtrees. The MFEConst subtrees, since maximality is based only on the support threshold and not on the individual support values, are fewer in number than the CFEConst subtrees. It can also be noted that the number of subtrees varies at lower support thresholds; however, there is no variation in the number of subtrees at higher support thresholds, even at 10%, due to far fewer longer trees in the dataset. It can be seen that the frequent subtrees could be generated faster and that, with an increase in the length of the frequent

subtrees could be generated faster and also with an increase in the length of the frequent


subtrees, the number of maximal subtrees generated could be reduced.

On D10 dataset

Figure 4.8 shows that PCITMiner generates concise frequent subtrees faster than its counterpart PCETMiner on the D10 dataset. This is due to the larger number of concise frequent embedded subtrees generated in comparison to the concise frequent induced subtrees. The same reason applies for the embedded subtree miners MB3-Miner and TreeMinerV, which perform faster at the lower threshold of 1% on D10. However, on the same dataset, at supports higher than 2%, PCETMiner is almost equal to or faster than MB3-Miner and TreeMinerV. PCETMiner performs much faster than the prefix-based pattern growth method for frequent embedded subtrees, PrefixTreeESpan, while achieving a 13-fold reduction in the number of subtrees generated. In general, the proposed PCITMiner and PCETMiner methods produce far fewer patterns in comparison to the other induced and embedded subtree miners.

Table 4.15: Runtime comparison (in secs) of length constrained subtrees on the D10 dataset

Min_supp (%)  Const  PCITMinerConst  PMITMinerConst  PCETMinerConst  PMETMinerConst
2             3      0.76            0.75            1.03            1.04
2             5      1.04            1.05            2.09            2.11
2             7      1.12            1.13            2.42            2.42
2             9      1.1             1.13            2.43            2.43
2             11     1.14            1.12            2.45            2.45
4             3      0.58            0.58            0.61            0.61
4             5      0.71            0.73            0.78            0.79
4             7      0.73            0.73            0.79            0.79
4             9      0.75            0.74            0.79            0.79
4             11     0.75            0.75            0.8             0.8
6             3      0.52            0.5             0.49            0.49
6             5      0.6             0.59            0.61            0.61
6             7      0.58            0.58            0.58            0.59
6             9      0.61            0.6             0.6             0.6
6             11     0.6             0.6             0.58            0.58
8             3      0.43            0.44            0.33            0.33
8             5      0.43            0.44            0.34            0.34
8             7      0.43            0.44            0.34            0.34
8             9      0.43            0.44            0.34            0.35
8             11     0.44            0.44            0.34            0.34
10            3      0.43            0.44            0.33            0.33
10            5      0.43            0.44            0.34            0.34
10            7      0.43            0.44            0.34            0.34
10            9      0.43            0.44            0.34            0.34
10            11     0.44            0.44            0.34            0.34

It is interesting to note from Tables 4.15 and 4.16 that the constrained frequent pattern mining methods produce fewer MFEConst subtrees than CFIConst subtrees. This shows the strength of maximality, even while preserving the embedded relationship. With respect to the runtime, PCITMinerConst and PMETMinerConst perform the same.

[Figure 4.8: Runtime and number of subtrees comparison on the D10 dataset. (a) Runtimes for induced subtree miners (PCITMiner, PMITMiner, PrefixTreeISpan); (b) runtimes for embedded subtree miners (PCETMiner, PMETMiner, PrefixTreeESpan, MB3-Miner, TreeMinerV); (c) number of frequent induced subtrees (CFI, MFI, FI); (d) number of frequent embedded subtrees (CFE, MFE, FE). X-axis: minimum support (min_supp in %); y-axes: runtime (in secs) and # frequent subtrees.]

Table 4.16: Length constrained subtrees in the D10 dataset

Min_supp (%)  Const  PCITMinerConst  PMITMinerConst  PCETMinerConst  PMETMinerConst
2             3      17              14              42              39
2             5      26              13              83              56
2             7      27              9               56              28
2             9      25              7               51              23
2             11     25              7               51              23
4             3      9               7               15              12
4             5      10              4               15              8
4             7      10              4               15              8
4             9      10              4               15              8
4             11     10              4               15              8
6             3      6               4               11              8
6             5      7               2               11              5
6             7      7               2               11              5
6             9      7               2               11              5
6             11     7               2               11              5
8             3      3               2               4               2
8             5      3               2               4               2
8             7      3               2               4               2
8             9      3               2               4               2
8             11     3               2               4               2
10            3      3               2               4               2
10            5      3               2               4               2
10            7      3               2               4               2
10            9      3               2               4               2
10            11     3               2               4               2

A comparison of the results on the two synthetic datasets, F5 and D10, reveals that, with respect to the branching factor, the proposed concise frequent mining methods are scalable at very low support thresholds and can reduce the number of subtrees by up to 90%, especially for embedded subtrees. In spite of the high branching factor, which results in a very large number of frequent subtrees due to the many possible ancestor-descendant relationships, the proposed methods produce a reduced number of subtrees with a runtime almost comparable to the other benchmarks. Hence, the proposed methods are also suitable for datasets with a high branching factor.

4.7.2 Evaluation of frequent pattern mining methods on real-life datasets

This subsection evaluates the proposed frequent pattern mining methods on real-life datasets, namely the ACM, DBLP, INEX 2007, INEX IEEE and INEX 2009 datasets, using the evaluation metrics discussed in Chapter 3.

4.7.2.1 On small-sized real-life dataset

The runtime and the number of subtrees comparisons of the frequent pattern mining

methods are shown in Figure 4.9 for the ACM dataset.

It can be noted that the number of MFI subtrees generated at the 10% and 20% support thresholds was equal to 3, and was 4 for higher thresholds. As the documents essentially came from two DTDs, there is not much variation in the MFI subtrees. On the other hand, this commonality cannot be identified using only the frequent induced subtrees. If the grouping is based on the structure of the documents, then the MFI subtrees can easily be used to identify the structural clusters. The impact of the constraints is shown in Figure 4.10; however, there is very minimal difference in the number of length constrained concise frequent patterns and also in the runtime.


[Figure 4.9: Runtime and number of subtrees comparison on the ACM dataset. (a) Runtimes for induced subtree miners (PCITMiner, PMITMiner, PrefixTreeISpan); (b) runtimes for embedded subtree miners (PCETMiner, PMETMiner, PrefixTreeESpan, MB3-Miner, TreeMinerV); (c) number of frequent induced subtrees (CFI, MFI, FI); (d) number of frequent embedded subtrees (CFE, MFE, FE). X-axis: minimum support (min_supp in %, 10 to 50); y-axes: runtime (in secs) and # frequent subtrees.]

[Figure 4.10: Runtime and number of length constrained frequent concise subtrees comparison on the ACM dataset. (a), (b): runtimes of the length constrained subtree miners against constraint length (3 to 9) at 10% and 30% min_supp; (c), (d): number of CFIConst, MFIConst, CFEConst and MFEConst subtrees at 10% and 30% min_supp.]


[Figure 4.11: Runtime and number of subtrees comparison on the DBLP dataset. (a) Runtimes for induced subtree miners (PCITMiner, PMITMiner, PrefixTreeISpan); (b) runtimes for embedded subtree miners (PCETMiner, PMETMiner, PrefixTreeESpan, MB3-Miner, TreeMinerV); (c) number of frequent induced subtrees (CFI, MFI, FI); (d) number of frequent embedded subtrees (CFE, MFE, FE). X-axis: minimum support (min_supp in %, 10 to 50); y-axes: runtime (in secs) and # frequent subtrees.]

4.7.2.2 On medium-sized real-life dataset

It can be seen from Figure 4.11 that the proposed methods identified the concise frequent subtrees within a reasonable duration on the DBLP dataset. A comparison of the runtime between the concise frequent subtree miners revealed that there is a minimal difference in the runtime (≈ 0.1 secs). This is because the average length of the subtrees in the dataset is only 25 (as mentioned in Chapter 3) and the frequent patterns do not have much variation in their node relationships.

From the comparison of the length constrained frequent subtrees at the 10% and 30% support thresholds in Figure 4.12, both PCITMinerConst and PMITMinerConst take almost the same amount of time to produce varied length subtrees, but the number of MFIConst subtrees produced by PMITMinerConst is much smaller than the number of CFIConst subtrees. This implies that there are a large number of subtrees with different support values, and maximality replaces them with their supertrees. However, PMITMinerConst takes almost the same time as PCITMinerConst due to the large number of maximality checks required for these subtrees.

[Figure 4.12: Runtime and number of length constrained frequent concise subtrees comparison on the DBLP dataset. (a), (b): runtimes of PCITMinerConst, PMITMinerConst, PCETMinerConst and PMETMinerConst against constraint length (3 to 11) at 10% and 30% min_supp; (c), (d): number of CFIConst, MFIConst, CFEConst and MFEConst subtrees at 10% and 30% min_supp.]

4.7.2.3 On large-sized real-life datasets

In this subsection, the experiments were conducted on the large-sized real-life datasets, namely the INEX 2007, INEX IEEE and INEX 2009 datasets, to evaluate the efficiency of the proposed methods and the benchmarks.

INEX 2007 dataset

Figure 4.13 shows that the proposed concise frequent subtree mining methods were scalable on this large dataset, even with a maximum length of 48. The other benchmarks, PrefixTreeESpan and TreeMinerV, were scalable only for higher support thresholds on this dataset and failed at lower support thresholds. Though PrefixTreeESpan was scalable, it takes longer than PCETMiner. Also, TreeMinerV fails to produce any results for support thresholds less than 40%. This behaviour of the frequent pattern mining methods shows the strength of applying the proposed methods to generate concise frequent subtrees, rather than the benchmarks to generate frequent subtrees. It is interesting to note that, at a support threshold of 10%, there are 10 times more CFE subtrees than CFI subtrees. Figures 4.14 and 4.15 show the length constrained frequent subtree miners at support thresholds of 20% and 50% respectively.

[Figure 4.13: Runtime and number of subtrees comparison on the INEX 2007 dataset. (a) Runtimes for induced subtree miners (PCITMiner, PMITMiner, PrefixTreeISpan); (b) runtimes for embedded subtree miners (PCETMiner, PMETMiner, PrefixTreeESpan, TreeMinerV); (c) number of frequent induced subtrees (CFI, MFI, FI); (d) number of frequent embedded subtrees (CFE, MFE, FE). X-axis: minimum support (min_supp in %, 10 to 50); y-axes: runtime (in secs) and # frequent subtrees.]

On the other large-sized datasets, INEX IEEE and INEX 2009, where the average

length of the subtrees is greater than 100, the number of documents are large (greater

than 5000), and therefore none of the frequent pattern mining methods could be ap-

plied. However, to identify concise frequent subtrees, the proposed length constrained

132

Page 157: Enriching XML Documents Clustering by using Concise ... · Sangeetha Kutty MCIS. (Auckland University of Technology, New Zealand), B.Eng. (University of Madras, India) Thesis submitted

[Figure 4.14 panels, all at 20% min_supp on INEX 2007, x-axis constraint length 3-11: (a) Runtimes for Induced subtrees: PCITMinerConst and PMITMinerConst, runtime in secs; (b) Runtimes for Embedded subtrees: PCETMinerConst and PMETMinerConst; (c) No. of Induced subtrees: CFIConst and MFIConst; (d) No. of Embedded subtrees: CFEConst and MFEConst.]

Figure 4.14: Runtime and number of length constrained frequent subtrees comparison on the INEX 2007 dataset at 20%

[Figure 4.15 panels, at 50% min_supp on INEX 2007, x-axis constraint length 3-9: (a) Runtimes for subtrees: PCITMinerConst, PMITMinerConst, PCETMinerConst and PMETMinerConst, runtime in secs; (b) No. of subtrees: CFIConst, MFIConst, CFEConst and MFEConst.]

Figure 4.15: Runtime and number of length constrained frequent subtrees comparison on the INEX 2007 dataset at 50%

concise frequent subtree miners, PCITMinerConst, PMITMinerConst, PCETMinerConst

and PMETMinerConst, are applied.

INEX IEEE dataset

Although the number of document trees was only 6054, the average length of a document was more than 75. This resulted in a very dense dataset, and none of the benchmark frequent pattern mining methods was able to mine for frequent patterns.

[Figure 4.16 panels on INEX IEEE, x-axis constraint length 3-11: (a) Runtimes for Induced subtrees at 20% min_supp: PCITMinerConst and PMITMinerConst, runtime in secs; (b) Runtimes for Induced subtrees at 50% min_supp; (c) No. of Induced subtrees at 20% min_supp: CFIConst and MFIConst; (d) No. of Induced subtrees at 50% min_supp.]

Figure 4.16: Runtime and number of length constrained frequent induced subtrees comparison on the INEX IEEE dataset at 20% and 50%

[Figure 4.17 panels on INEX IEEE: (a) Runtimes for Embedded subtrees: PCETMinerConst and PMETMinerConst, runtime in secs vs minimum support (min_supp in %); (b) No. of Embedded subtrees at const = 3: CFEConst and MFEConst vs minimum support.]

Figure 4.17: Runtime and number of length constrained frequent embedded subtrees comparison on the INEX IEEE dataset at 50%

Also, it took longer than 2 hours for PCITMiner, PMITMiner, PCETMiner and PMETMiner to mine for frequent subtrees; hence the mining process was terminated. However, the length constrained concise frequent subtree miners were applied, and the results are reported in Figures 4.16 and 4.17.

INEX 2009 dataset

Figures 4.18 and 4.19 show that all the length constrained induced and embedded subtree miners are scalable for the INEX 2009 dataset with varying constraint lengths. This shows that these concise frequent pattern mining methods can be applied to such highly dense and deeply structured datasets. These miners were able to mine for frequent subtrees even as the constraint length increased, at both lower and higher support thresholds.

[Figure 4.18 panels on INEX 2009, x-axis constraint length 3-11: (a) Runtimes for Induced subtrees at 20% min_supp: PCITMinerConst and PMITMinerConst, runtime in secs; (b) Runtimes for Induced subtrees at 50% min_supp; (c) No. of Induced subtrees at 20% min_supp: CFIConst and MFIConst; (d) No. of Induced subtrees at 50% min_supp.]

Figure 4.18: Runtime and number of length constrained frequent induced subtrees comparison on the INEX 2009 dataset at 20% and 50%

Tables 4.17 and 4.18 provide a comparison of the performance of the proposed methods and the benchmarks on both the synthetic and real-life datasets, where λ and ρ denote the runtime and the number of patterns generated for each of the methods. "F" indicates that the method fails to scale for the respective dataset. In this setting, the support threshold

[Figure 4.19 panels on INEX 2009 at const = 3, x-axis minimum support (min_supp in %, 20-70): (a) Runtimes for Embedded subtrees: PCETMinerConst and PMETMinerConst, runtime in secs; (b) No. of Embedded subtrees: CFEConst and MFEConst.]

Figure 4.19: Runtime and number of length constrained frequent embedded subtrees comparison on the INEX 2009 dataset at 50%

of 2% is used for the synthetic datasets; ACM, DBLP, INEX IEEE and INEX 2009 are mined at 20%, and INEX 2007 at 30% (as TreeMinerV fails below this threshold).

Table 4.17: Summary of frequent pattern mining results on synthetic datasets

                          F5               D10
    Methods               λ       ρ        λ       ρ
    PCITMinerConst        0.33    17       0.76    17
    PMITMinerConst        0.33    9        0.75    14
    PCETMinerConst        0.6     21       1.03    45
    PMETMinerConst        0.6     23       1.04    87
    PCITMiner             0.51    21       0.73    32
    PMITMiner             0.57    7        0.66    6
    PCETMiner             0.65    21       1.55    51
    PMETMiner             0.65    10       1.56    23
    PrefixTreeISpan       0.5     44       0.73    72
    PrefixTreeESpan       0.5     289      2.64    104
    MB3-Miner             0.83    289      1.47    104
    TreeMinerV            1.16    289      1.64    104

From the empirical evaluation of the runtime and the number of frequent subtrees (summarised in Figures 4.20 and 4.21), it is clear that the proposed methods for concise frequent subtrees, PCITMiner, PMITMiner, PCETMiner and PMETMiner, perform better than PrefixTreeISpan [141] and PrefixTreeESpan [140], not only in reducing the number of subtrees but also in reducing runtimes. However, these methods took longer on the INEX IEEE and INEX 2009 datasets, which contain longer documents; hence the constrained concise frequent pattern mining methods were applied to these datasets. MB3-Miner performs the same as TreeMinerV on the ACM dataset, as shown in Figure 4.20 (a). On the DBLP dataset, MB3-Miner clearly outperforms TreeMinerV; however, for

Table 4.18: Summary of frequent pattern mining results on real-life datasets

                          ACM              DBLP             INEX 2007         INEX IEEE        INEX 2009
    Methods               λ       ρ        λ       ρ        λ        ρ        λ       ρ        λ        ρ
    PCITMinerConst        0.12    842      0.3     109      2.4      32       1.97    169      104.6    263
    PMITMinerConst        0.15    841      0.3     103      3.65     20       1.97    132      72.87    210
    PCETMinerConst        10.09   1861     0.54    232      11.97    134      44.31   2006     3347.16  3657
    PMETMinerConst        10.16   1203     0.54    186      13.1     134      44.31   2006     3349.17  3619
    PCITMiner             0.12    19       0.14    34       13.91    70       F       F        F        F
    PMITMiner             0.14    3        0.17    6        13.9     11       F       F        F        F
    PCETMiner             2.5     18       0.34    73       1503     1002     F       F        F        F
    PMETMiner             2.5     6        0.34    10       1503.3   253      F       F        F        F
    PrefixTreeISpan       0.12    3265     0.36    337      19.6     665      F       F        F        F
    PrefixTreeESpan       1.54    69519    0.42    448      1761.47  11705    F       F        F        F
    MB3-Miner             1       69519    0.13    448      F        F        F       F        F        F
    TreeMinerV            1.02    69519    0.25    448      13677.1  11705    F       F        F        F

[Figure 4.20 panels, runtime (in secs, log scale) vs number of frequent subtrees for PCITMinerConst, PMITMinerConst, PCETMinerConst, PMETMinerConst, PCITMiner, PMITMiner, PCETMiner, PMETMiner, PrefixTreeISpan, PrefixTreeESpan, MB3Miner and TreeMinerV: (a) ACM; (b) DBLP; (c) INEX 2007.]

Figure 4.20: Comparison of the runtimes vs number of frequent subtrees on ACM, DBLP and INEX 2007 datasets

[Figure 4.21 panels, runtime (in secs, log scale) vs number of frequent subtrees for the same twelve miners as Figure 4.20: (a) INEX IEEE; (b) INEX 2009.]

Figure 4.21: Comparison of the runtimes vs number of frequent subtrees on INEX IEEE and INEX 2009 datasets

lower support thresholds it performs faster than PrefixTreeESpan, and in some situations it is almost similar to PCETMiner and PMETMiner. However, at higher support thresholds, the latter two methods are faster than MB3-Miner. Also, MB3-Miner could not scale for the INEX 2007 dataset, but PrefixTreeESpan and PCETMiner could efficiently mine for frequent subtrees, even at lower support thresholds, in less than 45 seconds. The memory consumption of MB3-Miner exceeded the experimental set-up of 16GB of memory, even for supports of about 50%, and therefore the mining process was terminated. From this empirical evaluation, it can be seen that MB3-Miner is not scalable for large-sized datasets or for datasets with deeper trees and/or a large branching factor. However, it is efficient for small and medium-sized datasets such as ACM and DBLP, which have fewer documents and narrower, shorter branches than large-sized datasets have. In addition, MB3-Miner could not scale for either the INEX 2009 or the INEX IEEE dataset, and hence its results on them were not reported.

4.8 Discussion and summary

This section discusses the algorithmic design of the proposed frequent pattern mining

methods and the empirical evaluation of these methods and the benchmarks conducted in

this chapter.

4.8.1 Algorithmic Design

The algorithmic complexity of the concise frequent pattern mining algorithms is O(d ∗ s ∗ m), where d is the number of documents, s is the number of 1-length concise frequent subtrees, and m is the number of iterations of the function Fre in the concise frequent pattern mining algorithm. For the length constrained concise

frequent pattern mining algorithm, the iteration m terminates early once the length of the generated subtrees equals const.

4.8.2 Empirical Evaluation

• As can be seen from the mining results, frequent pattern mining methods such as TreeMinerV and MB3-Miner were not scalable for the INEX 2007, INEX IEEE and INEX 2009 datasets. These datasets have a larger number of nodes and longer document trees than the other datasets. The proposed methods were able to mine concise frequent subtrees from these large-sized datasets with the aid of tree length constraints.

• The results from the larger datasets clearly indicate that post-processing frequent subtrees to generate concise frequent subtrees is not practically feasible, since the frequent subtree mining methods could not complete the mining process due to the explosion in the number of frequent subtrees generated. The process of generating concise frequent subtrees should instead be embedded within the frequent pattern mining process, as is done in this thesis.

• As the parent-child relationship is a subset of the ancestor-descendant relationship, a larger number of embedded subtrees is generated than induced subtrees. In some datasets, there were about 10 times as many embedded subtrees as induced subtrees.

• The maximality property helps to reduce the number of subtrees relative to closed and frequent subtrees. Unlike the methods for generating closed frequent subtrees, the methods for generating maximal frequent subtrees, such as PMITMiner, PMETMiner, PMITMinerConst and PMETMinerConst, avoid checking the GNs for the same support, and hence they are more effective than closed frequent subtree mining methods in reducing the number of frequent patterns. This is because maximality is a less stringent condition, and an individual maximality check is faster than a closure check. However, on small datasets the number of checks required means that identifying maximal patterns takes almost the same amount of time as closure, and on datasets with more patterns the number of maximality checks grows accordingly, which can make maximality checking take longer than closure checking. Though maximality results in a reduced number of subtrees, it is essential to identify whether this causes any information loss in later applications. Hence, in the next section, a comparison of clustering accuracy is conducted on both maximal and closed frequent subtrees.

• Furthermore, since the length constrained subtree mining methods are efficient, their effectiveness was evaluated by conducting a selectivity analysis. The length constrained subtree mining methods produce length constrained concise frequent patterns which may not be fully concise for the given dataset, due to early termination. This depends on the nature of the dataset; at lower const values, the number of subtrees generated is smaller. However, in some datasets, such as ACM and DBLP, when the length of the subtrees is between 4 and 8, the number of length constrained concise subtrees is greater than the number of concise subtrees. Hence, it is essential to understand the impact of these generated subtrees on the clustering task. Moreover, it is also vital to understand whether any information loss occurs in using the length constrained subtrees in later applications.

• The empirical analysis reveals that the proposed concise frequent subtree mining methods are scalable not only to 100,000 documents (in the F5 and D10 datasets) but also to datasets with a high branching factor, with document trees of more than 10,000 nodes and about 34,000 tags (in the INEX 2009 dataset).

• In spite of studies indicating that the performance of some existing closed frequent subtree mining methods degrades for datasets with a high branching factor [107], the experimental results show that the response time of PCITMiner on datasets with highly branched trees, F5 and D10, is lower than that of PrefixTreeISpan.

• A comparison of the prefix-based pattern growth and generate-and-test approaches reveals that most of the methods using the latter approach cannot scale to dense datasets that methods using the prefix-based pattern growth approach can handle.

4.9 Chapter Conclusion

This chapter has provided the details for the proposed concise frequent subtree mining

methods. The pre-processing steps required to extract trees from the XML data were

presented. An overview of the various types of concise frequent subtrees was presented

along with the details of the constraint parameters. Several concise frequent subtree mining methods were proposed using the two optimisation techniques: search space reduction using backward scan, and node extension checking.

Furthermore, all the proposed methods were evaluated on both synthetic and real-life

datasets exhibiting extreme characteristics. These methods were also benchmarked against

the existing state-of-the-art frequent pattern mining methods for both the induced and

the embedded frequent subtrees. While doing so, various parameters, such as the runtime

required to generate the concise frequent subtrees, the number of frequent subtrees and

the number of projections required to generate them at various support thresholds, were

also examined. This chapter has also conducted an in-depth sensitivity analysis of the length constraint on the runtime, the number of projections and the number of frequent subtrees.


Chapter 5

XML Clustering

5.1 Introduction

The discussion in Chapter 2 established the merit of combining both the structure and the content features for XML clustering. Chapter 2 also laid the foundation of a hybrid XML clustering approach that would benefit from using frequent subtrees in the clustering process of XML documents. Chapter 4, on frequent pattern mining approaches, emphasised the importance of concise patterns not only in reducing the runtime but also in deriving meaningful information. The focus of this chapter is to present a novel clustering methodology that utilises the concise frequent subtrees and their corresponding content, both implicitly and explicitly.

This chapter explains and analyses the XML clustering methodology which has been

developed in this thesis. The chapter begins with an overall presentation of the proposed

Hybrid Clustering of XML documents (HCX) methodology, giving details of each of the

steps in the subsequent sections. The second section explains the details of the HCX using

the Vector Space Model to implicitly express the relationship between the structure and

the content features in XML documents. The third section introduces the novel approach

using the Tensor Space Model to explicitly express the relationship between these two features. An in-depth analysis of the two clustering approaches against other state-of-the-art approaches was conducted using real-life datasets. The final section provides a discussion of the analysis of the proposed clustering approaches.

5.2 Hybrid Clustering of XML Documents (HCX) Methodology: An Overview

Figure 5.1 provides the overview of the HCX methodology for clustering the XML doc-

uments. It begins by utilising the concise frequent subtree mining methods to generate

different types of concise frequent subtrees. Using one of the different types of concise fre-

quent subtrees, it can extract the content from the XML documents and represent both of

these features for clustering. There are two models in this methodology that represent the

structure and the content of the XML documents, the Vector Space Model (VSM) and the

Tensor Space Model (TSM). Each of the two models adopts different techniques to utilise

the generated concise frequent subtrees to extract the content from the XML documents.

The clustering methods using the VSM and the TSM are called the Hybrid Clustering of

XML documents using the Vector Space Model (HCX-V) and Hybrid Clustering of XML

documents using the Tensor Space Model (HCX-T) respectively.

5.2.1 Hybrid Clustering of XML documents using the Vector Space Model (HCX-V)

This method involves non-linearly combining the structure and the content of XML docu-

ments in the VSM. HCX-V begins by identifying the documents that contain the frequent

subtrees and then extracting their corresponding content. The content thus extracted is

[Figure 5.1 flowchart: XML documents feed into the application of concise frequent subtree mining using prefix-based pattern growth, producing CFI, MFI, CFIConst, MFIConst, CFE, MFE, CFEConst or MFEConst subtrees. In the HCX-V branch, the coverage of the documents for the generated concise frequent subtrees is identified, content is extracted from the XML documents using the coverage and represented in Pre-cluster Form (PCF), the PCFs are combined to form the Intermediate Cluster Form (ICF) in a Vector Space Model (VSM), and a partitional clustering algorithm is applied. In the HCX-T branch, the concise frequent subtrees are clustered, the coverage of the documents for the clusters of concise frequent subtrees is identified, content is extracted using the coverage and represented in PCF, the ICF is generated in a Tensor Space Model (TSM), Random Indexing is optionally applied, and a tensor decomposition algorithm is applied, with the decomposed values clustered into Cluster 1 ... Cluster N.]

Figure 5.1: Hybrid Clustering of XML documents (HCX) methodology

represented in an Intermediate Cluster Form (ICF), which is a VSM. The ICF combines the structural commonalities, in the form of concise frequent subtrees, with the content information of XML documents to group the XML documents together. Hence, this phase utilises an implicit combination of the structure and the content of the XML documents.

5.2.2 Hybrid Clustering of XML documents using the Tensor Space Model (HCX-T)

HCX-T combines the structure and the content of XML documents in a novel way using a multi-dimensional model. This involves clustering the concise frequent subtrees and then using these clusters to extract the content. The extracted content is then represented in three dimensions of the XML documents, the document id, the structure and the content, in a higher-order tensor model. Unlike the VSM, the TSM represents both features, the structure and the content, in an explicit manner for a given document. As indicated in italics in Figure 5.1, a Random Indexing step can also be included; however, this is essentially applied only for very large datasets, when the number of features is too high.
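The explicit three-way representation can be sketched as a sparse third-order tensor of counts. This is a minimal sketch under assumed inputs: a dict keyed by (document id, structure, term) triples stands in for the tensor, and the structure labels are hypothetical subtree-cluster ids; the random indexing and tensor decomposition steps are not shown.

```python
# Sketch of a sparse 3rd-order TSM cell store:
# (document id, structure, content term) -> occurrence count.
from collections import defaultdict

tensor = defaultdict(int)

# Toy (doc id, subtree-cluster id, term) observations; the labels are
# hypothetical, only the three-dimensional indexing is the point here.
observations = [
    (0, "struct_cluster_1", "xml"),
    (0, "struct_cluster_1", "xml"),
    (0, "struct_cluster_2", "mining"),
    (1, "struct_cluster_1", "cluster"),
]
for doc_id, struct, term in observations:
    tensor[(doc_id, struct, term)] += 1

print(tensor[(0, "struct_cluster_1", "xml")])   # 2
print(len(tensor))                              # 3 non-zero cells
```

A real TSM would next unfold (matricise) or decompose this tensor; the sparse keyed form simply makes the explicit document/structure/content indexing visible.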

5.3 Using the Vector Space Model (VSM)

This subsection details the use of the Vector Space Model (VSM) [95] for non-linearly combining the structure and the content for clustering. The VSM is a model for representing text documents, or any objects, as vectors of identifiers, for example, terms. When using the VSM for XML clustering, the feature of the document content is a set of terms, and the feature of the document structure is a set of substructures such as tags, paths, subtrees, or subgraphs. In this research, the concise frequent subtrees will be used to

represent the document structure.

Figure 5.2 provides the high-level definition of Hybrid Clustering of XML documents using a Vector Space Model (HCX-V). The first task in this clustering method is to identify the coverage of the concise frequent subtrees generated by applying the proposed concise frequent pattern growth methods (from Chapter 4) to the XML datasets.

Input: Document Dataset D, Document Tree Dataset DT, Minimum Support min_supp, Length Constraint const, Number of Clusters c, Concise Frequent Subtrees {CF1, . . . , CFj}
Output: Clusters {Clust1, . . . , Clustc}
begin
    for every document Di ∈ D do
        1. Identify the coverage of the document tree DTi, δ(DTi) = {CF1, . . . , CFj′} ;
        2. for every CFj′ ∈ δ(DTi) do
            i. Extract the structure-constrained content in Di,
               C(Di, CFj′) = {C(N1), . . . , C(Nm)}, where each set of terms
               C(Ni) = {t1, . . . , tk′} ∈ T and T is the term list for the document Di ;
            ii. Represent the PCF of Di as a vector of the sum of occurrences of the
                unique terms in C(Di, δ(DTi)) ;
        end
    end
    3. Combine the PCFs of all Di in D to form the ICF ;
    4. Divide the ICF into two clusters Clustx and Clusty ;
    5. while the similarity criterion of the content collection clusters Clustx and Clusty
       is greater than a threshold do
        Bisect Clustx and Clusty until the number of clusters obtained is c ;
    end
end

Figure 5.2: High level definition of HCX-V approach

5.3.1 Identifying the coverage of concise frequent subtrees

The coverage of the document trees is identified from the given set of concise frequent

subtrees for the document trees dataset. The concept of coverage of the document trees

for a given concise representation is defined below.

Definition: Coverage of a document tree

Let there be a document tree dataset DT which, on applying the proposed concise frequent pattern mining methods, yields a set of concise frequent subtrees CF = {CF1, . . . , CFj}. A document tree DTi ∈ DT is said to be covered by a concise frequent subtree CFj′ ∈ CF if DTi preserves the same relationship among its nodes as that of CFj′. The coverage of a document tree DTi ∈ DT, denoted by δ(DTi), is the set of concise frequent subtrees {CF1, . . . , CFj′} ⊆ CF, where j′ ≤ j, each of which covers DTi.
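Under a simplified encoding, the coverage test reads as a subset check. In the sketch below a tree is reduced to a frozenset of (parent label, child label) pairs, so "preserves the same relationship among its nodes" becomes pair-set containment; this encoding and the example data are illustrative assumptions, not the matching procedure used by the miners.

```python
# Sketch of coverage: delta(DT_i) is the set of concise frequent subtrees
# whose parent-child relationships the document tree DT_i preserves.
# Trees are encoded as frozensets of (parent_label, child_label) pairs.
def coverage(doc_tree, concise_subtrees):
    return [cf for cf in concise_subtrees if cf <= doc_tree]

dt = frozenset({("article", "title"), ("article", "author"), ("author", "name")})
cf1 = frozenset({("article", "title")})
cf2 = frozenset({("article", "author"), ("author", "name")})
cf3 = frozenset({("article", "journal")})   # relationship absent from dt

print(len(coverage(dt, [cf1, cf2, cf3])))   # 2: cf1 and cf2 cover dt
```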

In the case of closed frequent subtrees, both induced and embedded, there may exist some overlapping subtrees in the coverage of the document trees; these are defined below.

Definition: Overlapping subtrees

Let there be two closed frequent subtrees CFg, CFh ∈ δ(DTi). These two closed frequent subtrees are called overlapping subtrees in δ(DTi) iff (1) CFg ⊂t CFh, with either an induced or an embedded relationship, and (2) supp(CFg) = α and supp(CFh) = β, where α ≠ β.

Let there be a document tree dataset DT on which applying frequent subtree mining with a given support threshold s results in the frequent subtree mining result set O = {CF′1, CF′2, CF′3}, such that CF′2 ⊃t CF′1 and CF′2 ⊃t CF′3, where the three frequent subtrees have supports of s, s and s + 1 respectively. Based on the closure property, the closed subtrees will be CF′2 and CF′3, as there is a difference in the support values of CF′2 and CF′3. By maximality, however, there is only one maximal subtree, CF′2, since it has no frequent supertree. If there is a document tree DTi such that DTi ⊃t CF′2 and DTi ⊃t CF′3, then similar content would have to be extracted for both CF′2 and CF′3. In order to avoid this redundancy in content and to improve the efficiency of extraction, the overlapping subtrees are identified and only the supertrees are retained. In this situation, CF′2 is retained but CF′3 is removed, because the content of CF′2 includes the content of CF′3.

It should also be noted that overlapping subtrees do not occur among maximal subtrees, since the maximal subtrees capture only the supertrees, as shown in the example. Overlapping subtrees occur due to the presence of closed frequent subtrees having different supports and subtree relationships with each other. If such overlapping subtrees exist, the subtree CFg is removed from δ(DTi). This removal is conducted to avoid redundancy, as the overlapping subtrees convey the same information for the given document. Thus, the removal process improves computational efficiency, and no disadvantage is expected from removing the overlapping subtrees.
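With the same simplified pair-set encoding of trees (an assumption for illustration), the removal step amounts to discarding every covered subtree that is properly contained in another covered subtree with a different support, so only the supertrees survive.

```python
# Sketch: drop overlapping subtrees from delta(DT_i), retaining supertrees.
# Trees are frozensets of (parent, child) label pairs; `supports` maps each
# subtree to its support count. Encoding and data are illustrative only.
def remove_overlaps(delta, supports):
    kept = []
    for cf_g in delta:
        overlapped = any(
            cf_g < cf_h and supports[cf_g] != supports[cf_h]
            for cf_h in delta                 # proper subtree with differing support
        )
        if not overlapped:
            kept.append(cf_g)
    return kept

cf_super = frozenset({("a", "b"), ("a", "c")})    # supertree, support s
cf_sub = frozenset({("a", "b")})                  # its subtree, support s + 1
supports = {cf_super: 10, cf_sub: 11}
print(remove_overlaps([cf_super, cf_sub], supports) == [cf_super])   # True
```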

The next task in this method involves extracting the document content according to

its coverage. This document content is called the structure-constrained content. The

structure-constrained content contained within subtrees according to the coverage of a

DTi, δ(DTi), is defined below.

Definition: Structure-Constrained content features of an XML document according to its coverage

The structure-constrained content features C(Di, CFj′) of a given CFj′ ∈ δ(DTi) for an XML document Di are the collection of node values corresponding to the node labels in CFj′ of Di. Further, the structure-constrained content features C(Di, δ(DTi)) of the set of concise frequent subtrees in δ(DTi) that represent an XML document Di are the collection of node values corresponding to the node labels in its coverage δ(DTi).

The structure-constrained content of a CFj′j in Di ∈ D is retrieved from the XML

151

Page 176: Enriching XML Documents Clustering by using Concise ... · Sangeetha Kutty MCIS. (Auckland University of Technology, New Zealand), B.Eng. (University of Madras, India) Thesis submitted

document Di. The sum of occurrences of the terms in the structure-constrained content

features is computed according to the coverage of the document. Given the coverage

of the document DTi, δ(DTi) = {CF1, . . . , CFj′}, the structure-constrained content of δ(DTi) is a collection of node values corresponding to δ(DTi), given by C(Di, δ(DTi)) = {C(Di, CF1), . . . , C(Di, CFj′)}, where each C(Di, CFj) ∈ C(Di, δ(DTi)) is C(Di, CFj) = {C(N1), . . . , C(Nm)} with m nodes, and C(Nm′) is the node value of node Nm′ where m′ ≤ m. The node value for a node (or tag), C(Nm′), in Di is a vector of terms {t1, . . . , tk′} that the node

contains. Each of the terms is obtained after pre-processing the node values. The pre-processing steps involved in obtaining the terms will be discussed in the following subsection 5.3.2. For a given term tk′ in C(Di, CFj′) = {t1, . . . , tk′}, the sum of the occurrences of the term tk′ in all concise frequent subtrees of δ(DTi) is computed using:

ς(tk′) = Σ_{j=1}^{j′} tk′(CFj)   (5.1)

The resulting vector, called the Pre-Cluster Form (PCF), containing the sum of occurrences of all the terms for δ(DTi), is generated for each document Di in the collection

D. All these PCF s are combined in a matrix form called the Intermediate Cluster Form

(ICF ), which is essentially a VSM. ICF is a matrix of the form D×T where D represents

the documents, T represents the terms that are present in all δ(DTi), and the value of the

matrix cell is represented by ς(tk′), the sum of the occurrences of a term in its vector.
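The construction of the PCF vectors and their combination into the ICF matrix can be sketched as follows. This is a minimal illustration only: the documents, terms and counts are hypothetical, not taken from the thesis datasets.

```python
import numpy as np

# Each document's Pre-Cluster Form (PCF) maps a term to the summed occurrences
# of that term over the concise frequent subtrees in the document's coverage.
pcfs = [
    {"origin": 1, "species": 1, "darwin": 2},   # hypothetical document D1
    {"darwin": 1, "murray": 3},                 # hypothetical document D2
]

# The Intermediate Cluster Form (ICF) stacks the PCFs into a D x T matrix.
terms = sorted({t for pcf in pcfs for t in pcf})     # global term list T
index = {t: k for k, t in enumerate(terms)}
icf = np.zeros((len(pcfs), len(terms)))
for i, pcf in enumerate(pcfs):
    for t, count in pcf.items():
        icf[i, index[t]] = count
```

Each row of `icf` is one document's PCF expanded over the global term list, which is exactly the D × T form described above.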

As discussed before, in order to obtain the term from the node values the pre-processing

steps in the following subsection are applied.


5.3.2 Pre-processing of the structure-constrained content of XML documents

Similar to the pre-processing of structure (as discussed in Section 4.2), the pre-processing

phase of the structure-constrained content of XML documents involves four stages:

1. Stop-word removal

2. Stemming

3. Integer removal

4. Shorter length words removal

Stop-word removal

Stop words are words that are considered poor index terms and hence need to be removed prior to performing data analysis. Traditionally, stop words consist of terms which occur very frequently, such as ‘the’, ‘of’ and ‘and’. In order to remove these stop words, a stop list or stop word list is generated and used to filter out words that make poor index terms [92].

The most common stop list available for English text is that of Christopher Fox, which contains 421 words [39]. Fox’s stop list includes variants of the stop words, such as

the word ‘group’ with its variants: ‘groups’, ‘grouped’ and ‘grouping’. It should be noted

that not all the common words can be considered as stop words. For example, ‘not’ is a

common word but it represents negation so if it is removed the meaning of the sentence

is changed. Hence, care should be taken in choosing the common stop list.

153

Page 178: Enriching XML Documents Clustering by using Concise ... · Sangeetha Kutty MCIS. (Auckland University of Technology, New Zealand), B.Eng. (University of Madras, India) Thesis submitted

Instead of choosing a common stop list, a list that suits the dataset under investigation

should be considered. For example, the use of a common stop list causes the removal of the

word ‘back’ even though ‘back’ (a part of body) is a useful term in the medical domain. It is

therefore essential to customise the stop list by considering the domain-specific knowledge

in order to avoid removing important words for specific domains. In this research, the

stop word list has been customised by considering the tag names and the content of XML

documents for each relevant dataset.

Stemming

Stemming is the process of removing affixes (suffixes or prefixes) from words and/or replacing them with the root word. For example, the word ‘students’ becomes

‘student’ and the word ‘says’ becomes ‘say’. Several well-known stemming algorithms

have been developed [47, 74, 89, 91]. The strength and similarity of different stemming

algorithms have been evaluated in [40]. The stemming process not only reduces the variety

of words, and thereby the storage size, but also increases the performance of information

retrieval systems [15]. Studies have also demonstrated that stemming enhances the recall

aspect, a common measure in information retrieval [93]. This research uses the Porter

stemming algorithm [91] for affix removal. The major reasons for using this algorithm are

its simplicity [121] and the effectiveness of the results it produces.

Integer removal

Due to the huge size of some of the datasets in this research (the INEX IEEE, INEX 2007 and INEX 2009 datasets), a very large number of unique terms occur. Hence, it is essential to reduce the dimension of the dataset without incurring information loss. A careful analysis of the datasets revealed that they contained a large number of integers that did not contribute to the semantics of the documents, so these were removed in

the pre-processing step.


Shorter length words removal

Based on the analysis of the datasets involved in this research, words with fewer than 4 characters were considered meaningless and were therefore removed.
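The four pre-processing stages above can be sketched as a single pipeline. Note the assumptions: the stop list here is a toy stand-in for the customised, domain-specific lists used in the thesis, and `toy_stem` is a crude suffix stripper, not the Porter algorithm actually used.

```python
STOP_WORDS = {"the", "of", "and", "that"}      # toy stop list, not the real one

def toy_stem(word):
    # Crude suffix stripping; the real Porter stemmer [91] is far more careful.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(tokens):
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:          # 1. stop-word removal
            continue
        if tok.isdigit():              # 3. integer removal
            continue
        tok = toy_stem(tok)            # 2. stemming
        if len(tok) < 4:               # 4. shorter-length word removal
            continue
        out.append(tok)
    return out
```

For example, `preprocess(["The", "students", "says", "1859", "grouping"])` keeps only the stemmed terms "student" and "group": the stop word, the integer and the short stemmed form "say" are all filtered out.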

5.3.3 Representation of the structure-constrained content in ICF

After the pre-processing of the extracted content, the content is represented in a sparse

VSM representation which retains only the non-zero values along with the term id (indicated in bold in Figure 5.3). This representation improves computational efficiency,

especially for sparse datasets where the number of non-zeroes is less than the number of

zeroes.

d1: (1, 1) (4, 2) (5, 6)
d2: (1, 3) (2, 1) (4, 2)
d3: (1, 1) (2, 7) (3, 3) (4, 2) (5, 1) (6, 1)

Figure 5.3: Sparse representation of an XML dataset modelled in VSM using their term frequencies
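The sparse representation of Figure 5.3 can be sketched as follows, reading each document as a list of (term id, frequency) pairs for its non-zero entries; the pairing of the figure's values is an interpretation of its layout.

```python
# Sparse VSM: each document stores only (term_id, frequency) pairs, which
# saves space when most of the D x T matrix cells are zero.
sparse_vsm = {
    "d1": [(1, 1), (4, 2), (5, 6)],
    "d2": [(1, 3), (2, 1), (4, 2)],
    "d3": [(1, 1), (2, 7), (3, 3), (4, 2), (5, 1), (6, 1)],
}

def to_dense(pairs, num_terms):
    # Expand one document's sparse pairs into a full frequency row.
    row = [0] * num_terms
    for term_id, freq in pairs:
        row[term_id - 1] = freq      # term ids are 1-based in the figure
    return row
```

Expanding `d1` over six terms gives the dense row `[1, 0, 0, 2, 6, 0]`, showing how the zeros are implicit in the sparse form.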

The use of raw term frequency in the VSM suffers from the major issue that all terms are considered equally important, even though some terms have little or no discriminating power when used for clustering the documents. Consider

a document dataset from a publication domain in which the term “author” occurs in

almost every document. In order to reduce the effect of such frequently occurring terms,

weights are applied on the terms and these term weights are used instead of their raw

term frequencies in representing the terms in the matrix. Various schemes exist that are

used to compute the weights of the terms in the VSM. However, this research uses two

schemes, term frequency-inverse document frequency (tf-idf ) and Okapi BM-25, in order

to weight the terms in documents.


Term weighting

The most popular term weighting scheme is the term frequency-inverse document

frequency (tf-idf ) weighting. The weight vector for a given document Di is w(Di) =

{w1,i, w2,i, . . . , wn,i}, where wtk′,i is the weight of the term tk′ and can be calculated as:

wtk′,i = tft × log(|D| / |d : tk′ ∈ d|)   (5.2)

where tft is the term frequency of the given term tk′ in document Di and log(|D| / |d : tk′ ∈ d|) is the inverse document frequency (idf). |D| is the total number of documents in the XML dataset; |d : tk′ ∈ d| is the number of documents containing the term tk′. The idf of a rare term is high and that of a frequent term is low.
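Equation 5.2 can be sketched directly; the inputs here are illustrative numbers, not values from the thesis datasets.

```python
import math

def tf_idf(tf, df, num_docs):
    """tf-idf weight of Equation 5.2: tf * log(|D| / df)."""
    return tf * math.log(num_docs / df)
```

A term occurring in every document gets weight zero (log 1 = 0), while a rare term with the same frequency gets a higher weight, which is the discriminating behaviour described above.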

Another popular weighting scheme is the Okapi BM-25, which utilises concepts similar to those of tf-idf. However, this weighting scheme has two tuning parameters,

K1 and b, which influence the effect of term frequency and document length respectively. The default values are K1 = 2 and b = 0.75. BM-25 weighting depends on three parameters: the Collection Frequency Weight (CFW), term frequency (tft, defined before) and

Normalized Document Length (NDL).

The Collection Frequency Weight (CFW) for the term tk′ is:

CFW = log |D| − log(|d : tk′ ∈ d|)   (5.3)

The Normalized Document Length for the given document Di is the ratio of the length

of the document Di over the average length of a document in D.

NDL(Di) = DL(Di) / avg(DL(D))   (5.4)


where DL(Di) is the length of a document Di and avg(DL(D)) is the average length of a

document in D.

Combining these three parameters, the BM25 weighting for a given term tk′ is given by:

Bf = (CFW × tft × (K1 + 1)) / (K1 × ((1 − b) + b × NDL(Di)) + tft)   (5.5)
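Equations 5.3 to 5.5 combine into one weighting function, sketched here with the default tuning parameters K1 = 2 and b = 0.75 mentioned in the text; the example inputs are illustrative only.

```python
import math

def bm25_weight(tf, df, num_docs, doc_len, avg_doc_len, k1=2.0, b=0.75):
    cfw = math.log(num_docs) - math.log(df)          # Eq. 5.3
    ndl = doc_len / avg_doc_len                      # Eq. 5.4
    return cfw * tf * (k1 + 1) / (k1 * ((1 - b) + b * ndl) + tf)  # Eq. 5.5
```

As with tf-idf, a term present in every document gets weight zero via the CFW factor, while b controls how strongly longer-than-average documents are penalised.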

5.3.4 Similarity measures

Once the terms in the structure-constrained content are represented in the ICF , which

is a VSM, a clustering method is applied on the ICF to generate the required number of

clusters. The similarity between each pair of PCF s in the ICF is computed. Let there

be two vectors of terms, di and dj, in the given ICF matrix for two documents Di and

Dj respectively. The similarity between the two vectors, di and dj, is computed using the

cosine similarity function,

cos θ = (di · dj) / (|di| |dj|)   (5.6)
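The cosine similarity of Equation 5.6 can be sketched as follows for two dense term vectors; the vectors used here are illustrative.

```python
import math

def cosine(di, dj):
    """Cosine similarity of Equation 5.6 between two document vectors."""
    dot = sum(a * b for a, b in zip(di, dj))
    norm_i = math.sqrt(sum(a * a for a in di))
    norm_j = math.sqrt(sum(b * b for b in dj))
    return dot / (norm_i * norm_j)
```

Identical directions give similarity 1 and orthogonal term vectors give 0, which is why documents sharing many weighted terms are placed in the same cluster.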

The repeated bisection partitional clustering method is used in this thesis [59]. This

method divides the ICF into two groups, then selects one of the groups according to a clustering criterion function and bisects it further. This process is repeated until the desired

number of clusters is achieved. During each step of bisection, the cluster is bisected so

that the resulting 2-way clustering solution locally optimises a particular criterion function

[59].
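The repeated-bisection procedure can be sketched as follows. This is a deliberate simplification of the method in [59]: the cluster to bisect is chosen simply by size and each bisection is a plain 2-means step, rather than the criterion-driven selection used in the thesis.

```python
import numpy as np

def two_means(X, iters=20, seed=0):
    # Plain 2-means bisection step (a simplification of the criterion-driven
    # bisection of [59]).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), 2, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in (0, 1):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def repeated_bisection(X, k):
    # Repeatedly bisect the largest remaining cluster until k clusters exist.
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        labels = two_means(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters
```

On two well-separated pairs of points, the first bisection recovers the two natural groups; real usage would run this on the rows of the ICF.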


5.4 Using the Tensor Space Model (TSM)

Using the ICF in the VSM to represent the structure and the content of XML documents

implicitly might not be enough to model both the structure and the content features of

XML documents effectively. This is due to the loss of direct or explicit mapping between

the structure and its corresponding content. Figure 5.4 shows that, for a given document (Figure 5.4(a)) and a set of concise frequent subtrees CF1 and CF2, representing the content in the VSM using HCX-V captures the structure implicitly; however, the explicit relationship

between these two features is lost. For instance, if there is one document in the collection

which contains “John Murray” as the publisher name and another document in the same

collection that contains “John Murray” as the author name, the similarity of the terms in

the VSM could put these two documents together; however, the structure of the documents

makes their context different.

[Figure content: panel (a) shows the sample XML document reproduced below; panel (b) shows the concise frequent subtrees CF1 (Book with Title and Author/Name) and CF2 (Publisher with Name and Place); panels (c) and (d) show the documents-by-terms VSM and the documents-by-subtrees-by-terms TSM built from them.]

<Book Id="B105">
  <Title>On the Origin of Species</Title>
  <Author><Name>Charles Darwin</Name></Author>
  <Publisher><Name>John Murray</Name><Place>London</Place></Publisher>
  <Year>1859</Year>
</Book>

Figure 5.4: Comparison of VSM and TSM: (a) sample XML document; (b) concise frequent subtrees; (c) Vector Space Model (VSM) for (a) and (b) using HCX-V; and (d) Tensor Space Model (TSM) for (a) and (b).

Thus, the content and the structure features inherent in an XML document should be

modelled in a manner that ensures that the mapping between the subtree structure and its content


could be preserved and used in further analysis. Hence in this section, a novel method

of representing the XML documents in the Tensor Space Model (TSM) and utilising the

TSM for clustering is proposed. In the TSM, the content corresponding to its structure

is stored, which helps to analyse the relationship between the structure and the content.

By utilising the TSM for clustering, not only the intrinsic properties of the structure and

the content features but also the relations between these two features are used.

5.4.1 Background

Firstly, the preliminaries of tensors are provided. Tensor notations and conventions used

in this research are akin to the notations used by previous works [37, 49, 57, 112]. Table

5.1 provides the tensor notations.

Table 5.1: Tensor notations and descriptions

Notation | Definition and description
a, b | Scalars
a, b | Vectors
A, B | Matrices
Aij | Element (i, j) in a matrix A
T | Tensor, shown using calligraphic font
Tijk | An entry in tensor T
×n | n-mode product
◦ | Vector outer product
N | Number of orders (also modes, ways or ranks)

5.4.1.1 Tensor concepts

The tensor concepts will be defined and described in this subsection.

Definition: Tensor

A tensor T is a multi-mode array. The mode of a tensor is the number of dimensions,

also known as orders or ways (used interchangeably in this thesis). Figure 5.5 compares

the mode-1 (vector), mode-2 (matrix) and mode-3 tensors.


Mode/Order/Way: 1 corresponds to a vector, 2 to a matrix and 3 to a tensor. [Figure content: examples show a term vector for one document, a Documents × Terms matrix, and a Documents × CFS × Terms mode-3 tensor.]

Figure 5.5: Comparison of vector, matrix and tensor

Definition: Norm of a Tensor

The norm of a mode-N tensor T ∈ R^(I1 × I2 × . . . × IN) is defined as the square root of the sum of the squared entries t in the tensor T:

‖T‖ = √( Σ_{i1=1}^{I1} Σ_{i2=1}^{I2} . . . Σ_{iN=1}^{IN} t²_{i1 i2 . . . iN} )   (5.7)
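For an array representation, Equation 5.7 is just the square root of the sum of all squared entries, as this small numpy sketch shows (the tensor here is an arbitrary example).

```python
import numpy as np

# Norm of a mode-3 tensor per Equation 5.7: sqrt of the sum of squared entries.
T = np.arange(24, dtype=float).reshape(3, 2, 4)   # small example tensor
norm = np.sqrt((T ** 2).sum())
```

This agrees with numpy's flattened 2-norm of the same array, since both compute the identical sum of squares.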

Definition: Tensor Fiber

A tensor fiber is a one-dimensional fragment of a tensor, obtained by fixing all indices

but one. These fibers are nothing but the higher-order or higher-mode analogue of rows

and columns, as shown in Figure 5.6. A column vector is a mode-1 or column fiber,

denoted by t:jk. A row vector is a mode-2 or row fiber, denoted as ti:k. Finally, the tube

vector is a mode-3 or tube fiber, denoted by tij:.

(a) Column fibers; (b) Row fibers; (c) Tube fibers

Figure 5.6: Fibers of a mode-3 tensor


Definition: Tensor Slice

A tensor slice is a two-dimensional fragment of a tensor, obtained by fixing all indices

but two. The three types of slices of a mode-3 tensor T , horizontal, lateral, and frontal,

are denoted by Ti::, T:j:, and T::k respectively, as shown in Figure 5.7.


(a) Horizontal Slices (b) Lateral Slices (c) Frontal Slices

Figure 5.7: Slices of a mode-3 tensor
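Fibers and slices map directly onto numpy indexing, as sketched below on an arbitrary example tensor: fixing all indices but one gives a fiber, and fixing all but two gives a slice.

```python
import numpy as np

T = np.arange(24).reshape(2, 3, 4)     # a 2 x 3 x 4 mode-3 tensor

# Fibers: fix all indices but one.
column_fiber = T[:, 1, 2]              # t_{:jk}, mode-1 (column) fiber
row_fiber    = T[0, :, 2]              # t_{i:k}, mode-2 (row) fiber
tube_fiber   = T[0, 1, :]              # t_{ij:}, mode-3 (tube) fiber

# Slices: fix all indices but two.
horizontal_slice = T[0, :, :]          # T_{i::}
lateral_slice    = T[:, 1, :]          # T_{:j:}
frontal_slice    = T[:, :, 2]          # T_{::k}
```

The shapes confirm the definitions: each fiber is one-dimensional and each slice is a two-dimensional fragment of T.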

5.4.1.2 Tensor operations

The two main types of tensor operations related to this research will be discussed here.

They are:

- Matricization

- n-mode (matrix) product

Matricization

For some computations, it is necessary to treat the entire tensor in matrix form. To do

this, the process of matricization is applied by rearranging the elements of a tensor into

a matrix, as shown in Figure 5.8. Essentially, for mode-1 matricization this means that the mode-1 fibers of T become the columns of the matrix T(1), with modes 2 and 3 combined to index those columns.

There are several ways [61, 63, 70] of ordering the columns for matricization but this

ordering is not important as long as the same ordering is retained in further calculations


Figure 5.8: Mode-1 matricization of a mode-3 tensor (T to T(1))

[63]. This process is also referred to as “unfolding” or “flattening”. There are three ways

of matricizing a mode-3 tensor [70], as shown in Figure 5.9.

[Figure content: a mode-3 tensor T with frontal slices [a b c; d e f; g h i; j k l] and [m n o; p q r; s t u; v w x], and its three unfoldings T(1), T(2) and T(3).]

Figure 5.9: mode-n matricization
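Mode-n matricization can be sketched with numpy: moving the chosen mode to the front and reshaping sets the mode-n fibers as the columns of the result. The column ordering here follows numpy's row-major layout, which is one of the several equivalent orderings noted in the text.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n matricization: mode-n fibers become columns of the matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

T = np.arange(24).reshape(2, 3, 4)   # example mode-3 tensor
T1 = unfold(T, 0)                    # 2 x 12
T2 = unfold(T, 1)                    # 3 x 8
T3 = unfold(T, 2)                    # 4 x 6
```

Each unfolding has as many rows as the chosen mode and collects the remaining modes into its columns, matching the three unfoldings of Figure 5.9.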

n-mode (matrix) product

The n-mode (matrix) product of a tensor T ∈ R^(I1 × I2 × . . . × IN) with a matrix X ∈ R^(J × In), denoted by T ×n X, results in a tensor of size I1 × I2 × . . . × In−1 × J × In+1 × . . . × IN. Essentially, this means that each mode-n fiber is multiplied by the matrix X. Element-wise it is given by:

(T ×n X)_{i1 . . . in−1 j in+1 . . . iN} = Σ_{in=1}^{In} t_{i1 i2 . . . iN} x_{j in}   (5.8)


The n-mode (matrix) product of a tensor T with the matrix X is equivalent to multiplying X by the appropriate flattening of T, which is expressed as:

Y = T ×n X ⇔ Y(n) = X T(n)   (5.9)

Some interesting points to note are:

- if the modes of multiplication are different, then the order of multiplication is irrelevant, as shown in the equation below:

T ×m P ×n Q = T ×n Q ×m P if (m ≠ n)   (5.10)

- if a tensor T is multiplied by two matrices along the same mode n, then the following equation holds:

T ×n P ×n Q = T ×n (QP)   (5.11)
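The n-mode product can be sketched via the unfolding identity of Equation 5.9, Y(n) = X T(n); the `unfold`/`fold` helpers below use numpy's row-major column ordering, and the example values are arbitrary.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    # Inverse of unfold for a target tensor of the given shape.
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(full), 0, mode)

def mode_n_product(T, X, mode):
    """T x_n X implemented as Y(n) = X T(n), then folded back (Eq. 5.9)."""
    shape = list(T.shape)
    shape[mode] = X.shape[0]
    return fold(X @ unfold(T, mode), mode, shape)

T = np.arange(24).reshape(2, 3, 4)
X = np.ones((5, 3))                     # maps mode 1 from size 3 to size 5
Y = mode_n_product(T, X, 1)             # shape (2, 5, 4)
```

With an all-ones X, each output entry is the sum over the original mode-1 fiber, which matches the element-wise definition of Equation 5.8.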

5.4.1.3 Tensor decomposition techniques

In order to analyse the tensors, decomposition techniques are applied in a manner similar

to SVD. SVD is a well-known factorisation given by

X = UΣVT (5.12)

SVD decomposes a matrix into a sum of mode-1 matrices. In other words, the matrix

X ∈ RI×J can also be expressed as a minimal sum of mode-1 matrices:


X = σ1(u1 ◦ v1) + σ2(u2 ◦ v2) + . . . + σr(ur ◦ vr)   (5.13)

where ui ∈ R^I, vi ∈ R^J and i = 1, 2, . . . , r. Here, ui and vi are the ith columns of U and V respectively. The numbers σi on the diagonal of the diagonal matrix Σ are the singular values of X, where r is the mode or rank of the matrix X.

Two of the main applications of decompositions are Principal Component Analysis

(PCA) and Latent Semantic Indexing (LSI). Extending SVD to higher-mode tensors is

complicated, since the mode concept for tensors becomes indistinct.

In essence, the purpose of tensor decomposition is to rewrite the tensor as a sum of

mode-1 tensors. For a tensor T ∈ RI×J×K, it could be expressed as:

T = (u1 ◦ v1 ◦ w1) + (u2 ◦ v2 ◦ w2) + . . . + (ur ◦ vr ◦ wr)   (5.14)

where ui ∈ RI , vi ∈ RJ , and wi ∈ RK and i = 1, 2, . . . , r.

The minimum representation for the tensor SVD is not always orthogonal, which implies that the vectors ui, vj, wk do not necessarily form orthonormal sets. For this reason,

tensor decomposition has no orthogonality constraint imposed on these vectors.

Tensor decompositions enable an overview of the relationships that can be further

used in clustering. There are several tensor decomposition techniques, amongst which the

most popular are the CANDECOMP/PARAFAC (CP) [61] and Tucker [113] decomposi-

tions. CP decomposes a tensor as a sum of rank-one tensors (or vectors), and the Tucker

decomposition is the higher-order form of principal component analysis [63].


CP decomposition of a tensor T is given by

T ≈ Σ_{r=1}^{m} ar ◦ br ◦ cr   (5.15)

where m is a positive integer, ◦ represents vector outer product (which means that each

element of the tensor is the product of its corresponding vector elements) and ar ∈ RI ,

br ∈ RJ , and cr ∈ RK .

Tucker decomposes a tensor into a core tensor multiplied (or transformed) by a matrix

along each mode. Hence, in the three-way case where T ∈ RI×J×K, it becomes

T ≈ Y ×1 A ×2 B ×3 C = Σ_{p=1}^{P} Σ_{q=1}^{Q} Σ_{r=1}^{R} g_{pqr} (ap ◦ bq ◦ cr)   (5.16)

In this equation, A ∈ R^(I×P), B ∈ R^(J×Q), and C ∈ R^(K×R) are the factor matrices

(which are usually orthogonal). These factor matrices are the principal components in

each mode. The tensor Y ∈ RP×Q×R is called the core tensor and its entries show the

level of interaction between the different components [63].
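The Tucker form of Equation 5.16 can be sketched by multiplying a small random core tensor by a factor matrix along each mode; the sizes and random values here are arbitrary, and the n-mode product is implemented via unfolding as before.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(full), 0, mode)

def mode_n_product(T, X, mode):
    shape = list(T.shape)
    shape[mode] = X.shape[0]
    return fold(X @ unfold(T, mode), mode, shape)

rng = np.random.default_rng(0)
core = rng.standard_normal((2, 2, 2))               # core tensor Y
A, B, C = (rng.standard_normal((4, 2)) for _ in range(3))
T_hat = mode_n_product(mode_n_product(mode_n_product(core, A, 0), B, 1), C, 2)
```

Each entry of `T_hat` equals the triple sum of Equation 5.16, Σp Σq Σr g_pqr a_p b_q c_r, which is easy to verify element-wise for this small example.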

5.4.2 Modelling in tensor space – An overview

This subsection looks at modelling the XML documents in the TSM. Given the documents

set D, its corresponding set of Concise Frequent (CF ) subtrees and the set of terms for

each CF subtree (T ); the collection of XML documents is now represented as a mode-3

tensor T ∈ RD×CF×T as shown in Figure 5.10. The tensor is populated with the number

of occurrences of the structure-constrained terms tk ∈ {t1, . . . , tK} corresponding to the

CFj ∈ {CF1, . . . , CFJ} for a document Di ∈ {D1, . . . ,Dn}.

As mentioned in Chapter 2, TSM has a critical problem: a TSM is not scalable for very


[Figure content: a Documents × CFSC × Terms mode-3 tensor, with axes D1 . . . Dn, CFS1 . . . CFSJ and t1 . . . tK, populated with structure-constrained term occurrences such as “Origin Species Charles Darwin” and “John Murray London”.]

Figure 5.10: Visualisation of a mode-3 tensor for the XML document dataset

large and dense datasets, since capturing all the terms corresponding to concise frequent

subtrees results in a very large-sized tensor. Hence, to alleviate the problem of scalability,

two optimisation techniques are applied on the dimensions, CF and T , to reduce the size

of the tensor.

Figure 5.11 provides an overview of the HCX-T method. It begins by grouping the concise frequent subtrees, generated using one of the proposed concise frequent subtree mining methods, into structural clusters. The use of structural clusters helps in grouping

similar concise frequent subtrees together. These structural clusters are then used to extract the content features from the documents.

are obtained for each document, the documents are represented in the TSM along with

their structure and content features. The next task is to decompose the created TSM to

obtain the factorised matrix Uη. Lastly, the K-means algorithm or a partitional clustering algorithm is applied on Uη, the left singular matrix for the “Document” dimension, to obtain the clusters of documents.


Input: XML Document Dataset D, Document Tree Dataset DT, Number of Clusters c, RI Vectors Length γ, Concise Frequent Subtrees CF = {CF1, CF2, . . . , CFj} and Number of Required Dimensions η
Output: Clusters {Clust1, . . . , Clustc}

1. Form clusters of similar concise frequent subtrees in CF, CFSC = {CFSC1, . . . , CFSCh}, h ≪ j, using the CLOPE algorithm;
2. for every document Di ∈ D do
       Identify the CFSC existing in the document tree DTi ∈ DT: δ(DTi) = {CFSCl, . . . , CFSCh};
       for every CFSCj in δ(DTi) do
           retrieve the structure-constrained content in Di, C(Di, CFSCj) = {C(N1), . . . , C(Nm)}, where the term set C(Nm) = {t1, . . . , tk} ∈ T and T is the term list in D;
       end
   end
3. Apply random indexing with random vectors of length γ on the terms collection to reduce the term space to T′;
4. Form a tensor T ∈ R^(D×CFSC×T′), where each tensor element is the number of times a term tk occurs in CFSCj for a given document Di, as represented in C(Di, CFSCj);
5. Apply the proposed tensor decomposition algorithm, PTCD, to the tensor T and get the resulting left singular matrix Uη for UD;
6. Apply clustering on Uη to generate the c clusters;

Figure 5.11: High level definition of HCX-T approach


5.4.3 Generation of structure features for TSM

One of the frequent pattern mining methods proposed in Chapter 4 is used to generate concise frequent subtrees. However, as stated in [126],

a small change in the support threshold (particularly on the lower support threshold)

may generate hundreds of concise frequent patterns that cannot be pruned by employing

concise frequent pattern mining based only on the nature of the datasets. Capturing the

content with all the concise frequent subtrees and representing them in the TSM will be

more expensive than using the VSM, since in HCX-V the concise frequent subtrees for

each document are joined in its coverage, based on the structure of the document. The

content corresponding to the joined subtrees in that document is retrieved and represented

in the VSM.

Moreover, it can be seen from Figure 5.11 that checking whether the mined CF subtrees

exist in a given document tree or not is a computationally expensive operation. This

problem arises due to the graph isomorphism problem. This step can be optimised by using

a group of similar subtrees based on the similarity of the subtrees, and then retrieving the

content corresponding only to the group of similar CF , instead of comparing each CF

tree against all documents. This step helps to reduce the computational complexity of

checking whether the CF subtrees are present in a given document tree.

The CLOPE [130] algorithm for clustering transactional data has been modified to

include subtrees rather than items in order to conduct the substructure clustering of the CF trees based on the similarity of the subtrees. The cluster of CF subtrees, the Concise Frequent Subtree Cluster (CFSC), becomes a tensor dimension for representing and analysing XML documents. Let CFSC be a set of clusters of CF subtrees, where a cluster CFSCj is given by {CFp, CFq, CFr} with CFp, CFq, CFr ∈ CF.


5.4.4 Generation of content features for TSM

The Concise Frequent Subtree Cluster, CFSC, representing the structure of the XML doc-

uments is used in retrieving the structure-constrained content from the XML documents.

The coverage of a CFSCj ∈ CFSC and its constrained content for the given document

Di will now be defined. Compared with the content features of an XML document, the

structure-constrained content features include the node values corresponding only to the

node labels of CF subtrees that form CFSCj.

Definition 1: Structure-Constrained content features of an XML document

according to the CFSC that covers it.

The structure-constrained content features of a given CFSCj, C(Di, CFSCj) of an

XML document Di, are a collection of node values corresponding to the node labels in the

CFSCj where CFSCj is a cluster of CF subtrees corresponding to DTi.

The node value of a node (or tag) of a CFSCj ∈ CFSC, C(Ni), in Di is a vector of terms {t1, . . . , tk} that the node contains. These terms are obtained after stop-word removal and stemming. Firstly, the CF subtrees corresponding to the CFSCj =

{CFr, . . . , CFs} for a given document Di are flattened into their nodes {N1, . . . , Nm} ∈ N ,

where N is the list of nodes in DT. Then the node values of {N1, . . . , Nm} are accumulated

and their occurrences for a document Di are recorded.

In large datasets, the number of terms in the structure-constrained content is very large in tensor space, as shown in Table 5.2, where the actual values are shown in parentheses and “M” denotes millions of entries. These are the numbers of terms obtained even after stop-word removal and stemming.


Table 5.2: Summary of the term size and tensor entries in INEX 2009 and INEXIEEE datasets

Datasets | Term Size | # Non-zero Tensor entries
INEX 2009 | ≈ 1M (1,026,857) | ≈ 127M (127,961,025)
INEX IEEE | ≈ 0.18M (176,407) | ≈ 14M (14,053,618)

Random Indexing

To reduce this very large term space, the popular Random Indexing (RI) technique is applied. RI techniques have been favoured by many researchers due to their simplicity and low computational complexity [75] in comparison to other popular dimensionality reduction methods such as SVD and PCA, which are computationally expensive at about O(mn²) for an m × n matrix. In RI, each term in the

original space is given a randomly generated index vector as shown in Figure 5.12. These

index vectors are sparse in nature and have ternary values (0, -1 and 1). Sparsity of the

index vectors is controlled via a seed length that specifies the number of randomly selected

non-zero features.

Equation 5.17 proposed by Achlioptas [3] is used to generate the distribution for creat-

ing the random index vector for every term in the structure-constrained content of CFSC.

rij = √3 × { +1 with probability 1/6; 0 with probability 2/3; −1 with probability 1/6 }   (5.17)
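Generating an index vector under the distribution of Equation 5.17 can be sketched as follows: each entry is +√3, 0 or −√3 with probabilities 1/6, 2/3 and 1/6 (the √3 scaling from the equation makes the ternary values ±√3 rather than the bare ±1).

```python
import math
import random

def random_index_vector(length, seed=None):
    """One Achlioptas-style random index vector per Equation 5.17."""
    rng = random.Random(seed)
    vec = []
    for _ in range(length):
        u = rng.random()
        if u < 1 / 6:
            vec.append(math.sqrt(3))       # +sqrt(3) with probability 1/6
        elif u < 1 / 3:
            vec.append(-math.sqrt(3))      # -sqrt(3) with probability 1/6
        else:
            vec.append(0.0)                # 0 with probability 2/3
    return vec
```

About two thirds of the entries are zero, which is the sparsity property that makes the vectors cheap to store and fast to add, as discussed below.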

For a given document Di, the index vectors of length l for all the terms corresponding

to the given CFSCj are added. This concept of RI on a tensor can be explained using

Figure 5.12. Consider a tensor T ∈ R^(3×2×4) (in Figure 5.12(a)) with 3 documents, 2 CFSC, 4 terms and 7 non-zero entries. The entries in the tensor correspond to the occurrences


of a given term in the given CFSC for the document. Using Equation 5.17, the random

index vectors of length 6 for the 4 terms are generated as shown in Figure 5.12(b). Let us

consider the document D1 with three tensor entries a121 = 1, a123 = 1 and a124 = 1 that

correspond to CFSC2 and three terms Term1, Term3 and Term4. For the three terms

in D1 their random vectors (from Figure 5.12(b)) are added. The resulting vector (a12:)

for D1, given in Figure 5.12(c), contains two non-zero entries. The sparse representation

of the resulting vector is obtained by using only the non-zero entries in the vector, which

finally results in two tensor entries a123 = 1 and a124 = −1. Figure 5.12(d) shows the

final reduced tensor Tr in sparse representation containing 6 non-zero entries. From this

example, it can be seen that even for such a small dataset, the RI technique could reduce

the term space.

It can be seen that the number of entries in the randomly reduced T , Tr, is less than

its original T and that Tr maintains the shape of T as it retains the similarity that exists

between D2 and D3. The index vectors in RI are sparse; hence the vectors use less memory and can be added faster. The randomly-reduced structure-constrained content of

CFSC becomes another tensor mode for representing and analysing XML documents.

5.4.5 The TSM representation, decomposition and clustering

Given the tensor T , the next task is to find the hidden relationships between the dimen-

sions. A tensor decomposition algorithm enables an overview of the relationships that can

be further used in clustering. However, as already mentioned in Chapter 2, most of these

decomposition techniques cannot be applied to very large or dense tensors, as the tensors

cannot be loaded into memory. To alleviate this problem, the tensors need to be built and

unfolded or matricized incrementally. The process of matricization or unfolding along the

1-mode of T will result in a matrix T(1). This means that the mode-1 fibers (higher order


[Figure: (a) the tensor T with entries a121=1, a123=1, a124=1, a212=1, a213=2, a312=1, a313=1; (b) the random index vectors for Term1 to Term4; (c) the resulting vectors a12:, a21:, a31:; (d) the reduced tensor Tr with entries a123=1, a124=-1, a211=1, a212=-1, a311=1, a312=-1.]

Figure 5.12: Illustration of Random Indexing (RI) on a mode-3 tensor resulting in a randomly reduced tensor Tr.

analogue of rows and columns) are set as the columns of the resulting matrix.
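As a concrete illustration of matricization, the mode-m unfolding can be written with numpy as follows (a sketch following the common Kolda-Bader column ordering; the exact ordering is a convention and may differ from the thesis implementation):

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` matricization: the mode-`mode` fibers of `tensor`
    become the columns of the resulting matrix (Kolda-Bader ordering)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1, order="F")

# A 3 x 2 x 4 tensor: 3 documents, 2 CFSCs, 4 terms.
T = np.arange(24).reshape(3, 2, 4)
T1 = unfold(T, 0)   # shape (3, 8); each column is a mode-1 fiber
```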

Now the proposed Progressive Tensor Creation and Decomposition (PTCD) shown in

Figure 5.13 is applied to progressively build and then decompose the tensor using Singular

Value Decomposition (SVD). PTCD takes as input the tensor data file, TF , that contains

the entries to build the tensor. The input also includes the size of the block that is used

to build the tensor and the number of modes in the given tensor.

Input: Tensor data file TF; block size b; number of modes (orders) M, where m ∈ {1, 2, ..., M}; number of required dimensions η
Output: Matricized tensors T(m) and the left singular matrix with η dimensions for mode-1, Uη

begin
  1. for every T(m) ∈ {T(1), T(2), ..., T(M)} do
        Initialize T(m) = φ;
     end
  2. Divide TF into blocks of size b;
  3. for every block b do
        Create tensor Tb;
        for m = 1 to M do
           // Matricize the tensor
           T′(m) = Unfold Tb along its m-th mode;
           // Update the mode-m matricized tensor
           T(m) = T(m) + T′(m);
        end
     end
  4. Compute SVD on T(1) with η dimensions: Tη(1) = Uη Ση VηT;
end

Figure 5.13: Progressive Tensor Creation and Decomposition algorithm (PTCD)
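The steps of PTCD can be sketched as a simplified single-process Python routine using scipy sparse matrices and a truncated SVD (the entry format, the column-major column indexing and all helper names are illustrative assumptions, not the thesis implementation):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import svds

def ptcd(entries, shape, block_size, eta):
    """Accumulate the mode-m matricized tensors from blocks of
    (index_tuple, value) entries, then decompose the mode-1 matrix."""
    M = len(shape)
    # The mode-m unfolding has prod(shape)/shape[m] columns.
    cols = [int(np.prod(shape)) // shape[m] for m in range(M)]
    T = [coo_matrix((shape[m], cols[m])).tocsr() for m in range(M)]

    for start in range(0, len(entries), block_size):
        block = entries[start:start + block_size]
        for m in range(M):
            rows, cs, vals = [], [], []
            for idx, v in block:
                rest = [idx[k] for k in range(M) if k != m]
                dims = [shape[k] for k in range(M) if k != m]
                # Column index: column-major order over the remaining modes.
                col, stride = 0, 1
                for i, d in zip(rest, dims):
                    col += i * stride
                    stride *= d
                rows.append(idx[m]); cs.append(col); vals.append(v)
            T[m] = T[m] + coo_matrix((vals, (rows, cs)),
                                     shape=(shape[m], cols[m])).tocsr()

    U, s, Vt = svds(T[0], k=eta)   # truncated SVD on the mode-1 unfolding
    return T, U

entries = [((0, 1, 0), 1.0), ((0, 1, 2), 1.0), ((1, 0, 1), 1.0),
           ((2, 0, 1), 1.0), ((2, 0, 2), 2.0)]
T, U = ptcd(entries, shape=(3, 2, 4), block_size=2, eta=2)
```

Only one block of entries is ever materialised as a dense structure at a time; the accumulated matrices stay sparse, which is the point of the progressive construction.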

The motivation for this new tensor decomposition algorithm is that other decomposition methods store the fully formed matrices, which are dense and hence cannot scale to very large tensors. PTCD, however, stores the sparse matrices generated progressively and enables further processing to be performed on the tensor. PTCD builds the tensor progressively by unfolding the tensor entries for the user-defined block size b into a sparse matrix T′(m), where mode m ∈ {1, 2, ..., M} and M is the number of modes.


This unfolded matrix, T′(m), is then used to update the final sparse matrix T(m). After all the tensor entries have been added to the final matrix T(m), it is decomposed using SVD, which results in the left singular matrix Uη and the right singular matrix Vη. The tensor decomposition results in a clustering of the data, as proved theoretically and empirically by Huang et al. [49]. K-means clustering is then applied on Uη to generate the required number of clusters c. The purpose of applying K-means clustering after the tensor decomposition is to assign the remaining unclustered data to clusters and also to benchmark the clustering solution against other state-of-the-art clustering methods.
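The decomposition-then-clustering step can be sketched on a toy matrix as follows (an illustrative sketch; scipy's kmeans2 stands in for the K-means implementation actually used, and the data is invented):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Toy document-by-feature matrix with two obvious groups of documents.
X = np.array([[5.0, 4.0, 0.0, 0.0],
              [4.0, 5.0, 0.0, 0.0],
              [0.0, 0.0, 5.0, 4.0],
              [0.0, 0.0, 4.0, 5.0]])

eta = 2                                    # number of retained dimensions
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_eta = U[:, :eta]                         # document coordinates in the reduced space

# K-means on the rows of U_eta assigns each document to one of c clusters.
centroids, labels = kmeans2(U_eta, 2, minit="++", seed=0)
```

Each row of U_eta is one document, so the K-means labels are directly the cluster assignments of the documents.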

5.5 Empirical evaluation

Having discussed the proposed clustering methods, HCX-V and HCX-T, in the previous

sections, this section conducts their empirical evaluation. The main focus of this section

is:

• to empirically investigate the accuracy of the clustering solution; that is, understanding how the proposed clustering methods behave on different datasets;

• to understand how the implicit and explicit combination of the structure and content of XML documents works for the different real-life datasets;

• to provide a basis of comparison for the effect of different types of subtrees on the clustering solution;

• to understand the scalability of the proposed clustering methods;

• to investigate the sensitivity of the parameters used in the proposed methods; and

• to analyse how the clustering solution can be useful in other applications, such as the collection selection problem.


The experiments presented here illustrate the suitability of the proposed methodology

for various types of XML datasets detailed in Chapter 3. The ACM dataset is chosen to

understand how the proposed techniques scale for small-sized datasets and to evaluate the

clustering solution using categories generated from both the structure and the content of

the XML documents. The DBLP dataset is used to compare the efficiency of the proposed

clustering methods with shorter-length documents. Furthermore, the DBLP dataset con-

tains documents from varied sources such as conferences, books, journals, theses, technical

reports and web pages. INEX IEEE, INEX 2007 and INEX 2009 datasets were chosen to

demonstrate the applicability of the proposed methods on large real-life datasets based on

Wikipedia pages. These three datasets have also been used for benchmarking clustering

tasks in the INEX forum. Also, the clustering task submission results on these datasets

were used as benchmarks to evaluate the efficiency of the proposed clustering methods.

The following subsections conduct the analysis of the accuracy, sensitivity, time com-

plexity and scalability of the clustering methods.

5.5.1 Accuracy of clustering methods

One of the goals of developing the clustering methods is to improve the accuracy of the clustering solution for a range of real-life datasets over other representations and state-of-the-art clustering methods. This section analyses the different types of real-life datasets considered in this thesis based on the accuracy of the clustering solution, using the evaluation metrics detailed in Chapter 3: purity, F1-measure and NMI.
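As a reference for reading the results below, the micro- and macro-purity measures can be sketched as follows (standard definitions; the authoritative formulas are those detailed in Chapter 3, and the label data here is invented):

```python
from collections import Counter

def micro_purity(clusters):
    """Micro-purity: majority-label counts summed over clusters,
    divided by the total number of documents."""
    total = sum(len(c) for c in clusters)
    correct = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
    return correct / total

def macro_purity(clusters):
    """Macro-purity: the unweighted mean of the per-cluster purities."""
    per_cluster = [Counter(c).most_common(1)[0][1] / len(c) for c in clusters if c]
    return sum(per_cluster) / len(per_cluster)

# Each inner list holds the true category labels of one cluster's documents.
clusters = [["a", "a", "b"], ["b", "b", "b"], ["c", "a"]]
```

Micro-purity weights clusters by size, so a few large pure clusters dominate it, while macro-purity treats every cluster equally.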

5.5.1.1 ACM dataset

Figure 5.14 shows the clustering results conducted on the ACM dataset using both HCX-V

and HCX-T for 5 categories.


[Figure: scatter plots of (a) Micro- against Macro-purity and (b) Micro- against Macro-F1, and (c) a bar chart of NMI, comparing HCX-T, HCX-V, MACH, Tucker, CP, S+C, CO and SO.]

Figure 5.14: Results of clustering on the ACM dataset using 5 categories

The results clearly indicate that HCX-T performs better than HCX-V for various types

of subtrees. Also, the solution provided by using the Tucker decomposition is not suitable

for this dataset.

Figure 5.15 shows that most of the clustering methods achieve a perfect clustering solution (a score of 1) for the structure-based categories, as this is a simple grouping based on the DTD of the XML documents. However, the S+C combination fails to provide a perfect solution, which shows that it is not suitable when the clustering solution depends only on the structure of the XML documents.

Figures 5.16 and 5.17 provide a comparison of the accuracy of the clustering solution

over the different types of concise frequent subtrees. In this dataset, it is interesting to

note that the use of CFI subtrees, rather than any other type of subtrees, improves the


[Figure: bar charts of (a) Micro-/Macro-purity, (b) Micro-/Macro-F1 and (c) NMI for HCX-T, HCX-V, MACH, Tucker, CP, S+C, CO and SO.]

Figure 5.15: Results of clustering on the ACM dataset using 2 categories

accuracy of the clustering solution in the TSM. However, the clustering solution produced by HCX-T performs much better even with the CFI subtrees. On the contrary, HCX-V performs better with CFIConst subtrees. This could be due to the presence of shorter-length implicit relationships between XML documents; the use of the longer patterns present in the CFI subtrees adversely impacts the accuracy of the HCX-V clustering solution. This shows the effectiveness of CFIConst subtrees over CFI subtrees; as shown in the previous chapter, CFIConst subtrees are also faster to generate.

[Figure: bar charts of (a) Micro-/Macro-purity, (b) Micro-/Macro-F1 and (c) NMI for the CFI, MFI, CFE, MFE, CFIConst, MFIConst, CFEConst and MFEConst subtrees.]

Figure 5.16: Results of clustering on the ACM dataset using different types of subtrees for HCX-V


[Figure: bar charts of (a) Micro-/Macro-purity, (b) Micro-/Macro-F1 and (c) NMI for the CFI, MFI, CFE, MFE, CFIConst, MFIConst, CFEConst and MFEConst subtrees.]

Figure 5.17: Results of clustering on the ACM dataset using different types of subtrees for HCX-T

A comparison of HCX-T was conducted with and without the random indexing option to assess the impact of random indexing. From Table 5.3, it is clear that the random indexing option not only reduces the term size, which is especially important for large datasets, but also improves the accuracy of the clustering solution. Furthermore, the time taken to decompose, given by λd, is reduced for randomly indexed datasets due to the reduced term size in the tensor, as shown in Table 5.3. The impact of dimensionality reduction using RI on the individual clusters is also studied in Figure 5.18; the results clearly show that the reduction does not affect the quality of these solutions.

Table 5.3: Impact of dimensionality reduction on the clustering results on the ACM dataset

Methods            Micro-purity  Macro-purity  Micro-F1  Macro-F1  NMI   λd
HCX-T with RI      0.91          0.92          0.91      0.91      0.82  0.5
HCX-T without RI   0.83          0.93          0.87      0.83      0.73  4.5

[Figure: (a) Micro-/Macro-purity and (b) Micro-/Macro-F1 per cluster (clusters 1 to 5), with and without RI.]

Figure 5.18: Impact of RI on the quality of individual clusters on the ACM dataset


5.5.1.2 DBLP dataset

Next, the clustering methods are evaluated on the medium-sized DBLP dataset. Figure 5.19 shows that on the DBLP dataset, HCX-T performs better than all the other clustering methods. A comparison of the decomposition algorithms reveals that both CP and PTCD perform equally well, but Tucker fails to provide better results, especially given the noticeable drop in its Micro-purity, Macro-F1 and NMI values.

[Figure: scatter plots of (a) Micro- against Macro-purity and (b) Micro- against Macro-F1, and (c) a bar chart of NMI, comparing HCX-T, HCX-V, MACH, Tucker, CP, S+C, CO and SO.]

Figure 5.19: Results of clustering on the DBLP dataset

Figures 5.20 and 5.21 provide a comparison of the accuracy of the clustering solutions generated by HCX-V and HCX-T using various types of subtrees. They again confirm the results observed on the ACM dataset: CFIConst performs better than CFI. It is interesting to note that, in HCX-T, the clustering results obtained with CFE subtrees are similar to those obtained with CFIConst. This could be due to the nature of this


[Figure: bar charts of (a) Micro-/Macro-purity, (b) Micro-/Macro-F1 and (c) NMI for the CFI, MFI, CFE, MFE, CFIConst, MFIConst, CFEConst and MFEConst subtrees.]

Figure 5.20: Results of clustering on different types of concise frequent subtrees on the DBLP dataset using HCX-V

dataset, with shorter trees resulting in fewer ancestor-descendant relationships.

[Figure: bar charts of (a) Micro-/Macro-purity, (b) Micro-/Macro-F1 and (c) NMI for the CFI, MFI, CFE, MFE, CFIConst, MFIConst, CFEConst and MFEConst subtrees.]

Figure 5.21: Results of clustering on different types of concise frequent subtrees on the DBLP dataset using HCX-T

The results provided in Table 5.4 show that, on the DBLP dataset, the random indexing option improves the quality of the clusters and also reduces the decomposition time, confirming the results on the ACM dataset.

Table 5.4: Impact of dimensionality reduction on the clustering results on the DBLP dataset

Methods            Micro-purity  Macro-purity  Micro-F1  Macro-F1  NMI   λd
HCX-T with RI      0.98          0.94          0.97      0.47      0.72  3.4
HCX-T without RI   0.93          0.92          0.93      0.45      0.64  8.6

5.5.1.3 INEX2007 dataset

The results of clustering shown in Table 5.5 reveal that on the INEX 2007 dataset, HCX-V

performs much better than HCX-T. HCX-V is also better than other representations such

as S+C, SO and CO methods.

Table 5.5: Results of clustering on the INEX 2007 dataset

Methods               Micro-Purity  Macro-Purity
HCX-T                 0.27          0.28
HCX-V                 0.59          0.66
S+C                   0.36          0.43
MACH                  0.25          0.25
Tucker                Fails         Fails
CP                    Fails         Fails
CO                    0.55          0.64
SO                    0.25          0.26
CRP [131]             0.44          0.49
4RP [131]             0.42          0.49
GraphSOM [45]         0.26          0.27
Word-descriptor [41]  0.58          0.67

Two main reasons emerge for the decreased accuracy of HCX-T:

• Nature of the categories in this dataset

• Nature of the XML documents

As mentioned in Chapter 3, the 21 categories are based on the content and not on the structure of the XML documents. Furthermore, the XML documents are in eXtensible HyperText Markup Language (XHTML) format rather than plain XML. XHTML is an extension of HTML that uses XML syntax, so these documents have more formatting tags and fewer semantic tags.


The very large number of formatting tags and the small number of semantic tags have impacted the performance of combining both the structure and the content of XML documents in HCX-T on this dataset. HCX-V utilises the structure implicitly and its clustering representation uses the content; this has resulted in HCX-V performing better than HCX-T. Furthermore, by excluding the content of infrequent subtrees, the clustering solutions produced by HCX-V perform better than CO.

Figure 5.22 shows that, amongst the different types of concise frequent subtrees, the CFI subtrees perform best when used in clustering.

[Figure: bar charts of (a) Micro-/Macro-purity, (b) Micro-/Macro-F1 and (c) NMI for the CFI, MFI, CFE, MFE, CFIConst, MFIConst, CFEConst and MFEConst subtrees.]

Figure 5.22: Results of clustering methods using different types of concise frequent subtrees on the INEX 2007 dataset using HCX-V

A comparison of the various subtrees using the TSM was not conducted as there was

minimal difference (≈ ± 0.002) in their accuracy.


5.5.1.4 INEX IEEE dataset

A comparison of the clustering methods on the INEX IEEE dataset, shown in Table 5.6, indicates that the proposed HCX-T is not only scalable to this large and deeply structured dataset but also provides a more accurate clustering solution than other clustering methods that use both structure and content.

Table 5.6: Results of clustering on the INEX IEEE dataset using 18 categories

Methods             Micro-F1  Macro-F1
HCX-T               0.35      0.34
HCX-V               0.29      0.27
S+C                 0.27      0.21
CP                  Fails     Fails
Tucker              Fails     Fails
MACH                0.21      0.20
CO                  0.19      0.16
SO                  0.14      0.12
Nayak et al. [31]   0.18      0.12
Doucet et al. [31]  0.35      0.29
SOM-SD [60]         0.38      0.34
CSOM-SD [60]        0.13      0.09

This dataset contains categories based on both structure and content, and the results clearly indicate that techniques using only the content fail to perform well. This dataset has been chosen for the sensitivity analysis; further results are discussed in Section 5.5.2.

5.5.1.5 INEX 2009 dataset

The experimental results for clustering on the INEX 2009 dataset shown in Table 5.7

reveal that HCX-V performs the best of all the methods. The clustering methods using

the decomposition algorithms Tucker and CP fail to scale for this dataset even with the

random indexing applied on it. This is due to the very large number of documents (54K);

however, the PTCD algorithm could decompose the large tensor in less than 100 seconds.

This shows the scalability of the PTCD algorithm. However, the clustering results produced by HCX-T are lower in quality than those of HCX-V. This shows that implicitly capturing the relationship between the content and the structure is well suited to this


dataset. Though the tags are semantic in nature, there is a very large number of tags (34,686) and hence the direct mapping between the structure and the content is lost.

Table 5.7: Results of clustering on the INEX 2009 dataset

Methods         Micro-purity  Macro-purity
HCX-V           0.49          0.53
HCX-T           0.39          0.40
S+C             0.35          0.36
Tucker          Fails         Fails
CP              Fails         Fails
MACH            0.36          0.34
CO              0.38          0.38
SO              0.36          0.35
BilWeb-CO [81]  0.37          0.38

An evaluation of the clustering solution using the collection selection measure, NCCG, in Figure 5.23 shows that, again, HCX-V performs better than the other methods. By searching only 20% of the documents, HCX-V could achieve NCCG scores of up to 0.8. Also, HCX-T performs much better than the clustering methods using the linear combination of structure and content features and the methods using only one feature.

[Figure: NCCG values against the percentage of documents searched on the INEX 2009 Wikipedia dataset for HCX-V, BilWeb-CO, HCX-T, CO, SO and S+C.]

Figure 5.23: A comparison of the NCCG values of the different clustering methods on the INEX 2009 dataset

A comparison conducted in Figure 5.24 of the number of clusters against the NCCG values demonstrates that, as the number of clusters increases, the NCCG value drops. This shows that, for this dataset, a smaller number of clusters suits the proposed clustering methods.

In order to visualise the efficiency of the clustering methods with respect to the number


[Figure: NCCG values (approximately 0.68 to 0.86) against the number of clusters (up to 1000).]

Figure 5.24: A comparison of the number of clusters on the NCCG values

of clusters searched to identify the relevant documents, cumulative recall plots were used 1. These plots use the percentage of clusters searched and the fraction of relevant documents. In these plots, a recall value is observed only once the documents in a cluster have been seen, which implies that until the first cluster is searched the recall value remains 0. This plot, shown in Figure 5.25, clearly demonstrates that HCX-V achieves an earlier and higher recall in comparison to the other methods. It is also interesting to note that by searching only 3% of the clusters, nearly 75% of the relevant documents can be found, which shows that the cluster hypothesis proposed by van Rijsbergen [92] holds.
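The cumulative recall curve behind these plots can be sketched as follows (an illustrative sketch; ranking clusters by their relevant-document count is one plausible ordering, not necessarily the one used for the official INEX plots, and the counts are invented):

```python
def cumulative_recall(relevant_per_cluster, total_relevant):
    """Recall after fully searching the best 1, 2, ... clusters,
    with clusters ranked by their number of relevant documents."""
    ranked = sorted(relevant_per_cluster, reverse=True)
    curve, seen = [], 0
    for count in ranked:
        seen += count
        curve.append(seen / total_relevant)
    return curve

# Five clusters holding 8, 1, 1, 0 and 0 of the 10 relevant documents.
curve = cumulative_recall([1, 8, 0, 1, 0], total_relevant=10)
```

A steep early rise in the curve indicates that a few clusters concentrate most of the relevant documents, which is exactly the behaviour reported for HCX-V above.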

A further analysis of the two large categories for the ad hoc queries with topic ids

2009005 and 2009043 (detailed in Table 5.8) is shown in Figures 5.26 and 5.27.

Table 5.8: Details of ad hoc queries with large categories

Topic Id  Title                                       Query
2009005   chemists physicists scientists alchemists   //article[about(., periodic table elements chemists
          periodic table elements                     physicists scientists alchemists)]
                                                      (person|chemist|alchemist|scientist|physicist)
2009043   NASA missions                               //group[about(.//missions, NASA)]

These figures clearly show that, for topic ids 2009005 and 2009043, the HCX-V method could identify 70% and 50% of the relevant documents respectively by searching only 1% of the clusters. It should be noted that, by identifying these relevant clusters, searching of the

1. Detailed cumulative recall plots can be found at http://inex.de-vries.id.au/media/other/cumulative recall/subset/


entire document collection can be avoided, which in turn helps to improve the efficiency of

retrieving the documents.

[Figure: fraction of relevant documents against the percentage of clusters searched for HCX-V, HCX-T, S+C, BilWeb-CO, CO and SO.]

Figure 5.25: A comparison of the different clustering methods on the INEX 2009 dataset using cumulative recall

[Figure: fraction of relevant documents against the percentage of clusters searched; distribution of relevant documents over clusters for topic 2009005 on INEX 2009.]

Figure 5.26: Cumulative gain for the topic id 2009005

5.5.2 Sensitivity analysis

In order to analyse the sensitivity of the proposed methods, the INEX IEEE dataset was chosen. There are two reasons for this choice: (1) it is one of the large-sized datasets and has longer patterns; and (2) the categories in the INEX IEEE dataset were based on both the


[Figure: fraction of relevant documents against the percentage of clusters searched; distribution of relevant documents over clusters for topic 2009043 on INEX 2009.]

Figure 5.27: Cumulative gain for the topic id 2009043

structure and the content; this dataset was utilised to analyse the sensitivity of the length constraint and min_supp values on the purity. Experiments were conducted by varying the length constraint (const) of the CF subtrees from 3 to 10 for support thresholds from 10% to 30%. Figure 5.28 indicates that, with the increase in the length constraint, the micro-purity and macro-purity values drop, especially at the 10% and 30% support thresholds. Moreover, a length constraint of over 7 shows a negative impact on the purity. With longer patterns, the content corresponding to the CF subtrees becomes more specific and hence results in lower accuracy than the content corresponding to shorter subtrees. This shows the suitability of constraining the CF subtrees in PCITMinerConst.

[Figure: (a) micro-purity and (b) macro-purity against the length of CFI subtrees (3 to 10) for min_supp values of 10%, 20% and 30%.]

Figure 5.28: Sensitivity of the length constraint on the micro- and macro-purity values for the INEX IEEE dataset


5.5.3 Time complexity analysis

This section analyses the time complexity of HCX-V and HCX-T. That of HCX-V comprises two components: the extraction of content using concise frequent subtrees and the subsequent application of clustering. The time complexity of HCX-V is determined as O(dm) + O(dt), where d represents the number of documents, m is the number of concise frequent subtrees, and t is the number of terms.

The time complexity of HCX-T comprises four major components: the clustering of CF subtrees, random indexing, matricization and decomposition in PTCD. This is determined as O(drp) + O(tkγ) + O(drγ) + O(dk′γ), where r is the number of structure-based clusters, p is the number of similarity computation iterations, γ is the size of the random index vector, and k and k′ are the non-zero entries per column in the tensor before random indexing and in the matricized tensor after random indexing, respectively. The time complexity of PTCD is O(drγ) + O(dk′γ), which includes the costs of matricization along the n-mode and of the sparse Singular Value Decomposition (SVD) [16, 90], respectively.

5.5.4 Scalability analysis

All the real-life datasets (ACM, DBLP, INEX IEEE, INEX 2007 and INEX 2009) were used for the scalability analysis, with the number of clusters set equal to the number of categories (5, 8, 18, 21 and 40 respectively). This analysis was conducted using 1000 documents, with min_supp at 10%, const at 5, and the value of γ at 100. The reason for this setting is to understand how the proposed clustering method HCX-T performs with PTCD on datasets of an extreme nature.

It can be seen from Figures 5.29, 5.30 and 5.31 that both HCX-T and PTCD scale nearly linearly with the dataset size. The PTCD algorithm includes two main steps: (1) loading


[Figure: time (in secs) against the replication factor (2 to 10) for the ACM, DBLP, INEX IEEE, INEX 2007 and INEX 2009 datasets.]

Figure 5.29: Scalability of HCX-T

[Figure: time (in secs) against the replication factor (2 to 10) for the ACM, DBLP, INEX IEEE, INEX 2007 and INEX 2009 datasets.]

Figure 5.30: Scalability of PTCD

[Figure: time (in secs) against the replication factor (2 to 10) for the ACM, DBLP, INEX IEEE, INEX 2007 and INEX 2009 datasets.]

Figure 5.31: Scalability of the decomposition in PTCD

the tensor file into memory by matricization, and (2) decomposing the matrices using SVD. It can be seen from Figures 5.30 and 5.31 that minimal time is spent on the decomposition, while a large portion of the time is spent on loading the tensor file into memory.


5.6 Discussion and summary

This section discusses and summarises the empirical evaluation conducted for the cluster-

ing methods based on the proposed HCX methodology.

• Comparison of Structure Only (SO), Content Only (CO) and Structure and Content (S+C) features for clustering, and the effect of the proposed clustering methods HCX-V vs HCX-T.

Evaluating the accuracy of the clustering solutions produced on all the datasets using SO, CO, S+C, HCX-V and HCX-T reveals that the non-linear combination of the structure and the content features performs better than using only one feature or linearly combining the two feature matrices as in the S+C method. This shows that a relationship exists between the structure and the content of XML documents, and utilising this relationship improves the accuracy of the clustering solution.

The experimental evaluation on the chosen five real-life datasets reveals that HCX-T

performs better in three datasets, ACM, DBLP and INEX IEEE, and HCX-V on the

rest, INEX 2007 and INEX 2009. It is interesting to note that the HCX-V method

performs better in comparison to HCX-T for the two Wikipedia datasets used in

this research.

This raises the issue of when to choose HCX-V over HCX-T for a new dataset. For datasets with fewer semantic tags, where the desired grouping is based more heavily on the content of the documents than on their structure, HCX-V can be used. However, if a dataset has more semantic tags than syntactic tags and the desired grouping is based on both the structure and the content of the XML documents, then HCX-T can be used.


• Comparison of the proposed methods with state-of-the-art clustering methods

Figure 5.32 shows the comparison of the proposed clustering methods with state-of-the-art clustering methods. As these results are from the INEX forum, the clustering submissions for each of the datasets were not evaluated on all the evaluation metrics. For instance, the submissions for the INEX IEEE dataset were evaluated only on Macro-purity, those for INEX 2007 on Micro- and Macro-F1 values, and those for INEX 2009 on Micro- and Macro-purity values.

[Two bar charts (y-axis 0 to 0.7): (a) Micro- values for HCX-V, HCX-T, CRP, 4RP, SOM and BilWeb-Co on INEX 2007 and INEX 2009; (b) Macro- values for HCX-V, HCX-T, Tran et al., Doucet et al., CRP, 4RP, SOM and BilWeb-Co on INEX IEEE, INEX 2007 and INEX 2009.]

Figure 5.32: Comparison of the proposed clustering methods over the state-of-the-art clustering methods on the large-sized datasets

These results clearly show that HCX-V outperforms state-of-the-art

clustering methods such as CRP, 4RP and SOM for the INEX 2007 dataset. It also

outperforms the Bilweb-Co method [81] on the INEX 2009 dataset. On the other

hand, HCX-T outperforms the methods proposed by Tran et al. (PCXSS) [84] and

Doucet et al. [36], which shows that the proposed clustering methods improve the

accuracy over other clustering methods that use only the structure features or that

linearly combine the structure and content.

• Comparison of the different types of node relationships (induced and embedded) of the frequent subtrees in clustering

Figures 5.33 and 5.34 show the relative ranking of the different types of concise frequent subtrees in clustering. The relative ranking is obtained by setting a scale of 1 to 8


where the best performing method gets a value of 8 and the worst method gets a

value of 1 for both HCX-V and HCX-T. However, for the INEX IEEE and INEX

2009 datasets, the range was set from 1 to 4, as full concise frequent subtrees could

not scale to these datasets.
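The ranking scheme described above can be expressed as a small helper; the scores below are invented for illustration, and ties are ignored for simplicity.

```python
def relative_ranks(scores):
    """Map {method: score} to {method: rank}; best score gets highest rank."""
    ordered = sorted(scores, key=scores.get)          # worst first
    return {method: i + 1 for i, method in enumerate(ordered)}

# Hypothetical accuracy scores for four subtree types on one dataset.
acm_scores = {"CFI": 0.61, "MFI": 0.48, "CFE": 0.52, "MFE": 0.45}
acm_ranks = relative_ranks(acm_scores)                # best type ranks highest
```

Summing such per-dataset ranks across datasets (and across HCX-V and HCX-T) yields the stacked totals plotted in Figures 5.33 and 5.34.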

[Bar chart (relative ranking, 0 to 80) of concise frequent subtree types, with series for the ACM, DBLP, INEX 2007, INEX IEEE and INEX 2009 datasets.]

Figure 5.33: Comparison of different types of concise frequent subtrees on clustering based on datasets

[Bar chart (relative ranking, 0 to 80) of CFI, MFI, CFE, MFE, CFIConst, MFIConst, CFEConst and MFEConst subtrees, grouped by dataset.]

Figure 5.34: Comparison of different types of concise frequent subtrees on clustering

These figures indicate that, in most of the datasets, clustering using CFIConst subtrees performs much better than using other types of subtrees. Although CFI subtrees provide the best result on the ACM dataset with HCX-T, the relative rank aggregates not only the rankings based on HCX-T but also those based on HCX-V, on which CFI subtrees perform poorly. Similarly, although clustering methods using CFEConst subtrees perform relatively better than those using MFIConst on the three datasets ACM, DBLP and INEX 2007, they fail to provide similar results on the INEX IEEE and INEX 2009 datasets.

These results demonstrate that, for the chosen XML datasets, in most instances induced subtrees not only provide a more accurate solution than embedded subtrees but also take less time for content extraction. The poorer performance of embedded subtrees could be due either to the extraction of too many hidden relationships, which adds noise, or to the fact that these embedded relationships are not suitable for identifying the correct categories.

• Impact of the proposed tensor decomposition and state-of-the-art decomposition algorithms

The comparison of the performance of all the decomposition algorithms on the various datasets is shown in Figure 5.35, with relative ranks ranging from 1 to 4, with 4 being the best score. This shows that the PTCD algorithm, with its progressive tensor building and decomposition, is not only scalable but is also able to provide more accurate results than the state-of-the-art algorithms, in particular the popular Tucker decomposition algorithm [113].

[Bar chart (relative ranking, 0 to 20) of the PTCD, CP, MACH and Tucker decomposition algorithms, grouped by dataset.]

Figure 5.35: Comparison of tensor decomposition algorithms

On the ACM and DBLP datasets, the CP decomposition algorithm performs on a par with PTCD in HCX-T. In spite of its potential to provide accurate results, CP fails to scale to any of the large-sized datasets, even after reducing the term space to approximately 1000 terms with the random indexing option.

Another decomposition algorithm, MACH [112], applies random indexing to reduce the number of tensor entries and is especially designed to suit dense datasets; although it could scale to the large-sized datasets, it often results in poor performance. Because some datasets have small-sized documents in some categories, applying this algorithm can remove a number of tensor entries in those categories, resulting in the small-sized categories being combined together. This shows the scalability and efficiency of PTCD over the state-of-the-art decomposition algorithms.

• Impact of Random Indexing in HCX-T

The option of random indexing (RI) is proposed for models using the TSM to reduce the term space for large-sized datasets. However, the experimental results shown in Tables 5.3 and 5.4 reveal that the clustering results produced using RI are better than those produced without RI even for the small and medium-sized ACM and DBLP datasets, which do not require size reduction. Apart from improving the accuracy, using RI helps to reduce the decomposition times, even for the small and medium-sized datasets.
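As an aside, the random indexing technique itself can be sketched roughly as follows. This is a generic, assumed formulation (sparse ternary index vectors summed per document), not the thesis implementation; the dimensions and vocabulary are toy values.

```python
import numpy as np

def index_vector(dim, seed_length, rng):
    """Sparse ternary vector: `seed_length` entries set to +1 or -1."""
    v = np.zeros(dim)
    positions = rng.choice(dim, size=seed_length, replace=False)
    v[positions] = rng.choice([-1.0, 1.0], size=seed_length)
    return v

rng = np.random.default_rng(0)
dim, seed_length = 100, 4                 # reduced space and seed length
vocab = ["xml", "cluster", "tensor", "subtree"]
term_index = {t: index_vector(dim, seed_length, rng) for t in vocab}

# A document's reduced representation is the sum of its terms' index vectors.
doc_terms = ["xml", "tensor", "tensor"]
doc_vector = sum(term_index[t] for t in doc_terms)
```

The term space thus shrinks from the vocabulary size to `dim` columns, which is what makes the subsequent tensor decomposition tractable.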

However, an interesting problem in using RI is how to choose the value γ for the seed length in random indexing. The Johnson-Lindenstrauss result can be used to obtain bounds for γ, given by γ = ⌈4(ε²/2 − ε³/3)⁻¹ ln n⌉ [28].
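Read as a small helper, the bound gives the following (an illustrative reading of the formula, with hypothetical n and ε):

```python
import math

def jl_dimension(n, eps):
    """JL bound: target dimension preserving pairwise distances within (1 +/- eps)."""
    return math.ceil(4.0 * math.log(n) / (eps ** 2 / 2.0 - eps ** 3 / 3.0))

# e.g. a hypothetical 50,000-document collection with a loose eps of 0.5
gamma = jl_dimension(n=50000, eps=0.5)
```

Note that the bound grows only logarithmically in the number of documents n but sharply as the distortion tolerance ε shrinks.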

• Impact of feature reduction using HCX

One of the objectives of this research is to combine the structure and the content of XML documents in a non-linear approach that aids in feature reduction. Hence, in order to clearly understand the impact of the proposed HCX methodology on reducing the features, an analysis was conducted.

It is evident from Figures 5.36 and 5.37 that, in spite of a drastic reduction in the number of features for the proposed clustering methods, there is an improvement in performance over the CO and S+C representations.

[Bar chart (average of purity, F1 and NMI, 0 to 1) for HCX-V, HCX-T, CO, S+C and SO on each dataset.]

Figure 5.36: A comparison of the average of all metrics in the chosen real-life datasets

[Bar chart (number of terms, log scale from 1 to 10,000,000) for HCX-V, CO and S+C on each dataset.]

Figure 5.37: A comparison of the number of terms in the chosen real-life datasets

• Comparison of different types of concise frequent subtrees in clustering

In most of the datasets, among the concise frequent subtrees, closed frequent induced subtrees provide better results than the other types of subtrees, except on the DBLP dataset. Though MFI subtrees are fewer and can be produced in almost the same time as CFI subtrees, their clustering performance on some datasets is low, as some of the common subtrees are not determined and their corresponding content is therefore not used in clustering. The drop in performance is noticeable on the ACM dataset with HCX-T and on INEX 2007 with HCX-V.

Let us analyse the two concise representations, maximal and closed frequent subtrees, to understand the poor performance of MFI relative to CFI. Let there be a document tree dataset DT on which applying frequent subtree mining with a given support threshold s results in a frequent subtree result set O = {DT′1, DT′2, DT′3}. It contains three frequent subtrees having supports of s + 1, s + 1 and s respectively, where DT′1 ⊂t DT′2 and DT′2 ⊂t DT′3. Applying the definition of closed frequent subtrees to the result set O, supp(DT′1) = supp(DT′2) and DT′2 ⊃t DT′1, so DT′1 can be removed from the output, as its supertree DT′2 includes the information contained in DT′1. As supp(DT′2) ≠ supp(DT′3), the frequent subtree DT′2 is considered closed, since there are some document trees which contain DT′2 but not DT′3. Also, no frequent supertree of DT′3 exists; therefore DT′3 is closed. Hence, DT′2 and DT′3 are the two closed frequent subtrees in DT.

The three subtrees DT′1, DT′2 and DT′3 then need to be checked to see which ones are maximal. According to the definition of maximal frequent subtrees, DT′3 is the only maximal frequent subtree, because DT′3 ⊃t DT′1 and DT′3 ⊃t DT′2, and maximal frequent subtrees do not consider the difference in support values. That is the reason for the reduced number in comparison to the closed subtrees (that is, two). The total number of output patterns is smaller under the maximal frequent subtree representation; however, this representation suffers from information loss. Only s document trees contain DT′3, so by using DT′3 for extracting content, the document trees that do not contain DT′3 but contain only DT′2 will be ignored. This information loss could therefore account for the poor accuracy observed for the maximal subtrees in clustering.

Therefore, the comparison of these two concise representations reveals that though the maximal frequent subtree representation provides a reduced pattern set, it can result in information loss. In contrast, the closed frequent subtree representation provides a concise pattern set without any information loss, since the closure property eliminates only the redundant information. This clearly explains the poor accuracy of maximal frequent subtrees relative to closed frequent subtrees.
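The worked example above can be mirrored with a small filter over (pattern, support) pairs. As a stand-in for the subtree relation ⊂t, the sketch below uses the strict-subset relation over sets of labelled edges; the supports are toy values chosen to respect the anti-monotonicity of frequent subtrees.

```python
# Frozensets of labelled edges stand in for subtrees; `<` is strict subset.
patterns = {
    frozenset({"a-b"}): 6,                # plays the role of DT'1
    frozenset({"a-b", "a-c"}): 6,         # DT'2: same support as DT'1
    frozenset({"a-b", "a-c", "c-d"}): 5,  # DT'3: lower support
}

def closed(patterns):
    """Keep patterns that have no strict superpattern of equal support."""
    return {p for p, s in patterns.items()
            if not any(p < q and s == patterns[q] for q in patterns)}

def maximal(patterns):
    """Keep patterns that have no strict frequent superpattern at all."""
    return {p for p in patterns if not any(p < q for q in patterns)}

closed_set = closed(patterns)     # DT'2 and DT'3 survive
maximal_set = maximal(patterns)   # only DT'3 survives: DT'2 is lost
```

The maximal filter discards the equal-support distinction, which is exactly the information loss discussed above.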

• Impact of constraint length of the frequent subtrees on clustering

The sensitivity analysis reveals that with the increase in constraint length, a drop in performance occurs on the INEX IEEE dataset; however, the drop is not noticeable on the other datasets. The time taken to perform the experiments increases dramatically, as it takes longer to extract the content corresponding to longer subtrees. Table 5.9 shows the constraint lengths that gave the best results for the concise frequent subtrees.

Table 5.9: Constraint lengths for the real-life datasets

Dataset      Constraint length
ACM          22
DBLP         3
INEX 2007    18
INEX IEEE    5
INEX 2009    6

• Impact of support threshold

The sensitivity analysis shows that when the support threshold is increased beyond 30% of the dataset, the accuracy drops. This is due to the fact that there are fewer subtrees, and the corresponding content is too little to provide a good clustering solution. However, in most situations there is little variation in accuracy with changes in the support threshold. It should be noted that decreasing the support threshold results in a very large number of frequent subtrees; however, with the strength of concise frequent subtrees, this large number can be controlled. Also, the ability to mine at a lower support threshold helps in identifying a large number of fine-grained clusters.

• Comparison of the weighting schemes tf-idf and BM-25


A comparison of the two weighting schemes, tf-idf and BM-25, as shown in Figure 5.38, reveals that BM-25 performs well in comparison to tf-idf on most of the datasets. The difference is most noticeable on the INEX 2009 dataset, with about a 5% improvement over tf-idf.
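The two schemes can be contrasted with a minimal sketch using one common formulation of each; k1 = 1.2 and b = 0.75 are conventional BM25 defaults, not values reported in this thesis, and all counts are toy values.

```python
import math

def tf_idf(tf, df, n_docs):
    """Plain tf-idf with a logarithmic idf."""
    return tf * math.log(n_docs / df)

def bm25(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 weight for a single term, with length normalisation."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

w_tfidf = tf_idf(tf=3, df=10, n_docs=1000)
w_bm25 = bm25(tf=3, df=10, n_docs=1000, doc_len=120, avg_len=100)
```

Unlike tf-idf, BM25 saturates in term frequency and normalises by document length, which is one commonly cited reason it can outperform tf-idf on collections with heterogeneous document sizes.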

[Bar chart (average of purity, F1 and NMI, 0 to 1) for BM-25 and TF-IDF on each dataset.]

Figure 5.38: A comparison of the weighting schemes - tf-idf and BM-25

• Impact of clustering on information retrieval

With the availability of real ad hoc retrieval queries and their manual assessment results for the INEX 2009 dataset, this dataset was chosen for evaluating the effect of clustering on information retrieval. The results clearly show that HCX-V is effective in clustering, identifying relevant documents earlier than the other clustering methods. This also demonstrates the effectiveness of clustering for information retrieval.

5.7 Conclusion

This chapter has presented a clustering methodology which uses two models to combine

the structure and the content of the XML documents. It has also conducted a series

of experiments that empirically evaluates HCX in the context of clustering the XML

documents. The purpose of the experiments is to show the effectiveness, accuracy, scalability and applicability of the proposed methodology on various real-life XML datasets.


The first section presented the overall HCX methodology, with each of the two models

described in detail in the subsequent two sections.

The contributions made in this chapter can be summarised as:

• a clustering method (HCX-V) using the VSM to combine the structure and the content implicitly;

• a clustering method (HCX-T) that utilises the tensor model to efficiently combine the content and structure features of XML documents;

• a randomised tensor decomposition algorithm for large-sized tensors; and

• an experimental analysis of the proposed methods for their accuracy, time complexity and scalability.

The experimental results over the real-life datasets show that by adopting the HCX

methodology for clustering the XML documents, the accuracy of the clustering solution

can be improved as it captures non-linearly the relationship between the structure and

the content of the XML documents. Moreover, this methodology helps in reducing the

number of terms used for clustering and hence improves the performance of the clustering

methods.

The successful application and competitive results of HCX, even on very large-sized datasets, demonstrate its effectiveness on real-life XML datasets: on datasets with a large number of documents (such as INEX 2007 and INEX 2009), on datasets with longer trees (such as INEX IEEE and INEX 2009), and on datasets whose categories are based on even a single feature (such as ACM, whose two categories are structure-based, and INEX 2007, whose categories are content-based).

Most importantly, the consistency of the results obtained in both the frequent pattern mining and clustering methods shows that this complete framework is suitable for real-life datasets. The improvement in accuracy of the clustering methods over many state-of-the-art methods demonstrates that combining structure and content of XML documents for

clustering is useful in domains such as information retrieval for the problem of collection

selection.


Chapter 6

Conclusion

6.1 Overview

With the increasing popularity of XML, there has been an explosion in the number of XML documents, both on the Internet and in intranets. In order to effectively manage these large collections

of XML documents, clustering has been identified as an effective solution. However, there

exist several challenges in clustering of XML documents that need to be addressed in order

to provide an accurate clustering solution. Hence, the main objective of this research is to

improve accuracy by exploring and developing different clustering methods utilising both

the structure and the content of XML documents on real-life datasets.

This research has proposed clustering methods to capture the structure and the content

of XML documents. These clustering methods efficiently capture only the concise frequent

structures of the XML documents generated using the proposed frequent pattern mining

methods. Also, this research has investigated the effect of using the content-only, structure-

only, and content-and-structure features using various models for clustering the XML

documents. Furthermore, the research also applies the results of the proposed clustering

methods to improve the effectiveness of collection selection in ad hoc information retrieval.


6.2 Summary of contributions

This thesis provides an overview of the XML data, XML frequent pattern mining methods

and the clustering methods using various models such as the VSM and the TSM. Based

on a literature review of current work, the following shortcomings were noted:

• Lack of efficient concise frequent subtree mining methods suitable for real-life datasets;

• Lack of non-linear approaches to combine the structure and the content of XML documents in clustering; and

• Lack of efficient dimensionality reduction methods to combat the increase in dimensionality due to the combination.

This thesis focused on overcoming these shortcomings by proposing a number of efficient concise frequent subtree mining methods and XML clustering methods using novel representation models.

The main contributions are summarised below:

• Developed concise frequent pattern mining methods based on different types of subtrees.

This thesis has bridged the gap found in the literature about frequent pattern min-

ing methods by proposing varied types of subtrees based on the node relationship

and conciseness. It has also proposed these methods using the prefix-based pattern

growth approach that is particularly suited for dense datasets. Furthermore, this

thesis has proposed methods to generate new types of concise frequent subtrees based

on their length, that are specifically appropriate for many large-sized datasets. It

has also compared the efficiency of the frequent pattern mining methods with many

state-of-the-art methods to portray the efficiency of the proposed methods.


• Evaluated the efficiency of concise frequent subtrees in clustering.

This research has evaluated the efficiency of utilising concise frequent subtrees in the

clustering process. It has also conducted an in-depth analysis of the nature of each

of the subtrees and their effectiveness in obtaining the clustering solution.

• Developed a novel methodology of combining the structure and the content of XML

documents both implicitly and explicitly.

Instead of adopting the traditional linear method for combining the structure and

content of XML documents, this thesis has proposed a novel methodology of using

structure features to derive content features. Doing so not only helps to capture the

relationship between these two features, but also reduces the dimension resulting

from the combination. The proposed non-linear approach could also be extended to

other domains for combining two or more features.
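The structure-derived content idea can be illustrated schematically: keep only the text occurring under element paths that belong to the frequent structure, and discard the content of infrequent substructures. The document, paths and vocabulary below are invented for illustration; the actual HCX feature extraction operates over mined subtrees rather than flat paths.

```python
# Toy XML document flattened to path -> text (hypothetical paths).
doc = {
    "article/title": "xml clustering with tensors",
    "article/abstract": "combining structure and content",
    "article/note": "internal draft comment",        # infrequent substructure
}
frequent_paths = {"article/title", "article/abstract"}  # from subtree mining

def derived_content(doc, frequent_paths):
    """Content features restricted to the frequent structure."""
    terms = []
    for path, text in doc.items():
        if path in frequent_paths:
            terms.extend(text.split())
    return terms

features = derived_content(doc, frequent_paths)   # "draft" never appears
```

Restricting content to the frequent structure is what simultaneously couples the two feature types and shrinks the term space.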

Also, in this thesis a novel method of using a multi-dimensional data model, the

Tensor Space Model (TSM), to capture the two features of the XML documents

explicitly has been proposed. The results of the experiments also indicate that

capturing the features explicitly helps in improving the accuracy.

• Proposed a progressive tensor decomposition algorithm to effectively scale to very

large numbers of documents.

This thesis has introduced a novel and scalable tensor decomposition method for decomposing very large datasets. In many situations, particularly for large dense datasets, tensors could not be used because the existing decomposition algorithms fail to scale. However, the proposed PTCD algorithm has made feasible the decomposition of tensors built even for large-sized datasets of about 50K documents. Furthermore, the results obtained on the small-sized datasets were comparable with those of the popular decomposition algorithm, CANDECOMP/PARAFAC (CP).

• Studied the impact of clustering on ad hoc retrieval using a very large real-life dataset.

On the collection selection problem, the cluster hypothesis by Rijsbergen [92] was

studied using the relevant documents for a given query from manual assessors. The

overall results from this study established that

– this type of evaluation confirmed that the cluster hypothesis holds; and

– the clustering solution using both the structure and the content of XML documents provides better results than using only one feature.

6.3 Summary of findings

The main findings from this thesis can be summarised as:

• HCX provides improved accuracy (up to 12% improvement) compared to structure-only, content-only, the linear combination of structure and content, and other state-of-the-art approaches;

• HCX-T shows an improvement in accuracy over HCX-V for the XML datasets that have the following characteristics:

– A strong relationship between structure and content features;

– Categories relying upon using both structure and content; and

– More semantic tags than formatting tags;

• HCX-V is preferred over HCX-T for the XML datasets with the following characteristics:

– A weak relationship between structure and content features;


– Categories based mostly on the content; and

– A combination of both formatting and semantic tags.

• With respect to the effectiveness of concise frequent subtrees based on node relationship in clustering, induced subtrees are preferred over embedded subtrees. In particular,

Closed Frequent Induced (CFI) subtrees are preferred over Maximal Frequent In-

duced (MFI) subtrees since an information loss occurs with the use of MFI in clus-

tering. Additionally, the length-constrained subtrees demonstrate an improvement

in accuracy when the complete concise subtrees cannot be generated, especially for

larger trees.

• By eliminating the content corresponding to infrequent substructures, the dimensionality of the input (content) data is reduced and the accuracy of the clustering

solution is improved.

6.4 Limitations and future extensions

This research focuses mainly on the clustering of XML documents using a tree-based

model. Several extensions can be made to improve these currently proposed methods in

the future, such as extending the clustering methods so that they could be applied to

different types of semi-structured data.

Since this research has explored the use of the structure and content of XML documents for clustering, future research should focus on using other features, such as links between the documents and semantic relationships between the documents, while creating the TSM, in order to further improve the accuracy, and on studying the impact of these features on the clustering accuracy of these documents.

The accuracy of the proposed methods suffers on datasets which contain more formatting tags than semantic tags; hence future work can include an automatic semantic

tagging system to create more meaningful tags. The YAGO ontology used to create se-

mantic tags in the INEX 2009 dataset could be used to create semantic tags for the other

XML datasets.

The proposed frequent pattern mining methods were able to mine a large document collection with a good response time; however, there is always room for improvement, especially in efficiency. With improvements in computing resources, the frequent pattern mining methods could be made more efficient by parallelising them. The nature of frequent pattern mining methods based on the prefix-based pattern-growth approach strongly supports the idea that each of the projections based on the frequent patterns could be mined in parallel. This would improve the performance of these methods and alleviate the problem of running them on a single machine.

Further future work could also include mining for frequent patterns such as concise sequential subtrees and episode subtrees from datasets which contain sequential information. This thesis has utilised frequent subtree mining in clustering; future work will also focus on creating tree-based association rules with structure and content information to aid in information retrieval. In addition, future work will attempt to utilise Non-negative Matrix Factorization (NMF) in the proposed PTCD algorithm instead of Singular Value Decomposition (SVD), as NMF places no orthogonality constraint on the derived semantic space [71]. Moreover, NMF uses only non-negative values for all the latent semantic vectors [125] and thus could further reduce the computational complexity.
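As a sketch of the NMF alternative mentioned above, the classic multiplicative-update rules can be written as follows; the thesis does not specify an NMF variant, so this is a generic, assumed formulation with toy matrix sizes.

```python
import numpy as np

def nmf(V, rank, iters=200, eps=1e-9):
    """Factor non-negative V (m x n) as W (m x rank) @ H (rank x n)."""
    rng = np.random.default_rng(0)
    m, n = V.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        # Multiplicative updates keep W and H non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.random.default_rng(1).random((12, 8))  # toy non-negative matrix
W, H = nmf(V, rank=3)
```

The factors stay non-negative by construction, which is the property that avoids the orthogonality restriction of SVD noted above.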


6.5 Final remarks

Clustering of XML documents has been quite popular among researchers in recent years.

The newly proposed ideas of combining the structure and content of XML documents has

generated a great deal of interest for clustering XML documents. The use of frequent

subtree mining for generating tree summaries and utilising them for clustering opens the

possibility of gaining more useful representations. This research increases the potential

of applying frequent pattern mining to various other aspects of knowledge discovery and

data mining tasks. The importance of the work presented in this thesis has been demonstrated through publications in conferences, book chapters and workshops. This chapter has

summarised the key findings and the contributions it has made to the research commu-

nity. It has also identified various future extensions in both frequent pattern mining and

clustering that could be applied to the proposed methods.


Bibliography

[1] International standard ISO 8879: Information processing - text and office systems -

standard generalised markup language (SGML), 1986.

[2] S. Abiteboul, P. Buneman, and D. Suciu. Data on the web: from relations to semi-

structured data and XML. Morgan Kaufmann, San Francisco, California, 2000.

[3] D. Achlioptas and F. McSherry. Fast computation of low-rank matrix approxima-

tions. J. ACM, 54(2):1–19, 2007.

[4] C. C. Aggarwal, N. Ta, J. Wang, J. Feng, and M. J. Zaki. Xproj: a framework for

projected structural clustering of XML documents. In Proceedings of the 13th ACM

SIGKDD international conference on Knowledge discovery and data mining, pages

46–55. ACM, San Jose, California, USA, 2007.

[5] C. C. Aggarwal and H. Wang. Graph data management and mining: A survey of

algorithms and applications. In Managing and Mining Graph Data, pages 13–68.

2010.

[6] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery

of association rules. In Advances in knowledge discovery and data mining, pages

307–328. American Association for Artificial Intelligence, 1996.


[7] I. Altingovde, D. Atilgan, and A. Ulusoy. Exploiting index pruning methods for

clustering XML collections. In S. Geva, J. Kamps, and A. Trotman, editors, Focused

Retrieval and Evaluation, volume 6203 of Lecture Notes in Computer Science, pages

379–386. Springer Berlin / Heidelberg, 2010.

[8] A. Anagnostopoulos, A. Dasgupta, and R. Kumar. Approximation algorithms for co-

clustering. In Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART

symposium on Principles of database systems (PODS), pages 201–210, New York,

NY, USA, 2008. ACM.

[9] R. Anderson. Professional XML. Wrox Press Ltd, Birmingham, England, 2000.

[10] P. Antonellis, C. Makris, and N. Tsirakis. XEdge: clustering homogeneous and

heterogeneous XML documents using edge summaries. In Proceedings of the 2008

ACM symposium on Applied computing, SAC ’08, pages 1081–1088, New York, NY,

USA, 2008. ACM.

[11] H. Arimura and T. Uno. An output-polynomial time algorithm for mining frequent

closed attribute trees. In S. Kramer and B. Pfahringer, editors, Inductive Logic Pro-

gramming, volume 3625 of Lecture Notes in Computer Science, pages 1–19. Springer

Berlin / Heidelberg, 2005.

[12] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient

substructure discovery from large semi-structured data. In SIAM International Con-

ference on Data Mining (SDM), 2002.

[13] T. Asai, H. Arimura, T. Uno, and S. Nakano. Discovering frequent substructures in

large unordered trees. In The 6th International Conference on Discovery Science,

2003.


[14] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. A generalized

maximum entropy approach to bregman co-clustering and matrix approximation.

In Proceedings of the tenth ACM SIGKDD international conference on Knowledge

discovery and data mining (KDD), pages 509–514, New York, NY, USA, 2004. ACM.

[15] M. W. Berry, Z. Drmac, and E. R. Jessup. Matrices, vector spaces, and information

retrieval. SIAM Review, 41(2):335–362, 1999.

[16] E. Bingham and H. Mannila. Random projection in dimensionality reduction: ap-

plications to image and text data. In Proceedings of the seventh ACM SIGKDD

international conference on Knowledge discovery and data mining (KDD), pages

245–250, New York, NY, USA, 2001. ACM.

[17] D. Braga, A. Campi, S. Ceri, M. Klemettinen, and P. L. Lanzi. A tool for extracting

XML association rules. In Proceedings. 14th IEEE International Conference on Tools

with Artificial Intelligence, 2002. (ICTAI 2002)., pages 57–64, 2002.

[18] B. Bringmann. To see the wood for the trees: Mining frequent tree patterns, 2006.

[19] L. Candillier, L. Denoyer, P. Gallinari, M. C. Rousset, A. Termier, and A. M. Ver-

coustre. Mining XML documents. 2007.

[20] S. H. Cha. Comprehensive survey on distance/similarity measures between probabil-

ity density functions. International J. Mathematical Models and Methods in Applied

Sciences, 1(4):300–307, 2007.

[21] Y. Chi, R. R. Muntz, S. Nijssen, and J. N. Kok. Frequent subtree mining - an

overview. Fundamenta Informaticae, 66(1-2):161–198, 2004.

[22] Y. Chi, S. Nijssen, R. R. Muntz, and J. N. Kok. Frequent subtree mining - an

overview. Fundamenta Informaticae, 66:161–198. IOS Press,

2005.


[23] Y. Chi, Y. Yang, and R. R. Muntz. Indexing and mining free trees. In IEEE

International Conference on Data Mining (ICDM), pages 509–512, 2003.

[24] Y. Chi, Y. Yang, Y. Xia, and R. R. Muntz. CMTreeMiner: Mining both closed and

maximal frequent subtrees. In The Eighth Pacific Asia Conference on Knowledge

Discovery and Data Mining (PAKDD). 2004.

[25] R. Cover. XML applications and initiatives. http://xml.coverpages.org/xmlApplications.html, 2005.

[26] T. Dalamagas, T. Cheng, K. Winkel, and T. Sellis. A methodology for clustering

XML documents by structure. Information Systems, 31(3):187–228, 2006.

[27] T. Dalamagas, T. Cheng, K. Winkel, and T. K. Sellis. Clustering XML documents

by structure. In SETN, 2004.

[28] S. Dasgupta and A. Gupta. An elementary proof of the Johnson-Lindenstrauss lemma.

Technical report, 1999.

[29] C. De Vries and S. Geva. Document clustering with k-tree. In S. Geva, J. Kamps,

and A. Trotman, editors, Advances in Focused Retrieval, volume 5631 of Lecture

Notes in Computer Science, pages 420–431. Springer Berlin / Heidelberg, 2009.

[30] L. Denoyer and P. Gallinari. Report on the XML mining track at INEX 2005

and INEX 2006: categorization and clustering of XML documents. SIGIR Forum,

41(1):79–90, 2007.

[31] L. Denoyer, P. Gallinari, and A. M. Vercoustre. Report on the XML mining track

at INEX 2005 and INEX 2006. In INEX 2006, pages 432–443, Dagstuhl Castle,

Germany, 2006.


[32] M. Deshpande, M. Kuramochi, and G. Karypis. Frequent sub-structure-based ap-

proaches for classifying chemical compounds. In IEEE International Conference on

Data Mining (ICDM), pages 35–42, 2003.

[33] M. M. Deza and E. Deza. Encyclopedia of distances. Springer, 1st edition, July 2009.

[34] R. Diestel. Graph Theory (Graduate Texts in Mathematics). Springer, 3rd edition,

August 2005.

[35] D. Dongjie, M. Zhixin, X. Yusheng, and L. Li. Mining tree patterns using frequent

2-subtree checking. In Second International Symposium on Knowledge Acquisition

and Modeling (KAM ’09), volume 2, pages 162–165, November 2009.

[36] A. Doucet and M. Lehtonen. Unsupervised classification of text-centric XML docu-

ment collections. In 5th International Workshop of the Initiative for the Evaluation

of XML Retrieval, INEX, pages 497–509, 2006.

[37] E. Acar and B. Yener. Unsupervised multiway data analysis: A literature survey.

IEEE Transactions on Knowledge and Data Engineering, 21(1):6–20, 2009.

[38] G. Flake, R. Tarjan, and K. Tsioutsiouliklis. Graph clustering and minimum cut

trees. Internet Mathematics, 1(4):385–408, 2004.

[39] C. Fox. A stop list for general text. ACM SIGIR Forum, 24(1-2):19–35, 1989.

[40] W. B. Frakes and C. Fox. Strength and similarity of affix removal stemming algo-

rithms. ACM SIGIR Forum, 37(1):26–30, 2003.

[41] N. Fuhr, M. Lalmas, A. Trotman, and J. Kamps. Focused Access to XML documents.

In 6th International Workshop of the Initiative for the Evaluation of XML Retrieval,

INEX 2007. Selected Papers, Lecture Notes in Computer Science (LNCS), Dagstuhl

Castle, Germany, December 17-19, 2007.


[42] M. Garey and D. Johnson. Computers and intractability : a guide to the theory of

NP-completeness. W. H. Freeman, San Francisco, 1979.

[43] M. N. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: a

system for extracting document type descriptors from XML documents. In ACM SIGMOD Record, volume 29, pages 165–176. ACM, 2000.

[44] S. Guha, R. Rastogi, and K. Shim. ROCK: a robust clustering algorithm for categor-

ical attributes. In Proceedings of the 15th International Conference on Data Engineering (ICDE), pages 512–521, March 1999.

[45] M. Hagenbuchner, A. Tsoi, A. Sperduti, and M. Kc. Efficient clustering of structured

documents using graph self-organizing maps. In N. Fuhr, J. Kamps, M. Lalmas, and

A. Trotman, editors, Focused Access to XML Documents, volume 4862 of Lecture

Notes in Computer Science, pages 207–221. Springer Berlin / Heidelberg, 2008.

[46] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.

In Proceedings of the 2000 ACM SIGMOD international conference on Management

of data, pages 1–12. ACM Press, Dallas, Texas, United States, 2000.

[47] D. Harman. How effective is suffixing? J. American Society for Information Science,

42(1):7–15, 1991.

[48] J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraphs in the

presence of isomorphism. In Proceedings of the IEEE International Conference on

Data Mining, pages 549–559. IEEE Computer Society, 2003.

[49] H. Huang, C. Ding, D. Luo, and T. Li. Simultaneous tensor subspace selection and

clustering: The equivalence of high order SVD and k-means clustering. In Proceeding

of the 14th ACM SIGKDD international conference on Knowledge discovery and

data mining (KDD), pages 327–335, New York, NY, USA, 2008. ACM.


[50] J. H. Hwang and K. H. Ryu. Clustering and retrieval of XML documents by structure.

In O. Gervasi, M. Gavrilova, V. Kumar, A. Laganà, H. Lee, Y. Mun, D. Taniar,

and C. Tan, editors, Computational Science and Its Applications – ICCSA 2005,

volume 3481 of Lecture Notes in Computer Science, pages 925–935. Springer Berlin

/ Heidelberg, 2005.

[51] A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining

frequent substructures from graph data. In Proceedings of the 4th European Con-

ference on Principles and Practice of Knowledge Discovery in Databases (PKDD),

pages 13–23, Lyon, France, 2000.

[52] A. Inokuchi, T. Washio, and H. Motoda. A general framework for mining frequent

subgraphs from labeled graphs. Fundamenta Informaticae, 66(1-2):53–82, 2005.

[53] A. Inokuchi, T. Washio, K. Nishimura, and H. Motoda. A fast algorithm for mining

frequent connected subgraphs. Technical report, IBM Research, Tokyo Research

Laboratory, 2002.

[54] J. Paik, D. R. Shin, and U. Kim. EFoX: A scalable method for extracting frequent subtrees.

In International Conference on Computational Science, 2005.

[55] S. Jegelka, S. Sra, and A. Banerjee. Approximation algorithms for tensor clustering.

In Algorithmic Learning Theory, pages 368–383. 2009.

[56] J. W. Jian, W. K. Cheung, and X. O. Chen. Integrating element and term semantics

for similarity-based XML document clustering. IEEE / WIC / ACM International

Conference on Web Intelligence (WI), pages 222–228, 2005.

[57] J. Sun, D. Tao, S. Papadimitriou, P. S. Yu, and C. Faloutsos. Incremental tensor

analysis: Theory and applications. ACM Trans. Knowl. Discov. Data, 2(3):1–37,

2008.


[58] T. G. Kolda and J. Sun. Scalable tensor decompositions for multi-aspect data mining.

In ICDM 2008: Proceedings of the 8th IEEE International Conference on Data

Mining, pages 363–372, December 2008.

[59] G. Karypis. CLUTO – software for clustering high-dimensional datasets, 2007.

[60] M. Kc, M. Hagenbuchner, A. C. Tsoi, F. Scarselli, A. Sperduti, and M. Gori. XML

document mining using contextual Self-organizing Maps for structures. In 5th In-

ternational Workshop of the Initiative for the Evaluation of XML Retrieval, INEX,

pages 510–524, Dagstuhl Castle, Germany, 2005.

[61] H. A. L. Kiers. Towards a standardized notation and terminology in multiway

analysis. J. Chemometrics, 14(3):105–122, 2000.

[62] N. Klarlund, T. Schwentick, and D. Suciu. XML: Model, schemas, types, logics,

and queries. In J. Chomicki, R. van der Meyden, and G. Saake, editors, Logics for

Emerging Applications of Databases, pages 1–41. Springer, 2003.

[63] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM

Review, 51(3):455–500, 2009.

[64] S. B. Kotsiantis and P. E. Pintelas. Recent advances in clustering: A brief survey.

WSEAS Transactions on Information Science and Applications, 1:73–81, 2004.

[65] L. Kurgan, W. Swiercz, and K. J. Cios. Semantic mapping of XML tags using

inductive machine learning. In 11th International Conference on Information and

Knowledge Management (CIKM), Virginia, USA, 2002.

[66] S. Kutty, R. Nayak, and Y. Li. XML data mining: Process and applications. In

M. Song and Y. F. Wu, editors, Handbook of Research on Text and Web Mining

Technologies. Idea Group Inc., USA, 2008.


[67] S. Kutty, R. Nayak, and Y. Li. HCX: an efficient hybrid clustering approach for XML

documents. In DocEng ’09: Proceedings of the 9th ACM symposium on Document

engineering, pages 94–97, New York, NY, USA, 2009. ACM.

[68] S. Kutty, R. Nayak, and Y. Li. XCFS: an XML documents clustering approach using

both the structure and the content. In Proceedings of the 18th ACM conference on

Information and knowledge management (CIKM), CIKM ’09, pages 1729–1732, New

York, NY, USA, 2009. ACM.

[69] S. Kutty, T. Tran, R. Nayak, and Y. Li. Clustering XML documents using closed

frequent subtrees: A structural similarity approach. In N. Fuhr, J. Kamps, M. Lal-

mas, and A. Trotman, editors, Focused Access to XML Documents, volume 4862 of

Lecture Notes in Computer Science, pages 183–194. Springer Berlin / Heidelberg,

2008.

[70] L. D. Lathauwer, B. D. Moor, and J. Vandewalle. A multilinear singular value

decomposition. SIAM J. Matrix Anal. Appl., 21(4):1253–1278, 2000.

[71] D. Lee and W. W. Chu. Comparative analysis of six XML schema languages. In

ACM SIGMOD Record, volume 29, pages 76–87, 2000.

[72] M. L. Lee, L. H. Yang, W. Hsu, and X. Yang. XClust: Clustering XML schemas for

effective integration. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM), Virginia, USA, November 2002.

[73] H. P. Leung, F. L. Chung, S. C. F. Chan, and R. Luk. XML document clustering

using common XPath. In International Workshop on Challenges in Web Information

Retrieval and Integration (WIRI ’05), pages 91–96, 2005.

[74] J. Lovins. Development of a stemming algorithm. Mechanical Translation and

Computational Linguistics, 11(1):23–31, 1968.


[75] M. Sahlgren. An introduction to random indexing. In Methods and Applications of

Semantic Indexing Workshop at the 7th International Conference on Terminology

and Knowledge Engineering, TKE 2005, 2005.

[76] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval.

Cambridge University Press, NY, USA, 2008.

[77] Q. Mei and Y. Liu. A tree structure frequent pattern mining algorithm based on

hybrid search strategy and bitmap. In IEEE International Conference on Intelligent

Computing and Intelligent Systems, 2009. ICIS 2009., volume 1, pages 452–456,

2009.

[78] A. Mirzal. Weblog clustering in multilinear algebra perspective. International Jour-

nal of Information Technology, 15(1), 2009.

[79] C. H. Moh, E. P. Lim, and W. K. Ng. DTD-Miner: a tool for mining DTD from XML

documents. In the 2nd International Workshop on Advanced Issues of E-Commerce

and Web-Based Information Systems, 2000.

[80] D. Muti and S. Bourennane. Survey on tensor signal algebraic filtering. Signal

Processing, 87(2):237–249, 2007.

[81] R. Nayak, C. de Vries, S. Kutty, and S. Geva. Report on the XML mining track

clustering task at INEX 2009. In S. Geva, J. Kamps, and A. Trotman, editors,

Focused Retrieval and Evaluation. Springer, 2010.

[82] R. Nayak and W. Iryadi. XMine: A methodology for mining XML structure. In

X. Zhou, J. Li, H. Shen, M. Kitsuregawa, and Y. Zhang, editors, Frontiers of WWW

Research and Development - APWeb 2006, volume 3841 of Lecture Notes in Com-

puter Science, pages 786–792. Springer Berlin / Heidelberg, 2006.


[83] R. Nayak and W. Iryadi. XML schema clustering with semantic and hierarchical

similarity measures. Knowledge-Based Systems, 20(4):336–349, 2007.

[84] R. Nayak and T. Tran. A progressive clustering algorithm to group the XML Data

by structural and semantic Similarity. International Journal of Pattern Recognition

and Artificial Intelligence (IJPRAI), 21(4):723–743, 2007.

[85] R. Nayak and F. B. Xia. Automatic integration of heterogeneous XML schemas. In

Proceedings of the International Conferences on Information Integration and Web-

based Applications and Services, 2004.

[86] R. Nayak and S. Xu. XCLS: A fast and effective clustering algorithm for heterogeneous XML documents. In Proceedings of the Pacific-Asia Conference on Knowledge

Discovery and Data Mining (PAKDD), pages 292–302, Singapore, 2006.

[87] A. Nierman and H. V. Jagadish. Evaluating structural similarity in XML documents.

In Fifth International Workshop on the Web and Databases (WebDB), Madison, Wisconsin, USA, 2002.

[88] S. Nijssen and J. Kok. Efficient discovery of frequent unordered trees. In Proceedings

of International Workshop on Mining Graphs, Trees, and Sequences, 2003.

[89] C. D. Paice. Another stemmer. ACM SIGIR Forum, 24(3):56–61, 1990.

[90] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent semantic

indexing: a probabilistic analysis. In Proceedings of the seventeenth ACM SIGACT-

SIGMOD-SIGART symposium on Principles of database systems, PODS ’98, pages

159–168, New York, NY, USA, 1998. ACM.

[91] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[92] C. J. van Rijsbergen. Information Retrieval. Butterworths, 1979.


[93] G. Salton and M. J. McGill. Introduction to modern information retrieval. McGraw-

Hill, Inc., New York, NY, USA, 1986.

[94] G. Salton and M. J. McGill. Introduction to modern information retrieval. McGraw-

Hill Book Co., New York, 1989.

[95] G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing.

Communications of the ACM, 18(11):613–620, 1975.

[96] T. M. Selee, T. G. Kolda, W. P. Kegelmeyer, and J. D. Griffin. Extracting clusters

from large datasets with multiple similarity measures using IMSCAND. In M. L.

Parks and S. S. Collis, editors, CSRI Summer Proceedings 2007, Technical Report

SAND2007-7977, Sandia National Laboratories, Albuquerque, NM and Livermore,

CA, pages 87–103, December 2007.

[97] Y. Shen and B. Wang. Clustering schemaless XML documents. In 11th International

Conference on Cooperative Information System, 2003.

[98] J. M. Smith and R. Stutely. SGML: The User’s Guide to ISO 8879. Ellis Horwood

Ltd, Chichester, 1988.

[99] S. Sol. Advantages of XML:moving beyond format, 1998.

[100] I. Stuart. XML Schema: a brief introduction, 2004.

[101] F. M. Suchanek, A. S. Varde, R. Nayak, and P. Senellart. The hidden web, XML and

the Semantic Web: scientific data management perspectives. In Proceedings of the 14th International Conference on Extending Database Technology (EDBT), pages 534–537. ACM, 2011.

[102] J. T. Sun, Z. Chen, H. J. Zeng, Y. C. Lu, C. Y. Shi, and W. Y. Ma. Supervised latent

semantic indexing for document categorization. In IEEE International Conference

on Data Mining (ICDM), pages 535–538, 2004.


[103] A. Tagarelli and S. Greco. Toward semantic XML clustering. In Proceedings of

SIAM International Conference on Data Mining (SDM), pages 188–199, 2006.

[104] A. Tagarelli and S. Greco. Semantic clustering of XML documents. ACM Transac-

tions on Information Systems, 28(1):1–56, 2009.

[105] H. Tan, T. S. Dillon, F. Hadzic, L. Feng, and E. Chang. MB3-Miner: mining eMBedded sub-TREEs using Tree Model Guided candidate generation. In Proceedings

of the 1st International Workshop on Mining Complex Data, held in conjunction

with ICDM05, 2005.

[106] H. Tan, T. Dillon, L. Feng, E. Chang, and F. Hadzic. X3-Miner: Mining patterns

from XML database. In Proceedings of Data Mining ’05, Skiathos, 2005.

[107] A. Termier, M.-C. Rousset, and M. Sebag. TreeFinder: a first step towards XML

data mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM), pages 450–457, 2002.

[108] A. Termier, M.-C. Rousset, M. Sebag, K. Ohara, T. Washio, and H. Motoda. Effi-

cient mining of high branching factor attribute trees. In Fifth IEEE International

Conference on Data Mining (ICDM), 2005.

[109] T. Tran, S. Kutty, and R. Nayak. Utilizing the structure and content information for

XML document clustering. In S. Geva, J. Kamps, and A. Trotman, editors, Advances

in Focused Retrieval, volume 5631 of Lecture Notes in Computer Science, pages

460–468. Springer Berlin / Heidelberg, 2009.

[110] T. Tran and R. Nayak. Evaluating the performance of the XML document clustering

by structure only. In 5th International Workshop of the Initiative for the Evaluation

of XML Retrieval, INEX, pages 473–484, Dagstuhl Castle, Germany, 2006.


[111] T. Tran, R. Nayak, and P. Bruza. Document clustering using incremental and

pairwise approaches. In N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman, editors,

Focused Access to XML Documents, volume 4862 of Lecture Notes in Computer

Science, pages 222–233. Springer Berlin / Heidelberg, 2008.

[112] C. E. Tsourakakis. MACH: Fast randomized tensor decompositions. In The SIAM

Data Mining Conference (SDM), pages 689–700, Columbus, Ohio, USA, 2010.

[113] L. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika,

31:279–311, 1966.

[114] A. M. Vercoustre, M. Fegas, S. Gul, and Y. Lechevallier. A flexible structured-

based representation for XML document mining. In Advances in XML Information

Retrieval and Evaluation, pages 443–457. 2006.

[115] R. A. Wagner and M. J. Fischer. The String-to-String Correction Problem. J. ACM,

21:168–173, January 1974.

[116] J. W. W. Wan and G. Dobbie. Mining association rules from XML data using

XQuery. In Proceedings of the second workshop on Australasian information security,

Data Mining and Web Intelligence, and Software Internationalisation, pages 169–

174. Australian Computer Society, Dunedin, New Zealand, 2004.

[117] C. Wang, M. Hong, J. Pei, H. Zhou, W. Wang, and B. Shi. Efficient pattern-growth

methods for frequent tree pattern mining. In H. Dai, R. Srikant, and C. Zhang,

editors, Advances in Knowledge Discovery and Data Mining, volume 3056 of Lecture

Notes in Computer Science, pages 441–451. Springer Berlin / Heidelberg, 2004.

[118] J. Wang and J. Han. BIDE: Efficient mining of frequent closed sequences. In

Proceedings of the 20th International Conference on Data Engineering, pages 79–90.

IEEE Computer Society, 2004.


[119] Y. Wang, D. J. DeWitt, and J. Y. Cai. X-Diff: An effective change detection algo-

rithm for XML documents. In IEEE International Conference on Data Engineering,

2003.

[120] E. Wilde and R. J. Glushko. XML fever. Communications of the ACM, 51(7):40–46, 2008.

[121] P. Willett. The porter stemming algorithm: Then and now. Program: Electronic

Library and Information Systems, 40(3):219–223, 2006.

[122] C. N. Win and K. H. S. Hla. Mining frequent patterns from XML data. In 6th Asia-

Pacific Symposium on Information and Telecommunication Technologies (APSITT),

pages 208–212, 2005.

[123] Y. Xiao, J. F. Yao, Z. Li, and M. H. Dunham. Efficient data mining for maximal

frequent subtrees. In Proceedings of the IEEE International Conference on Data

Mining (ICDM), pages 379–386, Washington, DC, USA, 2003. IEEE Computer So-

ciety.

[124] G. Xing, Z. Xia, and J. Guo. Clustering XML documents based on structural

similarity. In R. Kotagiri, P. Krishna, M. Mohania, and E. Nantajeewarawat, editors,

Advances in Databases: Concepts, Systems and Applications, volume 4443 of Lecture

Notes in Computer Science, pages 905–911. Springer Berlin / Heidelberg, 2007.

[125] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix

factorization. In Proceedings of the 26th annual international ACM SIGIR conference

on Research and development in information retrieval, SIGIR ’03, pages 267–273,

New York, NY, USA, 2003. ACM.

[126] X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset patterns: a profile-

based approach. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (KDD), pages 314–323, New York,

NY, USA, 2005. ACM.

[127] X. Yan, J. Han, and R. Afshar. CloSpan: Mining closed sequential patterns in large

datasets. In SIAM International Conference on Data Mining (SDM), pages 166–177,

2003.

[128] J. Yang and X. Chen. A semi-structured document model for text mining. J.

Computer Science and Technology, 17(5):603–610, 2002.

[129] J. Yang, W. K. Cheung, and X. Chen. Learning the kernel matrix for XML document

clustering. In e-Technology, e-Commerce and e-Service, 2005.

[130] Y. Yang, X. Guan, and J. You. CLOPE: a fast and effective clustering algorithm

for transactional data. In Proceedings of the eighth ACM SIGKDD international

conference on Knowledge discovery and data mining (KDD), pages 682–687, New

York, NY, USA, 2002. ACM.

[131] J. Yao and N. Zerida. Rare patterns to improve path-based clustering of Wikipedia

articles. In N. Fuhr, M. Lalmas, and A. Trotman, editors, Pre-proceedings of the

Sixth Workshop of Initiative for the Evaluation of XML Retrieval, pages 224–231,

Dagstuhl, Germany, 2007.

[132] J. Yao and N. Zerida. Rare patterns to improve path-based clustering. In 6th

International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX

2007, Dagstuhl Castle, Germany, Dec 17-19, 2007.

[133] J. Yoon, V. Raghavan, and L. Kerschberg. BitCube: clustering and statistical

analysis for XML documents. In Thirteenth International Conference on Scientific

and Statistical Database Management, Fairfax, Virginia, 2001.


[134] M. J. Zaki. Efficiently mining frequent trees in a forest. In Proceedings of the eighth

ACM SIGKDD international conference on Knowledge discovery and data mining

(KDD), pages 71–80. ACM Press, Edmonton, Alberta, Canada, 2002.

[135] M. J. Zaki. Efficiently mining frequent trees in a forest: Algorithms and applications.

IEEE Transactions on Knowledge and Data Engineering, 17(8):1021–1035, 2005.

[136] M. J. Zaki and C. C. Aggarwal. XRules: an effective structural classifier for XML

data. In Proceedings of the ninth ACM SIGKDD international conference on Knowl-

edge discovery and data mining (KDD), pages 316–325. ACM Press, Washington,

D.C., 2003.

[137] S. Zhang, M. Hagenbuchner, A. Tsoi, and A. Sperduti. Self organizing maps for the

clustering of large sets of labeled graphs. In Advances in Focused Retrieval, volume 5631

of Lecture Notes in Computer Science, pages 469–481. Springer Berlin / Heidelberg,

2009.

[138] W. S. Zhang, D. X. Liu, and J. P. Zhang. A novel method for mining frequent

subtrees from XML data. In Z. R. Yang, H. Yin, and R. Everson, editors, Intelligent

Data Engineering and Automated Learning (IDEAL 2004), volume 3177 of Lecture

Notes in Computer Science, pages 300–305. Springer Berlin / Heidelberg, 2004.

[139] Y. Zhao and G. Karypis. Data clustering in life sciences. Molecular Biotechnology,

31:55–80, 2005.

[140] L. Zou, Y. Lu, H. Zhang, and R. Hu. PrefixTreeESpan: a pattern growth algorithm

for mining embedded subtrees. In K. Aberer, Z. Peng, E. Rundensteiner, Y. Zhang,

and X. Li, editors, Web Information Systems WISE 2006, volume 4255 of Lecture

Notes in Computer Science, pages 499–505. Springer Berlin / Heidelberg, 2006.


[141] L. Zou, Y. Lu, H. Zhang, R. Hu, and C. Zhou. Mining frequent induced subtrees by

prefix-tree-projected pattern growth. In Seventh International Conference on Web-

Age Information Management Workshops, 2006. WAIM ’06., pages 18–26, 2006.


Appendix A

Details of the real-life datasets

This appendix details the two category sets used in the INEX 2009 dataset. Table A.1 lists the categories derived from the Wikipedia categories, and Table A.2 lists the categories derived from the ad hoc queries, ordered by topic Id.


Table A.1: Details of all the categories in the INEX 2009 dataset using Wikipedia categories

Id  Category                 Number of documents
1   People                   15359
2   Society                  12663
3   Geography                9065
4   Culture                  9033
5   Politics                 8589
6   History                  8035
7   Nature                   5788
8   Countries                5724
9   Applied sciences         5568
10  Humanities               5205
11  Business                 3734
12  Technology               3584
13  Science                  3378
14  Arts                     2837
15  Historical eras          2780
16  Health                   2760
17  Entertainment            2521
18  Belief                   2417
19  Life                     2301
20  Language                 2140
21  Environment              2116
22  Places                   1935
23  Fields of history        1876
24  Human geography          1564
25  Recreation               1431
26  Disambiguation pages     1368
27  Information              1342
28  Companies                1330
29  Vocabulary               1286
30  Pharmaceutical sciences  1238
31  Religion                 1216
32  Science stubs            1202
33  Law                      1158
34  Agriculture              1148
35  Biology                  1143
36  Literature               1131
37  Debuts                   1116
38  Military                 1095
39  Space                    1049
40  Government               1027


Table A.2: Details of all categories in the INEX 2009 dataset using ad hoc queries, ordered by the topic Id

Id       Query Title                                                        # Documents
2009001  Nobel prize                                                        8
2009002  Best movie                                                         2
2009005  Chemists physicists scientists alchemists periodic table elements  82
2009006  Opera singer Italian Spanish -soprano                              7
2009010  Applications bayesian networks bioinformatics                      2
2009011  Olive oil health benefit                                           8
2009012  Vitiligo pigment disorder cause treatment                          7
2009013  Native American Indian wars against colonial Americans             33
2009020  IBM computer                                                       5
2009022  Szechwan dish food cuisine                                         7
2009023  “Plays of Shakespeare”+Macbeth                                     16
2009026  Generalife gardens                                                 2
2009028  Fastest speed bike scooter car motorcycle                          6
2009033  Al-Andalus taifa kingdoms                                          8
2009035  Bermuda Triangle                                                   9
2009036  Notting Hill Film actors                                           22
2009039  Roman architecture                                                 27
2009040  Steam engine                                                       25
2009041  The Scythians                                                      3
2009042  Sun Java                                                           1
2009043  NASA missions                                                      135
2009046  Penrose tiles tiling theory                                        5
2009047  “Kali’s child” criticisms reviews Psychoanalysis of Ramakrishna’s mysticism  2
2009051  Rabindranath Tagore Bengali literature                             18
2009054  Tampere region tourist attractions                                 5
2009055  European union expansion                                           24
2009061  France second world war normandy                                   9
2009062  Social network group selection                                     1
2009063  D-Day normandy invasion                                            27
2009064  Stock exhange insider trading crime                                9
2009065  Sunflowers Vincent van Gogh                                        1
2009066  Folk metal groups finland                                          1
2009068  China great wall                                                   2
2009070  Health care reform plan                                            2
2009071  Earthquake prediction                                              2
2009073  Web link network analysis                                          1
2009076  Sociology and social issues and aspects in science fiction         14
2009079  Dangerous paraben bisphenol-A                                      3
2009082  South african nature reserve                                       1
2009087  History bordeaux                                                   2
2009089  World wide web history                                             6
2009092  Ski +waxing -water -wave                                           1
2009093  French revolution                                                  40
2009096  Eiffel                                                             9
2009104  Lunar mare formation mechanism                                     6
2009105  Musicians Jazz                                                     10
2009108  Sustainability indicators metrics                                  8
2009109  Circus acts skills                                                 4
2009110  Paul is dead hoax theory                                           2
2009111  Europe solar power facility                                        2
2009113  Toy Story Buzz Lightyear 3D rendering Computer Generated Imagery   2
2009115  virtual museums                                                    2


Appendix B

Empirical Evaluation of Frequent Mining Results

Table B.1: Runtime comparison of Length Constrained Subtrees on F5 dataset

Min supp  Const  PCITMiner-Const  PMITMiner-Const  PCETMiner-Const  PMETMiner-Const
2         3      0.33             0.33             0.6              0.6
2         5      0.43             0.43             0.6              0.6
2         7      0.44             0.45             0.61             0.61
2         9      0.45             0.45             0.59             0.59
2         11     0.45             0.44             0.58             0.58
4         3      0.3              0.32             0.34             0.34
4         5      0.36             0.36             0.4              0.4
4         7      0.37             0.37             0.4              0.4
4         9      0.37             0.37             0.39             0.39
4         11     0.37             0.37             0.41             0.41
6         3      0.24             0.24             0.24             0.24
6         5      0.2              0.2              0.26             0.26
6         7      0.25             0.25             0.26             0.26
6         9      0.25             0.25             0.26             0.26
6         11     0.26             0.26             0.26             0.26
8         3      0.25             0.25             0.24             0.24
8         5      0.24             0.24             0.24             0.24
8         7      0.24             0.24             0.24             0.24
8         9      0.24             0.24             0.23             0.23
8         11     0.24             0.24             0.24             0.24
10        3      0.22             0.22             0.2              0.2
10        5      0.23             0.23             0.19             0.19
10        7      0.23             0.23             0.18             0.18
10        9      0.23             0.23             0.2              0.2
10        11     0.23             0.23             0.2              0.2


Table B.2: Length Constrained Subtrees in F5 dataset

Min supp  Const  PCITMiner-Const  PMITMiner-Const  PCETMiner-Const  PMETMiner-Const
2         3      17               9                21               23
2         5      17               7                21               10
2         7      17               7                21               10
2         9      17               7                21               10
2         11     17               7                21               10
4         3      8                6                14               11
4         5      9                4                11               5
4         7      9                4                11               5
4         9      9                4                11               5
4         11     9                4                11               5
6         3      5                3                8                5
6         5      6                2                7                2
6         7      6                2                7                2
6         9      6                2                7                2
6         11     6                2                7                2
8         3      4                3                6                4
8         5      4                2                5                2
8         7      4                2                5                2
8         9      4                2                5
8         11     4                2                5                2
10        3      3                2                4                2
10        4      5                2                4                2
10        6      7                2                4                2
10        8      9                2                4                2
10        11     3                2                4                2
