
  • Jonas Busschop, Tim Gaspard

    A framework for automated document storage and annotation

    Academic year 2016-2017
    Faculty of Engineering and Architecture
    Chair: Prof. dr. ir. Herwig Bruneel
    Department of Telecommunications and Information Processing

    Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

    Counsellors: Dr. ir. Mike Vanderroost, Dieter Adriaenssens
    Supervisors: Prof. dr. ir. Antoon Bronselaer, Prof. dr. Guy De Tré

  • Preface

    This work is the result of a year-long process in which we received the help of several people.

    First of all, we would like to thank our supervisors, Prof. dr. ir. Antoon Bronselaer and Prof.

    dr. Guy De Tré, as well as our counsellors, Dr. ir. Mike Vanderroost and Dieter Adriaenssens.

    They gave us the guidance we required to bring this work to a successful conclusion, while still

    allowing us the freedom to solve any problems with our own creativity.

    Additionally, we would like to thank our family for their constant support, not only during

    our dissertation year, but throughout our entire education. Jonas would also like to thank his

    Dinnerclub and dormitory friends, for making his final year in university an enjoyable one.

    Jonas Busschop & Tim Gaspard, June 2017

  • Usage permission

    “The authors give permission to make this master dissertation available for consultation and to

    copy parts of this master dissertation for personal use.

    In the case of any other use, the copyright terms have to be respected, in particular with regard to

    the obligation to state expressly the source when quoting results from this master dissertation.”

    Jonas Busschop & Tim Gaspard, June 2017

  • A framework for automated document storage and annotation

    By Jonas Busschop & Tim Gaspard

    Master’s dissertation submitted to obtain the academic degree of

    Master of Science in Computer Science Engineering

    Academic year 2016–2017

    Supervisors: Prof. Dr. Ir. A. Bronselaer, Prof. Dr. Ir. G. De Tré

    Counsellors: Dr. Ir. M. Vanderroost, D. Adriaenssens

    Department of Telecommunications and Information Processing

    Chair: Prof. Dr. Ir. H. Bruneel

    Faculty of Engineering and Architecture

    Ghent University

    Abstract

    In university research groups, document management is usually performed in an unstandardised manner, typically on an individual basis. Combined with staff turnover and the resulting disorganisation in their research, this makes it especially hard to retrieve old information and projects. In this work, we propose a framework for document storage with a focus on relations between files, in order to make research data easily available even after an extended period of time. Additionally, the framework has learning components that automatically connect each uploaded file to other data in the database and give it an annotation. In this manner, this work tries to provide a solution for the data management problems in research groups with a high regard for usability.

    Keywords

    Graph database, automated document tagging, content management system, machine learning, collaborative filtering

  • A framework for automated document storage and annotation

    Jonas Busschop, Tim Gaspard

    Supervisor(s): Prof. dr. ir. Antoon Bronselaer, Prof. dr. Guy De Tré
    Counsellors: Dr. ir. Mike Vanderroost, Dieter Adriaenssens

    Abstract—In university research groups, document management is usually performed in an unstandardised manner, typically on an individual basis. Combined with staff turnover and the resulting disorganisation in their research, this makes it especially hard to retrieve old information and projects. In this work, we propose a framework for document storage with a focus on relations between files, in order to make research data easily available even after an extended period of time. Additionally, the framework has learning components that automatically connect each uploaded file to other data in the database and give it an annotation. In this manner, this work tries to provide a solution for the data management problems in research groups with a high regard for usability.

    Keywords—Graph database, automated document tagging, content management system, machine learning, collaborative filtering

    I. INTRODUCTION

    Document management is increasingly difficult to implement in university research groups. The amount of data is ever increasing and staff turnover is a very relevant factor, as this leads to large amounts of often unstructured data that is difficult to process. As such, there is a need for a clear and organized way to deal with this data, so that it can be incorporated in future research.

    Content management systems offer a solution to these kinds of problems. However, data belonging to university research groups has some special requirements when it comes to management: more often than not, there is no standardized procedure to handle or structure data. Most decisions pertaining to data management are made on an individual basis. This leads to a very heterogeneous data structure: differences in folder structure, used programmes and file types, document annotations, etc.

    This work introduces a content management system that creates relations between documents and annotates them in an automated way, in an effort to provide a solution to the aforementioned problems. When a file or document is uploaded, it will automatically be given a position in a graph database, based on its relations to other files of the same project. This procedure standardizes the structure of the data, making it easier to retrieve relevant information. Additionally, the files are given an annotation in an automated way.

    One of the challenges of this solution is the heterogeneous data: because of the widely varying types of documents, it is very hard to use the file contents as features. As such, we have provided some other solutions: we have used file extensions and metadata to represent documents. Additionally, we have tested adding the content of PDF files to the feature vector, as this was the most prominent file type in the given data sets.

    In order to place a given file in the existing data structure, the system assigns a parent to this file. This parent is decided by a method based on collaborative filtering: the system finds the most similar node in different projects, then calculates what node in the current project is most similar to that node's parent.

    Any given document uploaded to the system is classified in one of five classes: Article, Data, Figure, Presentation, or Meta. The classification was tested using three different classifiers: support vector machines (SVMs), random forests, and logistic regression.

    The remainder of this article is structured as follows. Section II discusses the related work in the fields of content management systems and automated document tagging. Section III describes the methodology used in creating the system. Section IV describes the results achieved in the learning components of the system. Finally, Section V provides an overall conclusion for this work.

    II. RELATED WORK

    A. Content management systems

    The type of content management system created in this work is a DAMS (Digital Asset Management System), where the focus lies on optimal use, reuse, and repurposing of assets or data [1]. In most content management systems, the content is simply stored on a server. In order to access this content, metadata about each file is stored in a database [2] [3]. The metadata can be used to query the database, as well as locate the relevant files on the server.

    B. Database technologies

    1) Relational databases: Relational databases excel at keeping large amounts of data persistent while using a standardized relational model and query language, namely SQL (Structured Query Language). They provide access to data through the use of ACID (Atomicity, Consistency, Isolation, Durability) transactions [4]. They require knowledge of the structure of data beforehand, as they make use of a standardized schema. Traditional relational databases do not scale well horizontally.

  • NewSQL is a relatively new category of data stores, providing solutions for better horizontal scaling while retaining the relational database model. They still require a predefined schema, but compensate with very efficient and flexible querying [5].

    2) NoSQL databases: NoSQL (Not only SQL) is an umbrella term that covers a large number of very diverse database technologies that do not use the relational database model. The largest difference with relational databases with respect to this work is that they either have a flexible schema or no mandatory schema at all [4]. There are four classes of NoSQL database technologies:

    1) Key value stores
    2) Document stores
    3) Column family stores
    4) Graph stores

    In this work we use a graph store, as this type of database excels when working with highly interconnected data [4].

    C. Automated document tagging

    In computer science, automated document tagging is the problem of automatically attaching descriptive words, or tags, to a given document. These tags should make it possible for the reader to gauge what the document is about at a glance, without investing any significant amount of time, or to query the data. Machine learning techniques are used to label data with one or more tags. Generally, this is done in two different steps.

    1) Feature extraction and selection: features are extracted in one of three ways in automated document tagging systems: based on the content, based on the structure, or based on relations between metadata (graph-based). Content-based feature extraction is mostly used in text-heavy documents [6] [7]. Conversely, structure-based feature extraction is more common when working with documents that have an emphasis on structure (such as XML or HTML) [8].

    2) Document classification: the second step consists of using the selected features to classify documents in one or multiple annotation classes. The model used for classification heavily depends on the problem and data at hand, but the most commonly used classifiers in literature are support vector machines (SVMs), decision trees (or random forests using decision trees) and neural networks [6] [7] [9].

    Another way to classify the data is to make use of collaborative filtering. This is a technique often used in recommender systems, in which available information about similar users or items is used to make recommendations or decisions for the current user or item [10]. Collaborative filtering can either be user-based or item-based. In the former version, the system searches for users who share the same rating patterns as the user at hand and then uses information about these users to calculate a prediction for the active user. In the latter, a similar approach is taken with items: first similar items are found, then information about these items is used to make a prediction.

    III. METHODOLOGY

    A. Back-end structure

    The structure of the data we are working with is not known beforehand. Additionally, the structure of different projects can vary immensely and can always change (e.g., when additional documents are added). For these reasons, working with relational databases and their structured schemas is not desirable.

    In our system, the emphasis lies on the relationships between the data files, and the dataset at our disposal is not large enough to warrant the use of a column family store database. As such, we have opted to use Neo4j, an open source graph database [11].
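    To illustrate how a document and its relationships might be stored, the sketch below uses the official Neo4j Python driver. The connection details, node labels, property names and the PARENT relationship type are illustrative assumptions for this example, not the actual schema used in our system.

        from neo4j import GraphDatabase  # official Neo4j Python driver

        # Placeholder connection details; adjust to the local Neo4j instance.
        driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

        with driver.session() as session:
            # MERGE keeps the project and document nodes unique, then links the
            # document to its selected parent (here simply the project node).
            session.run(
                "MERGE (p:Project {name: $project}) "
                "MERGE (d:Document {name: $name, extension: $ext}) "
                "MERGE (d)-[:PARENT]->(p)",
                project="Example project", name="results.csv", ext="csv",
            )
        driver.close()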

    B. Feature selection

    The first features we use are the file extensions of the documents in the data set. The information about a document's extension is represented as an n-dimensional vector, where n is the total number of distinct extensions in the dataset. The values in this vector are binary, i.e., 0 if the file does not have this extension and 1 if it does.

    The second feature we have used is one we call filedepth. This is an attempt to put a value on the position of a document in the folder structure. This position is considered as relative to the document highest up in the folder hierarchy. This means the documents located closest to the root folder are given a filedepth value of zero. The filedepth values of documents located further down the folder hierarchy are incremented for each additional subfolder or hierarchy layer between said document and the root folder.
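    As a small illustration of these two features, the sketch below derives the extension one-hot vector and the filedepth value from a file path. The fixed extension vocabulary and the helper names are assumptions made for the example, not part of the thesis implementation.

        import os

        # Assumed vocabulary: the distinct extensions observed in a dataset.
        EXTENSIONS = [".tex", ".pdf", ".csv", ".png", ".docx"]

        def extension_one_hot(path):
            """n-dimensional binary vector: 1 for the file's extension, 0 elsewhere."""
            ext = os.path.splitext(path)[1].lower()
            return [1 if ext == known else 0 for known in EXTENSIONS]

        def filedepth(path, root):
            """Number of folder levels between the file and the project root folder."""
            relative = os.path.relpath(os.path.dirname(path), root)
            return 0 if relative == "." else len(relative.split(os.sep))

        # A file one subfolder below the root gets filedepth 1.
        doc = "project/figures/plot.png"
        print(extension_one_hot(doc) + [filedepth(doc, "project")])  # [0, 0, 0, 1, 0, 1]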

    As PDF files were the most prominent ones in the dataset, we have also tested features that represent the content of these files. In order to do this, we have used the Python library PDFMiner [12]. A PDF file can be seen as different building blocks that are placed in the document. These blocks can have different types of content, e.g., text, images, drawings, etc. PDFMiner extracts a value for ten different types of boxes, depending on the frequency of occurrence in the document. The normalized versions of these values are used as features.
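    The sketch below shows one way such content features could be computed with the pdfminer.six fork of PDFMiner, whose extract_pages helper yields layout objects (LTTextBox, LTFigure, LTImage, ...) whose relative frequencies can be normalized into features. The original PDFMiner version cited above exposes a lower-level API, so this is an approximation of the approach rather than the exact code used in the thesis.

        from collections import Counter
        from pdfminer.high_level import extract_pages  # pdfminer.six

        def pdf_layout_features(path):
            """Normalized frequency of each PDFMiner layout object type in a PDF file."""
            counts = Counter()
            for page in extract_pages(path):
                for element in page:  # e.g. LTTextBoxHorizontal, LTFigure, LTImage
                    counts[type(element).__name__] += 1
            total = sum(counts.values()) or 1
            return {name: n / total for name, n in counts.items()}

        print(pdf_layout_features("article.pdf"))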

    Lastly, we use information about a file's parent when annotating the documents. This information is available, as the placement in the graph database happens before the classification does. The information that we use is the file extension of a file's parent, represented in the same way as its own extension.

    C. Automated graph placement

    As mentioned earlier, the automated graph placement is performed by selecting a parent node for the current file. This problem is not well suited for regular machine learning classifiers, as the number of possible classes (i.e., every document in a given project) would exceed the feature vector dimensionality. Additionally, the classifier would have to be retrained every time a single document was added to a project. In order to avoid these problems, the system uses an algorithm based on collaborative filtering. Three different steps can be identified in this approach:

  • TABLE I: The success rate of the parent selection algorithm when using only file extensions in the feature vector.

                                  Number of trainsets
    Project                       1     2     3     4     5     6     7     Random model
    LaTeX project 1              20%   18%   22%   20%   20%   20%   12%    11%
    LaTeX project 2              20%   21%   25%   20%   25%   17%   19%    19%
    LaTeX project 3              13%   57%   55%   55%   65%   63%   63%    38%
    LaTeX project 4              30%   33%   37%   26%   41%   30%   41%    34%
    LaTeX project 5               9%   55%   54%   61%   22%   27%   26%    24%
    Bio-Engineering project 1    94%   82%   94%   84%   94%   65%   53%    53%
    Bio-Engineering project 2    34%   34%   38%   43%   36%   28%   10%    34%
    Bio-Engineering project 3    25%   25%   25%   28%   25%   24%   23%    27%

    1) Find the node in the existing database with the highest similarity to the working node.

    2) Find the parent of this node.

    3) Find the node in the working project with the highest similarity to this parent node.

    In order to find the most similar nodes, every node or file is represented by a feature vector as described in Section III-B. Then, the cosine similarity is calculated between the relevant feature vectors.

    In regular collaborative filtering applications, the solution for the most similar data point is chosen for the working data point as well. This is not applicable for this problem, however, as a node from another project can never be the parent node of the working document. As such, a second similarity measure is made between the parent node of the most similar node and all the files in the current project. This ultimately results in a parent node for the current document.
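    A simplified sketch of these three steps is given below, using scikit-learn's cosine_similarity. The data structures (lists of feature vectors paired with their parents' feature vectors) are assumptions made to keep the example self-contained; they do not correspond to the system's actual interfaces.

        import numpy as np
        from sklearn.metrics.pairwise import cosine_similarity

        def select_parent(new_vec, other_nodes, current_nodes):
            """other_nodes: (feature_vector, parent_feature_vector) pairs from other projects.
            current_nodes: (node_id, feature_vector) pairs from the working project."""
            # Step 1: most similar node among the other projects.
            other_vecs = np.array([vec for vec, _ in other_nodes])
            best = int(np.argmax(cosine_similarity(new_vec.reshape(1, -1), other_vecs)[0]))

            # Step 2: take that node's parent as a reference vector.
            reference_parent = other_nodes[best][1].reshape(1, -1)

            # Step 3: the most similar node in the working project becomes the parent.
            current_vecs = np.array([vec for _, vec in current_nodes])
            sims = cosine_similarity(reference_parent, current_vecs)[0]
            return current_nodes[int(np.argmax(sims))][0]

        # Toy usage with 3-dimensional extension vectors.
        parent_id = select_parent(
            np.array([1, 0, 0]),
            other_nodes=[(np.array([1, 0, 0]), np.array([0, 1, 0]))],
            current_nodes=[("project_node", np.array([0, 0, 1])), ("report.docx", np.array([0, 1, 0]))],
        )
        print(parent_id)  # report.docx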

    D. Document classification

    We classify each file in one of five classes, and annotate it with the corresponding tag. These classes are Article, Data, Figure, Presentation and Meta. Three different classifiers were tested: an SVM classifier, a random forest classifier and a logistic regression classifier. Since the feature vector is quite small and the feature space is not very complex, the classifiers were kept simple. We opted for a linear kernel for the SVM classifier and a small number of trees (50) for the random forest classifier. Python's Sklearn library was used to implement these techniques.
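    A minimal sketch of this comparison with scikit-learn is shown below. The placeholder feature vectors and labels stand in for the extension, filedepth and PDF content features described earlier; the classifier settings (linear kernel, 50 trees) follow the text.

        from sklearn.svm import SVC
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import accuracy_score

        # Placeholder data: tiny one-hot extension vectors with their document classes.
        X_train = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1]]
        y_train = ["Article", "Data", "Figure", "Meta"]
        X_test, y_test = [[1, 0, 0], [0, 1, 0]], ["Article", "Data"]

        classifiers = {
            "SVM (linear kernel)": SVC(kernel="linear"),
            "Random forest (50 trees)": RandomForestClassifier(n_estimators=50),
            "Logistic regression": LogisticRegression(max_iter=1000),
        }

        for name, clf in classifiers.items():
            clf.fit(X_train, y_train)
            print(name, accuracy_score(y_test, clf.predict(X_test)))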

    IV. RESULTS

    A. Training data

    The data used for training in this work consists of two distinct data sets. The first is a collection of papers written in LaTeX, the second is a collection of bigger projects provided by the faculty of Bio-Engineering at Ghent University. It is worth noting that the first data set is well structured, while the second one is not organized.

    Testing has always been done in the same way, both for parent selection and type classification. Every method and feature vector has been evaluated in 7 stages. We started with one project, then incrementally added projects until all were used for training.
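    The sketch below outlines this staged evaluation; the evaluate callable and the project representation are placeholders, since the actual evaluation code is not shown in this article.

        def staged_evaluation(train_projects, test_project, evaluate):
            """Evaluate on 1, 2, ..., n training projects and collect the success rates."""
            return [
                evaluate(train_projects[:k], test_project)
                for k in range(1, len(train_projects) + 1)
            ]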

    B. Parent selection

    Three different feature vectors were used to test the success rate of our algorithm. In a first iteration, a document was represented by a feature vector containing only its file extension. A second feature vector was made out of file extensions and the filedepth feature. Finally, in a third test, the contents of PDF files were added to the feature vector.

    1) File extensions only: The results of the first test iteration are shown in Table I. The random model percentage was obtained by picking the project node as parent every time, meaning this percentage is the relative number of times the project node was the correct parent. With the exception of the last two projects, the algorithm is a definite improvement over the random model. The two final Bio-Engineering projects perform worse here, as they are very different from the other projects in terms of file type composition. The first Bio-Engineering project still performs well because it is very small and the project node is the predominant parent.

    Two additional phenomena can be seen in this table:

    1) Some projects cause big fluctuations in percentage for one another, while others do not affect the results at all.

    2) More datasets to train on does not necessarily mean a better result; indeed, in some cases the percentage of correct parent predictions drops after adding an additional training set.

    These phenomena are caused by the nature of the algorithm: the presumption is made that a similar node will have a similar parent, which is not always true. This means that a better result will be achieved when similar data sets or projects are used for training, rather than just more data sets or projects.

    2) File extensions and filedepth: In the second testing iteration the previously explained filedepth feature was added to the feature vector. The average difference in success rate per project is portrayed in Table II. In some projects, adding the filedepth feature causes a large difference. In others, the difference in success rate is minimal. When some document structure is present in the project, filedepth will have a large impact. If the document structure of the training sets and that of the test set are similar, this impact will be a net positive. If the document structures are not alike, however, the impact will be negative.

    3) File extensions and PDF content: In the final test run, PDF content was used in the feature vector along with file extension. The average difference in success rate per project is shown in Table III. These differences are not very large. However, in some individual cases, larger fluctuations in success rate can be seen (6-10%). These fluctuations are caused by a mismatch in parents of PDF files: the results are improved in cases where similar PDF files with similar parents are present in the training sets, but are outnumbered by PDF files with different parents.

  • TABLE II: The average difference in success rate per project when comparing a feature vector with just file extensions and a feature vector containing both file extensions and filedepth.

    Project                      Mean improvement
    LaTeX project 1               5%
    LaTeX project 2              28%
    LaTeX project 3             -26%
    LaTeX project 4              27%
    LaTeX project 5              -2%
    Bio-Engineering project 1     4%
    Bio-Engineering project 2     2%
    Bio-Engineering project 3    -3%


    TABLE III: The average difference in success rate per project when comparing a feature vector with just file extensions and a feature vector containing both file extensions and PDF content.

    Project                      Average improvement
    LaTeX project 1               2%
    LaTeX project 2               0%
    LaTeX project 3               2%
    LaTeX project 4               1%
    LaTeX project 5               0%
    Bio-Engineering project 1     0%
    Bio-Engineering project 2    -1%
    Bio-Engineering project 3     1%

    C. Document classification

    1) Comparison of SVM, logistic regression and random forest: As mentioned earlier, three different classifiers were tested: a support vector machine classifier, a random forest classifier, and a logistic regression classifier. To test which one works best for our specific problem, we have used the largest project at our disposal (Bio-Engineering project 3) to train the classifiers, and have used the others as a test set. The results of this test are shown in Figure 1.

    When using very simple feature vectors, the results of all three classifiers are similar. When the feature vectors get more complex, however, the SVM classifier performs better than the other two. As such, we will be using this classifier for future results.

    To get an estimate of the difficulty of assigning tags to the data, a random model has been made. The random model is based on the type that occurs most in Bio-Engineering project 3. The percentage of this type in the projects that we test will then be the accuracy of this model for that project. Table V shows the number of tags that would be correctly predicted according to this model. Because the first 5 test sets contain almost zero files with the same tag, the random model has a poor performance. The model performs better for the next 3 datasets. From these percentages we can see that in general it is hard to predict the tags for the dataset. Even for a dataset that resembles the training set, the number of correctly predicted tags is under 50%.

    Fig. 1: Accuracy of type prediction when training with Biogenic amines part II. (a) Extension only. (b) Extension and parent extension. (c) Extension, parent extension and PDF content.

    2) File extensions only: Table IV shows the prediction accuracy when training and testing an SVM classifier with only the file extension as feature vector. Here we can see that decent results are obtained for the first 5 test sets. The accuracy with a single training set is bigger than 60%. Training with more than 1 training set even results in an accuracy that is over 85% for all of these test sets. These high results can be explained by the fact that train and test projects are similar. The other three projects, which don't resemble the first 5 training sets, obtain a far lower accuracy. For Bio-Engineering projects 2 and 3, with a large number of training sets, a decent result is still achieved (> 60%). Bio-Engineering project 1 has a very low accuracy even for a large number of training sets. This is due to the fact that it is a very small project with file extensions that do not often occur in the training sets.

    3) File extension and parent extension: Table VI shows the average difference in success rate between training a feature vector on file extension only and a feature vector with file extension and parent extension. Although the difference is small, the mean difference is negative for most projects.

  • TABLE IV: The success rate of the document classification algorithm when using only file extensions in the feature vector.

                                  Number of trainsets
    Project                       1     2     3     4     5     6     7
    LaTeX project 1              86%   86%   86%   86%   86%   86%   90%
    LaTeX project 2              63%   89%   89%   90%   90%   95%   95%
    LaTeX project 3              89%  100%  100%  100%  100%  100%  100%
    LaTeX project 4              85%  100%  100%  100%  100%  100%  100%
    LaTeX project 5              68%   87%   87%   87%   90%   90%   90%
    Bio-Engineering project 1    12%   12%   12%   12%   29%   29%   29%
    Bio-Engineering project 2    30%   38%   38%   38%   67%   67%   72%
    Bio-Engineering project 3    41%   55%   55%   55%   61%   57%   60%

    TABLE V: Random model for Bio-Engineering project 3

    Project                      Random model
    LaTeX project 1               0%
    LaTeX project 2               4%
    LaTeX project 3               0%
    LaTeX project 4               0%
    LaTeX project 5               4%
    Bio-Engineering project 1    47%
    Bio-Engineering project 2    41%
    Bio-Engineering project 3    37%

    This means that adding the parent extension to the feature vector does not increase the accuracy of the model.

    The main reason for this is that the data available for training is limited in amount and very homogeneous within each project, meaning that the relationship between parent and child is very specific to each project. Research should be performed on bigger and more heterogeneous data sets.

    TABLE VI: The average difference in success rate per project when comparing a feature vector with just file extensions and a feature vector containing both file extensions and parent extension.

    Project                      Mean improvement
    LaTeX project 1               0%
    LaTeX project 2               1%
    LaTeX project 3              -1%
    LaTeX project 4              -1%
    LaTeX project 5               1%
    Bio-Engineering project 1    -5%
    Bio-Engineering project 2    -3%
    Bio-Engineering project 3    -4%

    4) File extension, parent extension and PDF content: Now we add the PDF content to the feature vector and compare this to the results obtained from the feature vector based on only the file extension. Table VII shows the results. Here again, in general, the performance is worse than for file extension only.

    In general this means that there is little difference between using file extension, parent extension and PDF content or only file extension and parent extension. This leads to the conclusion that there is no benefit in using the PDF content in its current form. Looking at how the PDFMiner library works and how we constructed our feature vectors, there is no difference in weight between a single character in LTChar or LTAnno and a figure in LTFigure or LTImage or a drawing in LTDrawing. This leads to features that are nearly always heavily biased towards text areas, even for image-heavy PDF files.

    5) Influence of training size: Figure 2 shows the graph for the data of Bio-Engineering project 2. It shows the importance of a large and diverse training set. Contrary to the parent selection algorithm, more training data nearly always leads to better performance here.

    TABLE VII: The average difference in success rate per project when comparing a feature vector with just file extensions and a feature vector containing both file extensions, parent extension and PDF content.

    Project                      Mean improvement
    LaTeX project 1               0%
    LaTeX project 2              -1%
    LaTeX project 3              -1%
    LaTeX project 4              -1%
    LaTeX project 5               1%
    Bio-Engineering project 1    +3%
    Bio-Engineering project 2    -3%
    Bio-Engineering project 3    -3%


    Fig. 2: Relevance of dataset size for training

    V. CONCLUSION

    In this work we created a framework for document storage in university research groups, using a graph database to put the focus on the relations between documents.

    A learning component was implemented to automatically assign a parent node to a new document uploaded to this framework. There is a clear correlation between file extension and finding the correct parent node. Adding more complex features gave varying results; as such, future work should focus on finding better features and making a selection process for which data sets to use for training, as more data does not always improve results.

    Additionally, the documents added to the framework were given an annotation in an automated way. Out of the three classifiers tested, the SVM classifier provided the best results. As expected, there was a very big correlation between a document's annotation and its file extension. More complex features did not usually improve success rates, however. Future research in using the contents of prominent file types in a weighted way may improve results.


    REFERENCES

    [1] F. Frey, Digital asset management—a closer look at the literature. PhD thesis, Rochester Institute of Technology, 2004.
    [2] T. Gill, "Digital media asset management system and process," Sept. 20, 2005. US Patent 6,947,959.
    [3] K. C. Jones, C. K. Aggson, T. F. Rodriguez, B. Mosher, K. L. Levy, R. S. Hiatt, and G. B. Rhoads, "Digital asset management and linking media signals with related data using watermarks," Aug. 22, 2006. US Patent 7,095,871.
    [4] K. Grolinger, W. A. Higashino, A. Tiwari, and M. A. Capretz, "Data management in cloud environments: NoSQL and NewSQL data stores," Journal of Cloud Computing: Advances, Systems and Applications, vol. 2, no. 1, p. 22, 2013.
    [5] E. Redmond and J. R. Wilson, Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement. Pragmatic Bookshelf, 2012.
    [6] Z. Liu, C. Shi, and M. Sun, "FolkDiffusion: A graph-based tag suggestion method for folksonomies," Information Retrieval Technology, vol. 6458, pp. 231–240, 2010.
    [7] B. Choudhary and P. Bhattacharyya, "Text clustering using semantics," 1997.
    [8] C. Garboni, F. Masseglia, and B. Trousse, "Sequential pattern mining for structure-based XML document classification," in International Workshop of the Initiative for the Evaluation of XML Retrieval, pp. 458–468, Springer, 2005.
    [9] E. Gibaja and S. Ventura, "A tutorial on multilabel learning," ACM Computing Surveys (CSUR), vol. 47, no. 3, p. 52, 2015.
    [10] P. Resnick and H. R. Varian, "Recommender systems," Communications of the ACM, vol. 40, no. 3, pp. 56–58, 1997.
    [11] N. Developers, "Neo4j," Graph NoSQL Database [online], 2012.
    [12] Y. Shinyama, "PDFMiner," 2010.

  • Contents

    1 Introduction
       1.1 Problem statement
       1.2 Goal
       1.3 Thesis overview
    2 Requirement interviews
       2.1 Interview process
       2.2 Interview conclusions
    3 Content management systems: literature study
       3.1 Background information
       3.2 Database technologies
          3.2.1 Relational databases
          3.2.2 NoSQL
       3.3 Graph database technologies
       3.4 Back-end technologies
    4 Automated document tagging: literature study
       4.1 Background information
       4.2 Technical overview
          4.2.1 Feature extraction and selection
          4.2.2 Document classification
          4.2.3 Collaborative filtering
       4.3 Solutions in literature
          4.3.1 Single-label learning
          4.3.2 Multi-label learning

    5 Back-end structure
       5.1 Test architecture
          5.1.1 Used Technologies
       5.2 Layer overview
          5.2.1 REST layer (back-end)
          5.2.2 Controller layer (back-end)
          5.2.3 Conversion layer (back-end)
          5.2.4 Model layer (back-end)
          5.2.5 Persistence layer (back-end)
          5.2.6 Data layer
       5.3 Deployment
          5.3.1 Deployment diagram
    6 Automated graph placement
       6.1 Feature selection
       6.2 Parent selection
       6.3 Type classification
    7 Data and results
       7.1 Training data
          7.1.1 Pre-processing of the data
          7.1.2 First dataset
          7.1.3 Second dataset
          7.1.4 Comparison between datasets
       7.2 Results
          7.2.1 Parent selection
          7.2.2 Type classification
    8 Future work
       8.1 Additional features
          8.1.1 Front-end
          8.1.2 Authorization and authentication
          8.1.3 Back-end and database
       8.2 Improving learning component

          8.2.1 Parent selection
          8.2.2 File annotation
       8.3 Data
          8.3.1 Preprocessing
    9 Conclusion
    A Figures and tables
       A.1 SVM vs logistic regression
       A.2 SVM vs random forest
    B Interviews
       B.1 Interview process
       B.2 Interviews
          B.2.1 Interview with Iris Tavernier
          B.2.2 Interview with Carl Lachat
          B.2.3 Interview with Kurt De Mey
          B.2.4 Interview with Nathalie De Cock
          B.2.5 Interview with Angelique Vandemoortele
    C Installation manual
       C.1 Overview
       C.2 Java 1.8
       C.3 Anaconda - Python 3.6
       C.4 Neo4j 3.0.6
       C.5 Maven 4.0.0
       C.6 Postman

  • List of Figures

    3.1 The four layers of impedance mismatch in relational database management systems [4].
    3.2 A comparison of the architecture of a key-value store and the architecture of a traditional relational database [4].
    3.3 The architectural pattern of a document store database [4].
    3.4 An example of a very simple column family store with two columns as key [4].
    3.5 The architectural pattern of a graph store database [4].
    4.1 An example of a tripartite graph used in graph-based methods. The nodes here consist of users, tags and documents [19].
    4.2 Linear versus non-linear separable data.
    4.3 Linear support vector machine.
    4.4 Non-linear support vector machine.
    4.5 An example of a binary tree.
    4.6 An example of a sentence in the UNL representation [17].
    4.7 An example of a frequent sub-tree in two XML documents [18].
    4.8 The flow of information that results in a recommendation in the AutoTag algorithm [30].
    4.9 An example of a multi-label learning problem where the labels Yi are not only dependent on the features Xi, but also on each other [32].
    4.10 A visual representation of the personalized, multi-label tag recommendation problem [34].
    5.1 JSON output in Postman.
    5.2 Static diagram.
    5.3 Deployment diagram.
    6.1 PDF tree structure seen from PDFMiner [36].
    6.2 The working document is compared to all nodes belonging to other projects in the database. The most similar node is selected, encircled in red on the figure.
    6.3 The parent node of the most similar document to the working document is retrieved. In this example the parent node is the project node itself, encircled in red.
    6.4 The node retrieved in the previous step is compared to every node belonging to the working project, including the project node (as this node can be a parent too) but excluding the working document. The node with the highest cosine similarity is then chosen as the final parent node of the working document. In this case, this is the project node.
    6.5 The resulting relationship created by the parent-finding algorithm.
    7.1 The number of occurrences in percentage of every type of file extension in the first dataset.
    7.2 The number of occurrences in percentage of every type of file annotation in the first dataset.
    7.3 The number of occurrences in percentage of every type of file extension of parent nodes in the first dataset.
    7.4 The number of occurrences in percentage of every type of file annotation of parent nodes in the first dataset.
    7.5 The number of occurrences in percentage of every type of file annotation of PDF files in the first dataset.
    7.6 The number of occurrences in percentage of every type of file extension in the second dataset.
    7.7 The number of occurrences in percentage of every type of file annotation in the second dataset.
    7.8 The number of occurrences in percentage of every type of file extension of parent nodes in the second dataset.
    7.9 The number of occurrences in percentage of every type of file annotation of parent nodes in the second dataset.
    7.10 The number of occurrences in percentage of every type of file annotation of PDF files in the second dataset.
    7.11 The number of occurrences in percentage of every type of file extension in the Shrimp project.
    7.12 A comparison of document composition between the first and second dataset.
    7.13 The number of correct predictions as a function of the number of datasets used to train, with A possibilistic view on set and multiset comparison as test set.
    7.14 A comparison of the file type composition of three different datasets.
    7.15 The number of correct predictions as a function of the number of datasets used to train, with Propagation of data fusion as test set.
    7.16 The file extensions among parent nodes in two different projects.
    7.17 Comparison between the file extensions of parents of PDF files in the test set and train sets.
    7.18 Accuracy of type prediction when training with Biogenic amines part II.
    7.19 The composition of extensions of parents in Biogenic amines part I and II.
    7.20 Relevance of dataset size for training.
    8.1 The login screen for CAS (Ghent University).
    A.1 An example of a CSV output.

  • List of Tables

    6.1 Difference in prediction accuracy for 50 and 300 trees in random forest.
    7.1 The random model success rate per project for the parent selection algorithm, achieved by always assuming the parent is the project node.
    7.2 Parent node success rate using only file extensions as features. Only the first choice is considered a correct result.
    7.3 Parent node success rate using only file extensions as features. Both the first and the second choice count toward a correct result.
    7.4 The difference in success rate when using both the first and the second choice, when compared to only using the first choice when selecting a parent node.
    7.5 The success rate in selecting the parent of a node when using a feature vector consisting of file extensions and filedepth. Only the first choice for the parent is considered.
    7.6 The success rate in selecting the parent of a node when using a feature vector consisting of file extensions and filedepth. Both first and second choices for the parent are considered.
    7.7 The difference in success rate for first choice only parent selection when using only file extensions as features versus when using file extension and filedepth as features. Green values indicate a better performance by the feature vector including filedepth, red values indicate the contrary.
    7.8 The success rate in selecting a file's parents when the feature vector consists of file extensions and the content of PDF files. Only the first choice for a parent node is considered here.
    7.9 The success rate in selecting a file's parents when the feature vector consists of file extensions and the content of PDF files. Both the first and the second choices for a parent node are considered here.
    7.10 The difference in success rate with and without including PDF content in the feature vector. A green number indicates that including PDF content performed better, a red number indicates it performed worse.
    7.11 Accuracy rate of the random model.
    7.12 Accuracy rate of SVM type prediction with file extension as features.
    7.13 Accuracy rate of SVM type prediction with file extension and parent extension as features.
    7.14 Difference in accuracy between extension and extension together with parent extension.
    7.15 Accuracy rate of SVM type prediction with file extension, parent extension and PDF content as features.
    7.16 Difference in accuracy between extension and extension together with parent extension and PDF content.
    A.1 Accuracy rate of logistic regression type prediction with file extension as feature.
    A.2 Difference in accuracy between SVM and logistic regression, training on extension only.
    A.3 Accuracy rate of logistic regression type prediction with file extension and parent extension as features.
    A.4 Difference in accuracy between SVM and logistic regression, training on extension and parent extension.
    A.5 Accuracy rate of logistic regression type prediction with file extension, parent extension and PDF content as features.
    A.6 Difference in accuracy between SVM and logistic regression, training on extension, parent extension and PDF content.
    A.7 Accuracy rate of random forest type prediction with file extension as feature.
    A.8 Difference in accuracy between SVM and logistic regression, training on extension only.
    A.9 Accuracy rate of random forest type prediction with file extension and parent extension as features.
    A.10 Difference in accuracy between SVM and random forest, training on extension and parent extension.
    A.11 Accuracy rate of random forest type prediction with file extension, parent extension and PDF content as features.
    A.12 Difference in accuracy between SVM and random forest, training on extension, parent extension and PDF content.


    Chapter 1

    Introduction

    1.1 Problem statement

    Document management is increasingly difficult to implement in university research

    groups. The amount of data is ever increasing and staff turnover is a very relevant factor. This

    leads to large amounts of often unstructured data that is difficult to process. There is a need

    for a clear and organized way to deal with this data, so that it can be incorporated in future

    research.

    A possible answer to these problems comes in the form of a content management system.

    When it comes to these systems, there is one requirement that is significantly more important

    than the others: usability. If a new system is more difficult to use than an already existing one,

    the new system is never going to be introduced.

    Additionally, there are certain problems that arise when multiple users share a content

    management system. These problems originate in the fact that different people have different

    habits when it comes to data management, for example: a different folder structure, programmes

    and file types, document annotations, etc.

    1.2 Goal

    In order to provide some solutions for the aforementioned problems, we introduce a content

    management system with automated document tagging features in this master’s thesis. The

    aim of this system is to ensure the fluid continuity of projects in research groups, with minimal

    time wasted retrieving information and results gathered in past projects. To achieve this, we

    want to minimize manual input and automate things wherever possible. In particular, there are


    two things we want to automate:

    • The structure of the documents in a project. The system will automatically connect related

    documents to one another, avoiding the problem of disorganized and differently structured

    data.

    • Document annotation. Each document will be given a tag that puts it in a distinct class.

    This leads to a database that is easily queryable.

    By automating these functions in a standardized way, we hope to achieve a database for use

    in research groups which is clearly structured and easy to query. It is possible to retrieve all

    metadata belonging to a certain project, or to retrieve all documents with a certain annotation.

    All the end user has to do to work with the programme is create a project in the form of a

    project node and start uploading data.

    1.3 Thesis overview

    First, chapter 2 discusses some interviews we conducted with members of the faculty of Bio-

    Engineering at Ghent University, in order to get a better understanding of the requirements for

    the system.

    Chapter 3 contains a literature study on content management systems, specifically on back-

    end and database technologies used in these systems.

    Chapter 4 is a literature study on automated tagging. It describes a few general methods,

    and then takes a look at more specific algorithms developed in recent literature.

    Chapter 6 explains the approaches we selected to solve the learning problems in this thesis.

    Chapter 7 gives an overview of the data used to test the system. Additionally, it shows

    and explains the achieved results in solving the learning problems.

    Chapter 5 presents an overview of the back-end and database structure of the system.

    Chapter 8 outlines potential future improvements and features, especially with respect to

    the requirements obtained from the interviews.

    Finally, chapter 9 presents the conclusions made about the tested methods.


    Chapter 2

    Requirement interviews

    In order to know what requirements members of the faculty of Bio-Engineering of Ghent

    University would want in an automated content management system, we conducted some interviews.

    This chapter will describe the interview process and the conclusions drawn from these interviews.

    For a full text version of the interviews, refer to appendix B.

    2.1 Interview process

    The interviews were meant to answer questions about potential interest in a content management

    system, as well as to gather information about project specifics for later use in the learning

    component of the system. For this purpose, we created the following set of questions:

    • General explanation of the perceived system and what the functionality is.

    • Would you be interested in such a system?

    – If yes, what features are crucial to you? Are there any potential improvements you

    can think of?

    – If no, what features would we have to implement to make you interested?

    • What are the first steps you undertake when starting a new project or paper?

    • What kinds of files do you generate during your research? What are the specific contents

    and extensions of these files? Are there any specific relations between these files? What

    is the order of magnitude of the amount of files created or used in a typical project?


    • Do you make use of a specific directory structure during a project and if so, what does it

    look like?

    • If you were to make use of this system, would you use it to upload singular files during

    your research, or to upload the finished project as a reference book?

    • How do you manage your references when writing a paper?

    2.2 Interview conclusions

    The following is a list of conclusions drawn from the interviews:

    • There is a certain demand for a content management system within the faculty, as data

    management is currently performed in an individual manner. This can make it hard to

    find older research.

    • Collecting and saving data can be a messy process. Adding some form of automation to

    make it more standardized and structured would be welcomed, as long as the system is

    easy to use.

    • Version control of files is important.

    • Some projects use confidential data; as such, there should be some form of authorization

    and authentication to work with this data.

    • Both proposed use cases (using the system dynamically during research and using it as

    a reference book afterwards) were requested. The first because it eases the burden of

    keeping up with a folder hierarchy (given enough automation and user friendliness), the

    second because it makes it easier to find information about older projects.

    • Both the types of files generated during a project and the directory structure vary widely.

    • All people interviewed use EndNote for managing references; thus, integrating this software

    would greatly improve user friendliness and usability.


    Chapter 3

    Content management systems:

    literature study

    This chapter will make a comparison of content management techniques, especially with regard

    to possible database technologies. The benefits and downsides of multiple database types are

    considered and applied to the context of a content management system. Some existing

    implementations will also be discussed and compared to our system, in order to clarify the decisions

    made in our adaptation of a content management system.

    3.1 Background information

    A content management system (CMS) is a digital application that allows for creation, adaptation

    and retrieval of digital content. CMSs are often equated to web content management systems

    (WCMS), specifically designed for managing web content. This is not the purpose of this thesis,

    however: our system is based on a digital asset management system (DAMS). In this type of

    CMS, the focus lies on the optimal use, reuse, and repurposing of its assets or data [1]. Still,

    WCMSs will also be taken into account in this chapter, as their architecture is often very similar

    to the architecture of DAMSs.

    For this master’s thesis, we will not discuss a graphical user interface or other front-end

    components, as this is out of scope for our system. We will instead focus on the back-end

    structure and database technology.

    In most content management systems, the content itself is simply stored on a server. In order

    to access this content, metadata about each file is stored in a database [2] [3]. The metadata


    can be used to query the database, as well as locate the relevant files on the server.

    3.2 Database technologies

    This section will provide a quick overview of the different types of databases and will briefly

    examine the strengths and weaknesses of each type. Additionally, it will present a more in-depth

view of graph database technologies, as this is the database type that our system ultimately employs.

    3.2.1 Relational databases

    Relational databases have been the backbone of most data-oriented projects for a long time.

    They excel at keeping a large amount of data persistent while using a standardized relational

    model and query language, namely SQL (Structured Query Language). Relational databases

    provide concurrent access through the use of ACID (Atomicity, Consistency, Isolation, Durabil-

    ity) transactions [4]. This means the following:

    • Atomicity: every part of a transaction must succeed, otherwise the transaction is aborted.

    • Consistency: every transaction must end in a valid state for the database, i.e., consistent

    with all relevant constraints.

• Isolation: concurrency control of the flow of transactions, i.e., the resulting system state is as if all transactions had been performed sequentially.

    • Durability: once a transaction has completed, the results of this transaction are stored

    permanently.

    Error handling is done by performing a roll-back of these transactions.
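As a small illustration of these properties, the sketch below groups two SQL statements into a single transaction using plain JDBC. The connection URL, table names and columns are hypothetical and only serve to show the commit/rollback mechanism.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Illustrative sketch of an ACID transaction in plain JDBC.
// The connection URL and schema are hypothetical.
public class TransactionExample {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/cms")) {
            conn.setAutoCommit(false); // group the statements into one transaction
            try (PreparedStatement insertFile = conn.prepareStatement(
                         "INSERT INTO files (name, project_id) VALUES (?, ?)");
                 PreparedStatement updateCount = conn.prepareStatement(
                         "UPDATE projects SET file_count = file_count + 1 WHERE id = ?")) {
                insertFile.setString(1, "report.pdf");
                insertFile.setLong(2, 7L);
                insertFile.executeUpdate();
                updateCount.setLong(1, 7L);
                updateCount.executeUpdate();
                conn.commit();   // both statements succeeded: the results become durable
            } catch (SQLException e) {
                conn.rollback(); // any failure: the whole transaction is undone (atomicity)
                throw e;
            }
        }
    }
}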

    The largest downside of a relational database in the context of content management systems

    is the object-relational impedance mismatch between the standardized relational data model

    and the in-memory data structures [4]. Figure 3.1 shows the four layers of object-relational

    impedance mismatch.

    Another important downside of relational databases is that they do not run well on clusters

    of computers [4]. This is relevant for big data applications, as scaling out (i.e., lots of smaller

    machines) is less expensive than scaling up (i.e., machines with better hardware).


Figure 3.1: The four layers of impedance mismatch in relational database management systems [4].

    NewSQL

    NewSQL is a relatively new category of data stores, providing solutions for better horizontal

    scaling while retaining the relational database model. The technologies in this category usually

    offer a relational view of the data, although the internal data representation can differ from

this [5]. This type of data store is ideal for applications with strong consistency needs (e.g., some

    time-crucial applications) where the structure of the data is known and not likely to change.

    The reward for working with a predefined data structure is very efficient and flexible querying

    [6].

    We have decided against using a relational database technology for this work, however, as

    the structure of the data is not known beforehand. Additionally, the structure of projects will

    often vary and can change over time (when more documents are added to an existing project).

    3.2.2 NoSQL

    NoSQL (Not Only SQL) is an umbrella term that includes a large amount of very diverse

    database technologies that do not use the relational database model. The technologies that

belong to this class of data stores vary widely, but most share the following characteristics [5]:

    • Flexible data models: NoSQL data stores offer more flexible schemas and are sometimes

    completely without schema.

• Support for large-scale data: NoSQL databases are designed to handle large amounts of data

    in a distributed manner.


    • Provides high availability: in light of the distributed nature of a lot of NoSQL database

    systems, they achieve high availability through partition tolerance rather than consistency

    (which is the case for relational databases).

• Does not usually rely on ACID transactions; instead, the transactions follow the BASE (Basically Available, Soft state, Eventually consistent) principles. This indicates that the database as a whole remains available even if parts of it are not, and that it can tolerate inconsistency for a certain amount of time but eventually settles into a consistent state.

    There are four main architectural patterns in NoSQL databases.

    Key-value store

    A key-value store is a database that stores simple key-value pairs [4]. There is no query language;

    instead, get-requests are performed by inputting a key and retrieving the corresponding value.

    There are no restrictions on data types in this type of database. Figure 3.2 shows a comparison

    of the architecture of a key-value store and a relational database. Key-value stores are often

exposed through a simple REST API with PUT, GET and DELETE operations.

    The benefits of a key-value store are its simplicity and scalability; most use cases of this type

    of database require simple data patterns. Within the context of content management systems,

    this might not be an optimal choice, as interrelations within the data can be numerous.

    Figure 3.2: A comparison of the architecture of a key-value store and the architecture of a

    traditional relational database [4].
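To make the interface concrete, the following toy sketch (our own illustration, not tied to any particular key-value product) shows how small the surface of such a store is; the three methods mirror the PUT, GET and DELETE operations mentioned above.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Toy in-memory key-value store. Real systems expose the same three operations
// over a network protocol (often a REST API), but the interface stays this small.
public class KeyValueStore {
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();

    public void put(String key, byte[] value) {   // PUT
        store.put(key, value);
    }

    public Optional<byte[]> get(String key) {     // GET
        return Optional.ofNullable(store.get(key));
    }

    public void delete(String key) {              // DELETE
        store.remove(key);
    }

    public static void main(String[] args) {
        KeyValueStore kv = new KeyValueStore();
        kv.put("document:42:metadata", "{\"title\": \"thesis.pdf\"}".getBytes());
        kv.get("document:42:metadata").ifPresent(v -> System.out.println(new String(v)));
        kv.delete("document:42:metadata");
    }
}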

    Document store

Document store databases have a tree-like structure with a root, branches and leaves, as shown

    in Figure 3.3. Each node in a document store maps to a unique path expression, consisting

    of all the previous nodes up to the root. Each document store has a query language or API,

    which is used to traverse these path expressions [4]. Document stores are very flexible, with a


    wide variety of possible use cases. They are often used in content management systems where

    interrelations within the data are simple.

    Figure 3.3: The architectural pattern of a document store database [4].
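The sketch below is a minimal, hypothetical illustration of such path expressions: a document is modelled as nested maps, and a path of keys identifies exactly one node in the tree.

import java.util.List;
import java.util.Map;

// Toy illustration of path expressions in a document store:
// every node in the tree-shaped document is reachable through the keys leading to it.
public class DocumentPathExample {

    @SuppressWarnings("unchecked")
    static Object resolve(Map<String, Object> document, String path) {
        Object current = document;
        for (String segment : path.split("/")) {
            current = ((Map<String, Object>) current).get(segment);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> document = Map.of(
                "project", Map.of(
                        "title", "Packaging experiments",
                        "files", List.of("data.csv", "report.pdf")));
        // The path expression "project/title" identifies exactly one leaf.
        System.out.println(resolve(document, "project/title"));
    }
}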

    Column family store

    Column family stores are a form of key-value pairs, in which the key is a series of columns rather

    than a singular value [4]. An example of a simple column store is shown in Figure 3.4, where

    the key consists of two columns. Columns can be grouped together in column families. This

    way, not all tables have to be read to respond to a given query, only the relevant ones. Column

    family stores are made to support sparse data and work best on a cluster of servers. They are

    extremely scalable, but are only worth using when working with very large sets of data.

    Column family stores are a valid option for content management systems. Column family

    stores could be used to model graphs, as they are made to support sparse data, but only when

    using an extremely large amount of data.

    Figure 3.4: An example of a very simple column family store with two columns as key [4].


    Graph store

    The architectural pattern of a graph store database consists of three unique data fields: nodes,

    relationships and properties. Nodes can be linked by a relationship, while both nodes and

    relationships can have properties. This pattern is visualized in Figure 3.5.

    The benefits of a graph store lie in the relationships between nodes. It is easy and efficient

    to analyse these relationships and to traverse through nodes that are linked to each other. As

    such, graph store databases are a solid choice for handling highly interconnected data [5].

    This focus on relations between data leads to a disadvantage too, however. In contrast

    with other NoSQL architectures, the structure of graph stores is highly changeable, which leads

    to volatile keys. Additionally, the data is queried by capitalizing on the relations between

    nodes instead of lookups. This means that partitioning graph databases is significantly more

    challenging than partitioning other NoSQL databases [7].

In this thesis, we decided to use a graph store database to store metadata about the managed

    content. This choice was made because in our system, the emphasis lies on the relationships

    between the data files, and the dataset at our disposal is not large enough to warrant the use of

    a column family store database. Additionally, graph stores are generally efficient when used in

    recommendation-based systems [8]. This is relevant in our work, as Chapter 6 will explain that

    we generate relations between graphs in a manner akin to recommendation systems.

    Figure 3.5: The architectural pattern of a graph store database [4].

    3.3 Graph database technologies

    This section will go over some existing implementations of a graph store, as this is the architec-

    tural pattern that we use in this master’s thesis. At the time of writing, there are three database

    technologies that are more popular than their counterparts:

    • Neo4j: currently the most prevalent graph database [9] [10].


    • Titan: a scalable graph database, optimized for modeling extremely large amounts of data

    [11]

    • OrientDB: a multi-model database, combining elements of a graph store and a document

    store [12].

    In 2013, Jouili and Vansteenberghe published a paper comparing the performance of these

graph database implementations [13]. They used Blueprints, a generic graph interface, to test this in a completely

    implementation-agnostic way. Their system tested the performance of each database under three

    types of workloads:

    • Load workload: in this type of workload, vertices and edges are progressively added to the

    graph.

    • Traversal workload: this type of workload consists of shortest path and neighbourhood

    exploration assignments.

    • Intensive workload: multiple users concurrently send basic requests to the database.

    When testing the load workload, it was found that Neo4j performed better than both Titan

    and OrientDB for graphs with under three million vertices. Neo4j also got the best results in

    both the shortest path search and neighbourhood exploration tests, regardless of the maximum

    hop size used. When testing under an intensive workload, OrientDB outperformed both Neo4j

    and Titan. The difference in throughput was marginal, however.

Because of the aforementioned results and because of prior experience with the technology, we

    have opted to use Neo4j for our system.

    3.4 Back-end technologies

    Our system uses Java Spring as back-end technology. There are two major reasons for this

    choice:

    1. Spring contains a Spring Data Neo4j library. This library provides access to a Neo4j

    database, including object mapping, transaction handling, etc. The interaction with the

database happens through Cypher, Neo4j’s query language [9] (see the entity sketch after this list).

    2. We have previous experience in working with Java Spring. As such, this choice gains us

    more time to work on the learning components of the system.
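As an illustration of the first point, the sketch below shows how a document node could be mapped using the Spring Data Neo4j / Neo4j OGM annotations current at the time of writing. The entity, field and relationship names are purely hypothetical and do not correspond to our actual data model; annotation packages may also differ between library versions.

import org.neo4j.ogm.annotation.GraphId;
import org.neo4j.ogm.annotation.NodeEntity;
import org.neo4j.ogm.annotation.Relationship;

import java.util.HashSet;
import java.util.Set;

// Hypothetical document node mapped onto the graph; illustrative only.
@NodeEntity
public class DocumentNode {

    @GraphId
    private Long id;

    private String fileName;
    private String extension;

    // An outgoing PART_OF relationship links a document to its project node.
    @Relationship(type = "PART_OF", direction = Relationship.OUTGOING)
    private ProjectNode project;

    // Tags attached to this document, stored as TAGGED_WITH relationships.
    @Relationship(type = "TAGGED_WITH", direction = Relationship.OUTGOING)
    private Set<TagNode> tags = new HashSet<>();
}

@NodeEntity
class ProjectNode {
    @GraphId private Long id;
    private String name;
}

@NodeEntity
class TagNode {
    @GraphId private Long id;
    private String label;
}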


    Chapter 4

    Automated document tagging:

    literature study

    This chapter will inform the reader about the different possible techniques available in the field

    of automated document tagging. Additionally, it will compare these techniques and provide

    information as to which ones are most applicable to the system at hand.

    4.1 Background information

    In computer science, document tagging is the problem of automatically attaching certain descrip-

    tive words, or tags, to a certain document. These tags should make it possible for the reader to

    gauge what the document is about at a glance, without investing any significant amount of time.

    Additionally, this makes it easier to classify documents and data, as heterogeneous datasets can

    easily be queried based on these tags.

Document tagging can be done individually, but the biggest benefits come with using tags in

    a multi-user context. Some problems arise in these situations, however. Golder and Sun cited

    four different types of problems when it comes to document tagging in a setting with multiple

users [14][15]. These problems are the following:

    • Polysemy: the same word or tag can have multiple meanings, depending on the context

    and the author.

    • Synonymy: multiple words or tags can have the same meaning.

    • Level variation: tagging can happen on multiple abstraction levels, meaning tags are not


    always comparable.

• Spelling: there can be differences in spelling, be it through spelling mistakes or regional differences in language, e.g., British versus American English.

    These issues showcase the need for standardized tagging rules. One solution, in the field of

    computer science, is to automatically generate or suggest certain tags. In doing so, the problems

    mentioned above are mostly resolved.

    4.2 Technical overview

    In automated document tagging, machine learning techniques are used to label documents with

    certain tags. Generally, this is done in two steps: the first is feature extraction and selection,

    and the second is the actual document classification.

    Additionally, relevant tags can also be extracted from similar documents or other documents

    from the same user. In this approach, collaborative filtering is used to recommend tags. Note

    that in this approach, new tags cannot be generated; only tags that have already been used in

    the data set can be recommended.

    4.2.1 Feature extraction and selection

    Feature extraction and selection in the field of document tagging can be divided into three main

    categories:

    • Content-based methods

    • Structure-based methods

    • Graph-based methods

    In content-based methods, the actual content of a document is used to extract features. This

    usually translates to extracting word frequencies or semantic relations from the body of text.

    This method is mostly used in text-heavy document types [16][17].
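As a minimal illustration of a content-based feature, the sketch below computes raw term frequencies for a piece of text (our own toy example, without stemming or stop-word removal).

import java.util.HashMap;
import java.util.Map;

// Minimal bag-of-words feature extraction: raw term frequencies.
public class BagOfWords {

    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                frequencies.merge(token, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        // Prints counts such as materials=2, packaging=1, ... (map order is unspecified).
        System.out.println(termFrequencies(
                "Packaging materials were tested; the materials degraded over time."));
    }
}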

    Structure-based methods utilize the structure of a document to compose a feature vector.

    Naturally, this method is commonly used to extract features from document types with an

    emphasis on structure, like XML or HTML [18].

    Graph-based methods aim to make a graph in which the elements are any combination of

    users, words, documents and tags. In such a graph, the relationships between the multiple nodes


    are used to select features for classification. Algorithms where user behaviour is considered are

called personalized methods. Figure 4.1 shows an example of a graph consisting of users, tags

    and documents as nodes.

    Figure 4.1: An example of a tripartite graph used in graph-based methods. The nodes here

    consist of users, tags and documents [19].

    4.2.2 Document classification

    After extracting relevant features (mostly in content-based and structure-based approaches), the

    documents still have to be classified. The type of model used for classification largely depends

    on the problem (and data) at hand: if the data is mostly linearly separable, a simple (linear)

    model will be enough. If not, a more complex model is necessary. Figure 4.2 shows an example

of linearly separable data and non-linearly separable data. In the literature, the three most common classifier types we encountered are support vector machines (SVMs), decision trees (or random forests using decision trees) and neural networks [16][17][20]. In the following subsections, the three classification types that were used in this work (SVM, logistic regression and random forest) will be explained more thoroughly.

    Support vector machines

Support Vector Machines (SVMs) will be explained based on the paper by Tong et al. [21]. In the following section we assume a binary classification problem; extending SVMs to more than two classes is straightforward. For the mathematical principles we refer to the paper itself.

    In their simplest form, SVMs are hyperplanes that separate the training data by a maximal

    margin, as seen in Figure 4.3. All points lying on one side of the margin belong together and


(a) Linearly separable data (b) Non-linearly separable data
Figure 4.2: Linearly versus non-linearly separable data

    Figure 4.3: Linear support vector machine.

    are part of the same class. The functionality of SVMs is based on support vectors. These are

the points that lie closest to the hyperplane. Instead of calculating the hyperplane over all data points, it is only calculated over the support vectors. SVMs allow the training data to be transformed into a higher-dimensional space, in which a separating hyperplane can always be found. This can be done in several ways with the use of kernels. An example can be seen in Figure 4.4.
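For reference, the linear, separable case can be stated compactly (a standard formulation, not quoted verbatim from [21]): the hyperplane parameters $\mathbf{w}$ and $b$ are found by solving
\[
  \min_{\mathbf{w},\, b} \ \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^2
  \quad \text{subject to} \quad y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1, \ldots, n,
\]
after which a new point $\mathbf{x}$ is classified as $f(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^{\top}\mathbf{x} + b)$. The margin between the two classes equals $2 / \lVert \mathbf{w} \rVert$, and only the support vectors determine the solution.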

    Random forests

According to the definition of Leo Breiman: “A random forest is a classifier consisting of a collection of tree-structured classifiers $h(\mathbf{x}, a_k),\ k = 1, \ldots$, where the $a_k$ are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input $\mathbf{x}$” [22]. In other words, each tree is built from a random vector that is generated independently from all other trees, but with the same distribution. After all trees have been generated, they take a majority vote for the most popular class. An important characteristic is that for a large number of trees the result converges. This follows from the Strong Law of Large Numbers [22].


Figure 4.4: Non-linear support vector machine

    First, let us elaborate more on the individual decision trees. Two types of trees can be

    distinguished: classification trees and regression trees. As we are only interested in classifying

data, we will only focus on the first type of tree. Trees can be built according to several

    algorithms: CHAID (CHi-squared Automatic Interaction Detector), CART (Classification And

    Regression Trees), MARS (Multivariate Adaptive Regression Splines), etc. In this dissertation,

    trees are built following the CART principle. CART analysis is a form of binary recursive

    partitioning: every tree can be seen as a series of nodes, where each node represents a binary

    decision. The first node is called the root node, while nodes that do not split any further are

called leaves [23]. Figure 4.5 shows a binary tree with four leaves.


    Figure 4.5: An example of a binary tree.


Growing decision trees is a greedy top-down procedure. The procedure starts at the root node and progressively splits the data into smaller and smaller subsets. Splitting is done on a chosen attribute (feature). This attribute is chosen so that the split makes the data as pure as possible (in the purest form, a single class can be assigned to the subset of data). To make the optimal choice of attribute, we can measure the misclassification impurity $i(t)$:
\[
  i(t) = 1 - \max_j P(C_j \mid N) \tag{4.1}
\]
where $P(C_j \mid N)$ is the fraction of the training data in category $C_j$ that ends up in node $N$. Several functions can be used for $i(t)$; here we will only discuss the Gini impurity, which uses
\[
  i(t) = \sum_{i \neq j} P(C_i \mid N)\, P(C_j \mid N) \tag{4.2}
\]
which gives the expected error rate at node $N$ if the category label is selected randomly:
\[
  \frac{1}{2}\Big[1 - \sum_j P^2(C_j \mid N)\Big]. \tag{4.3}
\]
As a result, the Gini criterion will search for the largest class and isolate it from the rest of the data (the purest split). It has been shown that Gini works well with noisy data [24].
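As a small worked example of equation (4.3) (our own illustration): a node in which 80% of the training samples belong to class $C_1$ and 20% to class $C_2$ has impurity
\[
  i(t) = \tfrac{1}{2}\big[1 - (0.8^2 + 0.2^2)\big] = \tfrac{1}{2}(1 - 0.68) = 0.16,
\]
whereas a pure node has impurity 0 and a perfectly mixed binary node (50/50) reaches the maximum of $\tfrac{1}{2}(1 - 0.5) = 0.25$. The attribute chosen for a split is the one that reduces this impurity the most.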

When a tree is grown fully, the lower splits are decided based on less data and might therefore be poor splits; together with noise in the data, this can cause overfitting. To prevent this, the tree can be pruned. There are two options: pre-pruning, where the tree stops growing when there is an insufficient amount of data to make a split, and post-pruning, where the tree is fully grown first and subtrees are removed afterwards. In CART, post-pruning is chosen because this often leads to the best results. We will not discuss these methods any further, as random forest classification uses fully grown trees.

Before going further with random forests, we will briefly go over the time complexity of a decision tree. If we denote the number of data points by $N$ and the dimensionality (the number of features) by $D$, the runtime complexity is $O(\log_2 N)$ and the training complexity is $O(D N^2 \log_2 N)$ [24]. It is clear that training a tree is an expensive task, but classifying a new sample is not.

A random forest creates an ensemble of many unpruned trees. According to Breiman, randomness is very important [22]. By injecting the right amount of randomness for each tree, one can minimize the correlation between trees while maintaining their strength. This is done by randomly selecting features or feature combinations at each node to make the split. This procedure

    has the following advantages:

1. Its accuracy is as good as AdaBoost and sometimes better.


    2. It is relatively robust to outliers and noise.

    3. It is faster than bagging or boosting.

    4. It gives useful internal estimates of error, strength, correlation and variable importance.

    5. It is simple and easily parallelized.

By using bagging in combination with random feature selection, one can minimize the correlation. Each new training set is drawn, with replacement, from the original training set. A tree is then grown on this new training set using random feature selection. Importantly, the trees are not pruned; pruning is not necessary because of the bagging and the weak correlation between trees. In general, approximately one-third of the original training set is left out of each new training set. This unused one-third (the out-of-bag data) can be used for validation. Bagging is used because it improves accuracy and because it can give an estimate of the generalization error, the strength and the correlation of the combined ensemble [22].

Trees are grown according to the CART methodology explained earlier, where the split at each node is done on a random group of features. Trees are fully grown and not pruned. As the split is done on a random subset of features, this results in faster training, $O(D \log_2 N / M)$ with $M$ the number of input variables, and minimizes the inter-tree dependencies.
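The sketch below illustrates, in simplified form, the two sources of randomness described above: bootstrap sampling with replacement (bagging, with the out-of-bag remainder set aside) and random feature selection at a split. It is a conceptual toy, not the implementation used later in this work.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Conceptual sketch of the two sources of randomness in a random forest.
public class RandomForestSketch {

    private static final Random RNG = new Random(42);

    // Draw a bootstrap sample of indices (with replacement) and collect the out-of-bag indices.
    static int[] bootstrapSample(int n, Set<Integer> outOfBag) {
        int[] sample = new int[n];
        boolean[] drawn = new boolean[n];
        for (int i = 0; i < n; i++) {
            sample[i] = RNG.nextInt(n);
            drawn[sample[i]] = true;
        }
        for (int i = 0; i < n; i++) {
            if (!drawn[i]) {
                outOfBag.add(i); // roughly one third of the data, usable for validation
            }
        }
        return sample;
    }

    // Pick a random subset of feature indices to consider at a single split.
    static List<Integer> randomFeatureSubset(int totalFeatures, int subsetSize) {
        List<Integer> features = new ArrayList<>();
        for (int f = 0; f < totalFeatures; f++) {
            features.add(f);
        }
        Collections.shuffle(features, RNG);
        return features.subList(0, subsetSize);
    }

    public static void main(String[] args) {
        Set<Integer> oob = new HashSet<>();
        int[] sample = bootstrapSample(1000, oob);
        System.out.println("Bootstrap size: " + sample.length + ", out-of-bag: " + oob.size());
        System.out.println("Features considered at this split: " + randomFeatureSubset(20, 4));
    }
}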

    Logistic regression

    Logistic regression is a mathematical model. The model is used to predict the outcome of an

    event based on one or more features. It arises from the desire to model the posterior probabilities

    of the K classes via linear functions in x, while at the same time ensuring that they sum to one

and remain in [0, 1] [25]. The model is fitted by minimizing a cross-entropy function (equivalently, by maximizing the likelihood). For a more complete view of the mathematics behind logistic regression, we refer to Hastie et al. [25].
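For reference, the standard formulation given by Hastie et al. [25] models these posterior probabilities as
\[
  \Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^{\top} x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^{\top} x)}, \qquad k = 1, \ldots, K-1,
\]
\[
  \Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^{\top} x)},
\]
which by construction sum to one and lie in [0, 1].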

    4.2.3 Collaborative filtering

    Recommender systems are filtering systems that attempt to predict the value or rating that

    a certain user would give a certain item. In recommender systems, collaborative filtering is

    a technique in which available information about similar users or items is used to make this

    recommendation [26].

    Collaborative filtering can either be user-based or item-based. In the former version, the

    system searches users who share the same rating patterns as the user at hand and then uses


    information about these users to calculate a prediction for the active user. In the latter, a similar

    approach is taken with items: first similar items are found, then information about these items

    is used to make a prediction. Similarity between two items can be calculated in different ways

    depending on the application [27].
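One common choice (not necessarily the one used in [27]) is the cosine similarity between the rating vectors of two items:
\[
  \operatorname{sim}(i, j) =
  \frac{\sum_{u \in U} r_{u,i}\, r_{u,j}}
       {\sqrt{\sum_{u \in U} r_{u,i}^{2}}\ \sqrt{\sum_{u \in U} r_{u,j}^{2}}},
\]
where $U$ is the set of users that have rated both items $i$ and $j$, and $r_{u,i}$ denotes the rating user $u$ gave item $i$.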

    When one replaces the numeric value or rating with a tag, collaborative filtering can also be

    used for automated tag recommendations. There is a certain amount of background information

    necessary about either the user or the document (or both) for this technique; as such, this

    technique is most applicable in graph-based methods, where such information is kept in the

    form of graph relations. It is worth noting that no new tags can be recommended using this

    approach, as all recommendations are based on tags that have been added to the system in the

    past.

    4.3 Solutions in literature

    In this section, we will give an overview of specific methods found in recent literature. First,

    in subsection 4.3.1, single-label classification will be discussed. In subsection 4.3.2 we will then

    discuss some solutions for adding multiple labels to a single document.

    4.3.1 Single-label learning

    Content-based

    Choudhary presented a content-based method for clustering text using a Universal Networking

    Language (UNL) to capture semantic relations between words. By using this semantic rep-

    resentation, Choudhary achieves better clustering results when compared to a bag of words

    representation using word frequencies [17]. In the UNL representation, the relations between

    words in a sentence are visualized in a graph structure, as depicted in Figure 4.6. In this

    method, classification is done by using a neural network based technique called Self Organizing

    Maps (SOM), in which similar documents are mapped to the same nodes.

    Structure-based

    Garboni et al. published a data-mining method for clustering XML documents using only

    structure-related information [18]. This data-mining algorithm searches documents for structural


    Figure 4.6: An example of a sentence in the UNL representation [17].

pattern similarities, which the authors call frequent sub-trees. An example of such a tree is

    shown in Figure 4.7. Classifying happens in three different steps in this method:

    1. Remove irrelevant XML tags.

    2. Define the classes by extracting multiple frequent sub-trees from the training set.

    3. Classify new documents in one of the defined classes by using a custom distance measuring

    algorithm.

    Figure 4.7: An example of a frequent sub-tree in two XML documents [18].


    Graph-based and collaborative filtering

    Lipczak proposed a tag recommendation system for folksonomies catered to individual users

    [28]. This graph-based method classifies documents in three steps. In a first step, basic tags are

    extracted from the title of a document. These tags are chosen based on a score for each word in

    the title, which is equal to the amount of times this word has been chosen as a tag divided by

    the number of occurrences. The second step serves to make a lexicon of all the possibly related

    tags for the document at hand. This lexicon can be built in two ways:

    1. Add other tags from the same resource to the lexicon.

2. Add tags that are closely related to the tags extracted from the title.

    In the third step, the list of tags extracted in step 2 is narrowed down. The system uses person-

    omy based filtering to accomplish this: tags closely related to the user (or similar to tags the

    user has used before) will get higher priority.

    Jäschke et al. created another graph-based tagging algorithm for folksonomies called FolkRank

    [29]. In this method, a graph is constructed with users, tags and documents representing nodes.

    When recommending a tag, the available tags are given a weight according to a ranking algo-

    rithm like PageRank, based on the relations in the created graph. This algorithm suffers from

    a problem, however, as the PageRank algorithm will always jump to the most globally popular

    tags given enough iterations. Liu et al. proposed a method to fix this issue in the FolkRank

    method, called FolkDiffusion [16]. This algorithm uses a ranking algorithm based on the physics

    of heat diffusion rather than PageRank, overcoming the problem of topic drift.

Mishne created a tag recommender for weblog posts called AutoTag [30]. This algorithm

    uses a collaborative filtering approach for tag recommendation, in which the blog posts them-

    selves take on the role the user would have in a standard recommender system. The tags are

    considered to be the items that can be recommended to the users. The system then follows a

    traditional recommender system approach, which means tags assigned to blog posts similar to

    the active blog post are recommended. The information flow is depicted in Figure 4.8.Similarity

    is measured by querying an information retrieval system that has access to a large collection

    of blog posts. Queries are generated from the active post in several ways, the most effective of

    which was searching based on the most distinctive terms in the post.


    Figure 4.8: The flow of information that results in a recommendation in the AutoTag algorithm

    [30].

    Tatu et al. also proposed a graph-based solution to the tag recommendation problem, more

    specifically on bookmark data [31]. In this method, a certain document is identified by a

    triple including the bookmarked content, the user, and tags given to the bookmarked content.

    Additionally, relevant meta-data is also added to a bookmark. Natural language processing

    tools like WordNet are used to create well functioning feature vectors by stemming concepts

    and linking synonyms. Similar tags are then grouped together in conflated tags, avoiding the

    polysemy, synonymy and spelling problems mentioned in the introduction of this chapter. Both

    existing tags and new tags can be recommended using this algorithm: the former by distance

    measuring between similar users or documents and the latter by measuring weights of words

    used in the content of the bookmark.

    4.3.2 Multi-label learning

    In multi-label learning problems, each document can have multiple labels or tags. The possibility

    for multiple labels introduces a new challenge, as there might be some dependencies in the set

    of possible tags or labels. Figure 4.9 shows an example of such a problem, wher