Jonas Busschop, Tim Gaspard
A framework for automated document storage and annotation
Master's dissertation submitted in order to obtain the academic degree of
Master of Science in Computer Science Engineering

Supervisors: Prof. dr. ir. Antoon Bronselaer, Prof. dr. Guy De Tré
Counsellors: Dr. ir. Mike Vanderroost, Dieter Adriaenssens

Department of Telecommunications and Information Processing
Chair: Prof. dr. ir. Herwig Bruneel
Faculty of Engineering and Architecture
Academic year 2016-2017
Preface
This work is the result of a year long process in which we received the help of several people.
First of all, we would like to thank our supervisors, Prof. dr. ir. Antoon Bronselaer and Prof.
dr. Guy De Tré, as well as our counsellors, Dr. ir. Mike Vanderroost and Dieter Adriaenssens.
They gave us the guidance we required to bring this work to a successful conclusion, while still
allowing us the freedom to solve any problems with our own creativity.
Additionally, we would like to thank our family for their constant support, not only during
our dissertation year, but throughout our entire education. Jonas would also like to thank his
Dinnerclub and dormitory friends, for making his final year in university an enjoyable one.
Jonas Busschop & Tim Gaspard, June 2017
Usage permission
“The authors give permission to make this master dissertation available for consultation and to
copy parts of this master dissertation for personal use.
In the case of any other use, the copyright terms have to be respected, in particular with regard to
the obligation to state expressly the source when quoting results from this master dissertation.”
Jonas Busschop & Tim Gaspard, June 2017
A framework for automated document storage and annotation

By
Jonas Busschop & Tim Gaspard
Master’s dissertation submitted to obtain the academic degree of
Master of Science in Computer Science Engineering
Academic year 2016–2017
Supervisors: Prof. dr. ir. A. Bronselaer, Prof. dr. G. De Tré
Counsellors: Dr. ir. M. Vanderroost, D. Adriaenssens
Department of Telecommunications and Information Processing
Chair: Prof. dr. ir. H. Bruneel
Faculty of Engineering and Architecture
Ghent University
Abstract
In university research groups, document management is typically performed in an unstandardised
manner, often on an individual basis. Combined with staff turnover and the resulting
disorganisation of research data, this makes it especially hard to retrieve old information and
projects. In this work, we propose a framework for document storage with a focus on relations
between files, in order to keep research data easily available even after an extended period of
time. Additionally, the framework has learning components that automatically connect each
uploaded document to other data in the database and give it an annotation. In this manner, this
work aims to provide a usable solution for the data management problems in research groups.
Keywords
Graph database, automated document tagging, content management system, machine learning,
collaborative filtering
A framework for automated document storage and annotation
Jonas Busschop, Tim Gaspard
Supervisors: Prof. dr. ir. Antoon Bronselaer, Prof. dr. Guy De Tré
Counsellors: Dr. ir. Mike Vanderroost, Dieter Adriaenssens
Abstract—In university research groups, document management is typically performed in an unstandardised manner, often on an individual basis. Combined with staff turnover and the resulting disorganisation of research data, this makes it especially hard to retrieve old information and projects. In this work, we propose a framework for document storage with a focus on relations between files, in order to keep research data easily available even after an extended period of time. Additionally, the framework has learning components that automatically connect each uploaded document to other data in the database and give it an annotation. In this manner, this work aims to provide a usable solution for the data management problems in research groups.
Keywords—Graph database, automated document tagging, content management system, machine learning, collaborative filtering
I. INTRODUCTION
Document management is increasingly difficult to implement in university research groups. The amount of data is ever increasing, and staff turnover is a very relevant factor, as it leads to large amounts of often unstructured data that are difficult to process. As such, there is a need for a clear and organized way to deal with this data, so that it can be incorporated in future research.
Content management systems offer a solution to these kinds of problems. However, data belonging to university research groups has some special requirements when it comes to management: more often than not, there is no standardized procedure to handle or structure data. Most of the decisions pertaining to data management are made on an individual basis. This leads to very heterogeneous data: differences in folder structure, programmes and file types used, document annotations, etc.
This work introduces a content management system that creates relations between documents and annotates them in an automated way, in an effort to provide a solution to the aforementioned problems. When a file or document is uploaded, it is automatically given a position in a graph database, based on its relations to other files of the same project. This procedure standardizes the structure of the data, making it easier to retrieve relevant information. Additionally, the files are given an annotation in an automated way.
One of the challenges of this solution is the heterogeneity of the data: because of the widely varying types of documents, it is very hard to use the file contents as features. As such, we have used other representations, namely file extensions and metadata. Additionally, we have tested adding the content of PDF files to the feature vector, as this was the most prominent file type in the given data sets.
In order to place a given file in the existing data structure, the system assigns a parent to this file. This parent is decided by a method based on collaborative filtering: the system finds the most similar node in different projects, then calculates which node in the current project is most similar to that node's parent.
Any given document uploaded to the system is classified in one of five classes: Article, Data, Figure, Presentation, or Meta. The classification was tested using three different classifiers: support vector machines (SVMs), random forests, and logistic regression.
The remainder of this article is structured as follows. Section II discusses related work in the fields of content management systems and automated document tagging. Section III describes the methodology used in creating the system. Section IV describes the results achieved with the learning components of the system. Finally, Section V provides an overall conclusion for this work.
II. RELATED WORK
A. Content management systems
The type of content management system created in this work is a DAMS (Digital Asset Management System), where the focus lies on optimal use, reuse, and repurposing of assets or data [1]. In most content management systems, the content is simply stored on a server. In order to access this content, metadata about each file is stored in a database [2] [3]. The metadata can be used to query the database, as well as to locate the relevant files on the server.
B. Database technologies
1) Relational databases: Relational databases excel at keeping large amounts of data persistent while using a standardized relational model and query language, namely SQL (Structured Query Language). They provide access to data through ACID (Atomicity, Consistency, Isolation, Durability) transactions [4]. They require knowledge of the structure of the data beforehand, as they make use of a standardized schema. Traditional relational databases do not scale well horizontally.
NewSQL is a relatively new category of data stores, providing solutions for better horizontal scaling while retaining the relational database model. NewSQL stores still require a predefined schema, but compensate with very efficient and flexible querying [5].
2) NoSQL databases: NoSQL (Not only SQL) is an umbrella term covering a large number of very diverse database technologies that do not use the relational database model. The largest difference with relational databases, with respect to this work, is that they either have a flexible schema or no mandatory schema at all [4]. There are four classes of NoSQL database technologies:
1) Key value stores
2) Document stores
3) Column family stores
4) Graph stores
In this work we use a graph store, as this type of database excels when working with highly interconnected data [4].
C. Automated document tagging
In computer science, automated document tagging is the problem of automatically attaching certain descriptive words, or tags, to a document. These tags should make it possible for the reader to gauge what the document is about at a glance, without investing any significant amount of time, or to query the data. Machine learning techniques are used to label data with one or more tags. Generally, this is done in two steps.
1) Feature extraction and selection: features are extracted in one of three ways in automated document tagging systems: based on the content, based on the structure, or based on relations between metadata (graph-based). Content-based feature extraction is mostly used for text-heavy documents [6] [7]. Conversely, structure-based feature extraction is more common when working with documents that have an emphasis on structure (such as XML or HTML) [8].
2) Document classification: the second step consists of using the selected features to classify documents in one or multiple annotation classes. The model used for classification heavily depends on the problem and data at hand, but the most commonly used classifiers in the literature are support vector machines (SVMs), decision trees (or random forests of decision trees) and neural networks [6] [7] [9].
Another way to classify the data is to make use of collaborative filtering. This is a technique often used in recommender systems, in which available information about similar users or items is used to make recommendations or decisions for the current user or item [10]. Collaborative filtering can be either user-based or item-based. In the former, the system searches for users who share the same rating patterns as the user at hand and then uses information about these users to calculate a prediction for the active user. In the latter, a similar approach is taken with items: first similar items are found, then information about these items is used to make a prediction.
III. METHODOLOGY

A. Back-end structure
The structure of the data we are working with is not known beforehand. Additionally, the structure of different projects can vary immensely and can change at any time (e.g., when additional documents are added). For these reasons, working with relational databases and their rigid schemas is not desirable.
In our system, the emphasis lies on the relationships between the data files, and the dataset at our disposal is not large enough to warrant the use of a column family store. As such, we have opted to use Neo4j, an open source graph database [11].
B. Feature selection
The first features we use are the file extensions of the documents in the data set. The information about a document's extension is represented as an n-dimensional vector, where n is the total number of distinct extensions in the dataset. The values in this vector are binary, i.e., 0 if the file does not have this extension and 1 if it does.
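As an illustration, this binary extension feature can be sketched as follows; the function and variable names are ours, not taken from the actual implementation:

```python
# Sketch of the binary extension feature: a vector with one position per
# distinct extension in the data set, 1 at this file's extension, 0 elsewhere.
import os

def extension_vector(filename, all_extensions):
    """Return the n-dimensional binary extension vector for one file."""
    ext = os.path.splitext(filename)[1].lower()
    return [1 if ext == e else 0 for e in all_extensions]

# Hypothetical set of distinct extensions observed in the data set.
all_extensions = ['.pdf', '.tex', '.csv', '.png']
print(extension_vector('report.pdf', all_extensions))  # [1, 0, 0, 0]
```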
The second feature we have used is one we call filedepth. This is an attempt to quantify the position of a document in the folder structure, relative to the documents highest up in the folder hierarchy: documents located directly in the root folder are given a filedepth value of zero, and the filedepth value is incremented for each additional subfolder or hierarchy layer between a document and the root folder.
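A minimal sketch of the filedepth computation, assuming the feature is derived from the file path relative to the project root (the function name and path layout are illustrative):

```python
# Sketch of the filedepth feature: the number of subfolders between a
# document and the project root. Root-level documents get depth 0.
from pathlib import PurePosixPath

def filedepth(path, root):
    rel = PurePosixPath(path).relative_to(PurePosixPath(root))
    # Subtract 1 so that the file name itself does not count as a layer.
    return len(rel.parts) - 1

print(filedepth('project/paper.tex', 'project'))             # 0
print(filedepth('project/figures/plots/f1.png', 'project'))  # 2
```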
As PDF files were the most prominent in the dataset, we have also tested features that represent the content of these files. To do so, we have used the Python library PDFMiner [12]. A PDF file can be seen as a set of building blocks placed in the document, with different types of content, e.g., text, images and drawings. PDFMiner extracts a value for ten different types of boxes, depending on their frequency of occurrence in the document. The normalized versions of these values are used as features.
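The normalization step can be sketched as follows. The box-type names below follow PDFMiner's layout classes, but the exact set of ten types and the counts are illustrative assumptions; in the actual pipeline the counts would come from parsing a PDF with PDFMiner:

```python
# Sketch: turn raw per-type layout-box counts into normalized features.
# The ten types listed here are an assumption modelled on PDFMiner's
# layout classes; the counts below are hypothetical, not parsed output.
BOX_TYPES = ['LTChar', 'LTAnno', 'LTTextLine', 'LTTextBox', 'LTFigure',
             'LTImage', 'LTLine', 'LTRect', 'LTCurve', 'LTPage']

def normalized_box_features(counts):
    """Map raw per-type counts to relative frequencies (sum to 1)."""
    total = sum(counts.get(t, 0) for t in BOX_TYPES)
    if total == 0:
        return [0.0] * len(BOX_TYPES)
    return [counts.get(t, 0) / total for t in BOX_TYPES]

features = normalized_box_features({'LTChar': 900, 'LTFigure': 100})
print(features[0], features[4])  # 0.9 0.1
```

Note that with this normalization a single character counts as much as a whole figure, which is relevant for the bias discussed in the results section.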
Lastly, we use information about a file's parent when annotating documents. This information is available, as placement in the graph database happens before classification. The information we use is the file extension of a file's parent, represented in the same way as the file's own extension.
C. Automated graph placement
As mentioned earlier, automated graph placement is performed by selecting a parent node for the current file. This problem is not well suited for regular machine learning classifiers, as the number of possible classes (i.e., every document in a given project) would exceed the feature vector dimensionality. Additionally, the classifier would have to be retrained every time a single document was added to a project. In order to avoid these problems, the system uses an algorithm based on collaborative filtering. Three steps can be identified in this approach:
TABLE I: The success rate of the parent selection algorithm when using only file extensions in the feature vector.

                           1 trainset  2 trainsets  3 trainsets  4 trainsets  5 trainsets  6 trainsets  7 trainsets  random model
LaTeX project 1                   20%          18%          22%          20%          20%          20%          12%           11%
LaTeX project 2                   20%          21%          25%          20%          25%          17%          19%           19%
LaTeX project 3                   13%          57%          55%          55%          65%          63%          63%           38%
LaTeX project 4                   30%          33%          37%          26%          41%          30%          41%           34%
LaTeX project 5                    9%          55%          54%          61%          22%          27%          26%           24%
Bio-Engineering project 1         94%          82%          94%          84%          94%          65%          53%           53%
Bio-Engineering project 2         34%          34%          38%          43%          36%          28%          10%           34%
Bio-Engineering project 3         25%          25%          25%          28%          25%          24%          23%           27%
1) Find the node in the existing database with the highest similarity to the working node.
2) Find the parent of this node.
3) Find the node in the working project with the highest similarity to this parent node.
In order to find the most similar nodes, every node or file is represented by a feature vector as described in Section III-B. Then, the cosine similarity is calculated between the relevant feature vectors.
In regular collaborative filtering applications, the solution for the most similar data point is chosen for the working data point as well. This is not applicable here, however, as a node from another project can never be the parent node of the working document. As such, a second similarity comparison is made between the parent node of the most similar node and all the files in the current project. This ultimately yields a parent node for the current document.
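The three-step parent selection can be sketched as follows. This is a minimal illustration with our own function and variable names; the actual implementation may differ:

```python
# Sketch of the collaborative-filtering-style parent selection:
# 1) find the most similar node in the other projects,
# 2) take that node's parent,
# 3) pick the node in the current project most similar to that parent.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(vec, nodes):
    """nodes: list of (node_id, feature_vector); return the closest id."""
    return max(nodes, key=lambda n: cosine(vec, n[1]))[0]

def select_parent(new_vec, other_projects, parent_vec_of, current_project):
    twin = most_similar(new_vec, other_projects)       # step 1
    twin_parent_vec = parent_vec_of[twin]              # step 2
    return most_similar(twin_parent_vec, current_project)  # step 3

# Tiny hypothetical example with 2-dimensional feature vectors.
other = [('a', [1, 0]), ('b', [0, 1])]
parents = {'a': [0, 1], 'b': [1, 0]}
current = [('p', [0, 1]), ('q', [1, 0])]
print(select_parent([1, 0], other, parents, current))  # p
```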
D. Document classification
We classify each file in one of five classes, and annotate it with the corresponding tag. These classes are Article, Data, Figure, Presentation and Meta. Three different classifiers were tested: an SVM classifier, a random forest classifier and a logistic regression classifier. Since the feature vector is quite small and the feature space is not very complex, the classifiers were kept simple as well: we opted for a linear kernel for the SVM classifier and a small number of trees (50) for the random forest classifier. Python's Sklearn library was used to implement these techniques.
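A minimal sketch of this setup, using the configurations named above (linear-kernel SVM, 50-tree random forest, logistic regression); the synthetic data stands in for the document feature vectors and is not from the thesis:

```python
# The three classifiers compared in this work, configured as described.
# The tiny data set below is a hypothetical stand-in for the document
# feature vectors (e.g. binary extension indicators).
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1]]
y = ['Article', 'Article', 'Figure', 'Figure']

classifiers = {
    'SVM': SVC(kernel='linear'),
    'Random forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'Logistic regression': LogisticRegression(),
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, clf.predict([[1, 0, 0]]))  # each should predict 'Article'
```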
IV. RESULTS
A. Training data
The training data used in this work consists of two distinct data sets. The first is a collection of papers written in LaTeX; the second is a collection of bigger projects provided by the faculty of Bio-Engineering at Ghent University. It is worth noting that the first data set is well structured, while the second is not.
Testing has always been done in the same way, both for parent selection and type classification. Every method and feature vector has been evaluated in seven stages: we started with one project for training, then incrementally added projects until all were used.
B. Parent selection
Three different feature vectors were used to test the success rate of our algorithm. In a first iteration, a document was represented by a feature vector containing only its file extension. A second feature vector was made out of file extensions and the filedepth feature. Finally, in a third test, the contents of PDF files were added to the feature vector.
1) File extensions only: The results of the first test iteration are shown in Table I. The random model always picks the project node as parent, so its percentage is the relative number of times the project node was the correct parent. With the exception of the last two projects, the algorithm is a definite improvement over the random model. The two final Bio-Engineering projects perform worse here, as they are very different from the other projects in terms of file type composition. The first Bio-Engineering project still performs well because it is very small and the project node is the predominant parent.
Two additional phenomena can be seen in this table:
1) Some projects cause big fluctuations in percentage for one another, while others do not affect the results at all.
2) More datasets to train on does not necessarily mean a better result; indeed, in some cases the percentage of correct parent predictions drops after adding an additional training set.
These phenomena are caused by the nature of the algorithm: the presumption is made that a similar node will have a similar parent, which is not always true. This means that a better result will be achieved when similar data sets or projects are used for training, rather than simply more of them.
2) File extensions and filedepth: In the second testing iteration, the previously explained filedepth feature was added to the feature vector. The average difference in success rate per project is shown in Table II. In some projects, adding the filedepth feature causes a large difference; in others, the difference in success rate is minimal. When some document structure is present in the project, filedepth has a large impact. If the document structures of the training sets and the test set are similar, this impact is a net positive; if the document structures are not alike, however, the impact is negative.
TABLE II: The average difference in success rate per project when comparing a feature vector with just file extensions and a feature vector containing both file extensions and filedepth.

                           Mean improvement
LaTeX project 1                          5%
LaTeX project 2                         28%
LaTeX project 3                        -26%
LaTeX project 4                         27%
LaTeX project 5                         -2%
Bio-Engineering project 1                4%
Bio-Engineering project 2                2%
Bio-Engineering project 3               -3%

3) File extensions and PDF content: In the final test run, PDF content was used in the feature vector along with file extensions. The average difference in success rate per project is shown in Table III. These differences are not very large. However, in some individual cases, larger fluctuations in success rate can be seen (6-10%). These fluctuations are caused by a mismatch in parents of PDF files: the results improve in cases where similar PDF files with similar parents are present in the training sets, and worsen where these are outnumbered by PDF files with different parents.

TABLE III: The average difference in success rate per project when comparing a feature vector with just file extensions and a feature vector containing both file extensions and PDF content.

                           Average improvement
LaTeX project 1                             2%
LaTeX project 2                             0%
LaTeX project 3                             2%
LaTeX project 4                             1%
LaTeX project 5                             0%
Bio-Engineering project 1                   0%
Bio-Engineering project 2                  -1%
Bio-Engineering project 3                   1%
C. Document classification
1) Comparison of SVM, logistic regression and random forest: As mentioned earlier, three different classifiers were tested: a support vector machine classifier, a random forest classifier, and a logistic regression classifier. To test which one works best for our specific problem, we used the largest project at our disposal (Bio-Engineering project 3) to train the classifiers, and used the other projects as test sets. The results of this test are shown in Figure 1.
When using very simple feature vectors, the results of all three classifiers are similar. When the feature vectors get more complex, however, the SVM classifier performs better than the other two. As such, we will use this classifier for all further results.
To get an estimate of how difficult it is to assign tags to the data, a random model was made. This model always predicts the type that occurs most in Bio-Engineering project 3; its accuracy for a given test project is therefore the percentage of files of that type in the project. Table V shows the proportion of tags that would be correctly predicted by this model. Because the first five test sets contain almost no files with that tag, the random model performs poorly; it performs better for the next three datasets. From these percentages we can see that, in general, it is hard to predict the tags for the dataset: even for a dataset that resembles the training set, the proportion of correctly predicted tags is under 50%.
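This random model amounts to a majority-class baseline, which can be sketched as follows (the names and tag counts are illustrative, not the thesis data):

```python
# Sketch of the random (majority-class) baseline: always predict the tag
# that is most frequent in the training project, and score it on a test set.
from collections import Counter

def majority_baseline_accuracy(train_tags, test_tags):
    majority = Counter(train_tags).most_common(1)[0][0]
    return sum(t == majority for t in test_tags) / len(test_tags)

train = ['Data'] * 5 + ['Figure'] * 3          # hypothetical training project
test = ['Data', 'Figure', 'Article', 'Data']   # hypothetical test project
print(majority_baseline_accuracy(train, test))  # 0.5
```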
Fig. 1: Accuracy of type prediction when training with Biogenic amines part II: (a) extension only; (b) extension and parent extension; (c) extension, parent extension and PDF content.
2) File extensions only: Table IV shows the prediction accuracy when training and testing an SVM classifier with only the file extension as feature vector. For the first five test sets, decent results are obtained: the accuracy with a single training set is above 60%, and training with more than one training set even results in an accuracy of over 85% for all of these test sets. These high scores can be explained by the similarity between the training and test projects. The three Bio-Engineering projects, which do not resemble the first five training sets, obtain a far lower accuracy. For Bio-Engineering projects 2 and 3, a decent result (around 60% or more) is still achieved with a large number of training sets. Bio-Engineering project 1, however, has a very low accuracy even for a large number of training sets, as it is a very small project with file extensions that do not often occur in the training sets.
3) File extension and parent extension: Table VI shows the average difference in success rate between a feature vector with the file extension only and one with both the file extension and the parent extension. Although the differences are small, the mean difference is negative or negligible for every project.
TABLE IV: The success rate of the document classification algorithm when using only file extensions in the feature vector.

                           1 trainset  2 trainsets  3 trainsets  4 trainsets  5 trainsets  6 trainsets  7 trainsets
LaTeX project 1                   86%          86%          86%          86%          86%          86%          90%
LaTeX project 2                   63%          89%          89%          90%          90%          95%          95%
LaTeX project 3                   89%         100%         100%         100%         100%         100%         100%
LaTeX project 4                   85%         100%         100%         100%         100%         100%         100%
LaTeX project 5                   68%          87%          87%          87%          90%          90%          90%
Bio-Engineering project 1         12%          12%          12%          12%          29%          29%          29%
Bio-Engineering project 2         30%          38%          38%          38%          67%          67%          72%
Bio-Engineering project 3         41%          55%          55%          55%          61%          57%          60%
TABLE V: Random model for Bio-Engineering project 3.

                           Random model
LaTeX project 1                      0%
LaTeX project 2                      4%
LaTeX project 3                      0%
LaTeX project 4                      0%
LaTeX project 5                      4%
Bio-Engineering project 1           47%
Bio-Engineering project 2           41%
Bio-Engineering project 3           37%
This means that adding the parent extension to the feature vector does not increase the accuracy of the model.
The main reason for this is that the available training data is limited in amount and very homogeneous within each project, meaning that the relationship between parent and child is very specific to each project. Further research should be performed on bigger and more heterogeneous data sets.
TABLE VI: The average difference in success rate per project when comparing a feature vector with just file extensions and a feature vector containing both file extensions and parent extension.

                           Mean improvement
LaTeX project 1                          0%
LaTeX project 2                          1%
LaTeX project 3                         -1%
LaTeX project 4                         -1%
LaTeX project 5                          1%
Bio-Engineering project 1               -5%
Bio-Engineering project 2               -3%
Bio-Engineering project 3               -4%
4) File extension, parent extension and PDF content: Finally, we add the PDF content to the feature vector and compare the results to those obtained with the feature vector based on only the file extension. Table VII shows the results. Here again, the performance is in general worse than for the file extension only.
This means that there is little difference between using file extension, parent extension and PDF content and using only the file extension and parent extension, which leads to the conclusion that there is no benefit in using the PDF content in its current form. Looking at how the PDFMiner library works and how we constructed our feature vectors, there is no difference in weight between a single character in LTChar or LTAnno, a figure in LTFigure or LTImage, or a drawing in LTDrawing. This leads to features that are nearly always heavily biased towards text areas, even for image-heavy PDF files.
TABLE VII: The average difference in success rate per project when comparing a feature vector with just file extensions and a feature vector containing file extensions, parent extension and PDF content.

                           Mean improvement
LaTeX project 1                          0%
LaTeX project 2                         -1%
LaTeX project 3                         -1%
LaTeX project 4                         -1%
LaTeX project 5                          1%
Bio-Engineering project 1               +3%
Bio-Engineering project 2               -3%
Bio-Engineering project 3               -3%

5) Influence of training size: Figure 2 shows the results for Bio-Engineering project 2 as a function of training set size, illustrating the importance of a large and diverse training set. Contrary to the parent selection algorithm, more training data nearly always leads to better performance here.
Fig. 2: Relevance of dataset size for training
V. CONCLUSION
In this work we created a framework for document storage in university research groups, using a graph database to put the focus on the relations between documents.
A learning component was implemented to automatically assign a parent node to each new document uploaded to this framework. There is a clear correlation between file extension and finding the correct parent node. Adding more complex features gave varying results; as such, future work should focus on finding better features and on a selection process for which data sets to use for training, as more data does not always improve results.
Additionally, the documents added to the framework were given an annotation in an automated way. Out of the three classifiers tested, the SVM classifier provided the best results. As expected, there was a strong correlation between a document's annotation and its file extension. More complex features did not usually improve success rates, however. Future research into using the contents of prominent file types in a weighted way may improve results.
REFERENCES
[1] F. Frey, Digital asset management—a closer look at the literature. PhD thesis, Rochester Institute of Technology, 2004.
[2] T. Gill, "Digital media asset management system and process," Sept. 20, 2005. US Patent 6,947,959.
[3] K. C. Jones, C. K. Aggson, T. F. Rodriguez, B. Mosher, K. L. Levy, R. S. Hiatt, and G. B. Rhoads, "Digital asset management and linking media signals with related data using watermarks," Aug. 22, 2006. US Patent 7,095,871.
[4] K. Grolinger, W. A. Higashino, A. Tiwari, and M. A. Capretz, "Data management in cloud environments: NoSQL and NewSQL data stores," Journal of Cloud Computing: Advances, Systems and Applications, vol. 2, no. 1, p. 22, 2013.
[5] E. Redmond and J. R. Wilson, Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement. Pragmatic Bookshelf, 2012.
[6] Z. Liu, C. Shi, and M. Sun, "FolkDiffusion: A graph-based tag suggestion method for folksonomies," Information Retrieval Technology, vol. 6458, pp. 231–240, 2010.
[7] B. Choudhary and P. Bhattacharyya, "Text clustering using semantics," 1997.
[8] C. Garboni, F. Masseglia, and B. Trousse, "Sequential pattern mining for structure-based XML document classification," in International Workshop of the Initiative for the Evaluation of XML Retrieval, pp. 458–468, Springer, 2005.
[9] E. Gibaja and S. Ventura, "A tutorial on multilabel learning," ACM Computing Surveys (CSUR), vol. 47, no. 3, p. 52, 2015.
[10] P. Resnick and H. R. Varian, "Recommender systems," Communications of the ACM, vol. 40, no. 3, pp. 56–58, 1997.
[11] Neo4j Developers, "Neo4j," Graph NoSQL Database [online], 2012.
[12] Y. Shinyama, "PDFMiner," 2010.
CONTENTS i
Contents
1 Introduction 1
1.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Thesis overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Requirement interviews 3
2.1 Interview process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Interview conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3 Content management systems: literature study 5
3.1 Background information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.2 Database technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2.1 Relational databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2.2 NoSQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 Graph database technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.4 Back-end technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4 Automated document tagging: literature study 12
4.1 Background information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 Technical overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2.1 Feature extraction and selection . . . . . . . . . . . . . . . . . . . . . . . 13
4.2.2 Document classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2.3 Collaborative filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3 Solutions in literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.1 Single-label learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.2 Multi-label learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5 Back-end structure 26
5.1 Test architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.1.1 Used Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2 Layer overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2.1 REST layer (back-end) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2.2 Controller layer (back-end) . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2.3 Conversion layer (back-end) . . . . . . . . . . . . . . . . . . . . . . . 28
5.2.4 Model layer (back-end) . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.2.5 Persistence layer (back-end) . . . . . . . . . . . . . . . . . . . . . . . 28
5.2.6 Data layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.3 Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.3.1 Deployment diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6 Automated graph placement 33
6.1 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.2 Parent selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.3 Type classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
7 Data and results 42
7.1 Training data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
7.1.1 Pre-processing of the data . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
7.1.2 First dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
7.1.3 Second dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7.1.4 Comparison between datasets . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.2.1 Parent selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.2.2 Type classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
8 Future work 73
8.1 Additional features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.1.1 Front-end . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.1.2 Authorization and authentication . . . . . . . . . . . . . . . . . . . . . . . 73
8.1.3 Back-end and database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.2 Improving learning component . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.2.1 Parent selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.2.2 File annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8.3 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8.3.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
9 Conclusion 77
A Figures and tables 79
A.1 SVM vs logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
A.2 SVM vs random forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
B Interviews 86
B.1 Interview process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
B.2 Interviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
B.2.1 Interview with Iris Tavernier . . . . . . . . . . . . . . . . . . . . . . . . . 87
B.2.2 Interview with Carl Lachat . . . . . . . . . . . . . . . . . . . . . . . . . . 88
B.2.3 Interview with Kurt De Mey . . . . . . . . . . . . . . . . . . . . . . . . . 89
B.2.4 Interview with Nathalie De Cock . . . . . . . . . . . . . . . . . . . . . . . 89
B.2.5 Interview with Angelique Vandemoortele . . . . . . . . . . . . . . . . . . 90
C Installation manual 91
C.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
C.2 Java 1.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
C.3 Anaconda - Python 3.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
C.4 Neo4j 3.0.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
C.5 Maven 4.0.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
C.6 Postman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
List of Figures
3.1 The four layers of impedance mismatch in relational database management systems
[4]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 A comparison of the architecture of a key-value store and the architecture of a
traditional relational database [4]. . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 The architectural pattern of a document store database [4]. . . . . . . . . . . . . 9
3.4 An example of a very simple column family store with two columns as key [4]. . . 9
3.5 The architectural pattern of a graph store database [4]. . . . . . . . . . . . . . . 10
4.1 An example of a tripartite graph used in graph-based methods. The nodes here
consist of users, tags and documents [19]. . . . . . . . . . . . . . . . . . . . . . . 14
4.2 Linearly versus non-linearly separable data . . . . . . . . . . . . . . . . . . . . 15
4.3 Linear support vector machine. . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.4 Non-linear support vector machine . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.5 An example of a binary tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.6 An example of a sentence in the UNL representation [17]. . . . . . . . . . . . . . 20
4.7 An example of a frequent sub-tree in two XML documents [18]. . . . . . . . . . . 20
4.8 The flow of information that results in a recommendation in the AutoTag algo-
rithm [30]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.9 An example of a multi-label learning problem where the labels Yi are not only
dependent on the features Xi, but also on each other [32]. . . . . . . . . . . . . . 23
4.10 A visual representation of the personalized, multi-label tag recommendation prob-
lem [34]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1 JSON output in Postman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2 Static diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3 Deployment diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.1 PDF tree structure seen from PDFMiner [36] . . . . . . . . . . . . . . . . . . . 35
6.2 The working document is compared to all nodes belonging to other projects in
the database. The most similar node is selected, encircled in red on the figure. . 37
6.3 The parent node of the most similar document to the working document is re-
trieved. In this example the parent node is the project node itself, encircled in
red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.4 The node retrieved in the previous step is compared to every node belonging to
the working project, including the project node (as this node can be a parent too)
but excluding the working document. The node with the highest cosine similarity
is then chosen as the final parent node of the working document. In this case,
this is the project node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.5 The resulting relationship created by the parent-finding algorithm. . . . . . . . . 39
7.1 The number of occurrences in percentage of every type of file extension in the
first dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
7.2 The number of occurrences in percentage of every type of file annotation in the
first dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
7.3 The number of occurrences in percentage of every type of file extension of parent
nodes in the first dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
7.4 The number of occurrences in percentage of every type of file annotation of parent
nodes in the first dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
7.5 The number of occurrences in percentage of every type of file annotation of PDF
files in the first dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7.6 The number of occurrences in percentage of every type of file extension in the
second dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.7 The number of occurrences in percentage of every type of file annotation in the
second dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.8 The number of occurrences in percentage of every type of file extension of parent
nodes in the second dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.9 The number of occurrences in percentage of every type of file annotation of parent
nodes in the second dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.10 The number of occurrences in percentage of every type of file annotation of PDF
files in the second dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.11 The number of occurrences in percentage of every type of file extension in the
Shrimp project. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.12 A comparison of document composition between the first and second dataset. . . 52
7.13 The amount of correct predictions as a function of the amount of datasets used
to train, with A possibilistic view on set and multiset comparison as test set. . . 56
7.14 A comparison of the file type composition of three different datasets. . . . . . . . 57
7.15 The amount of correct predictions as a function of the amount of datasets used
to train, with Propagation of data fusion as test set. . . . . . . . . . . . . . . . . 58
7.16 The file extensions among parent nodes in two different projects. . . . . . . . . . 58
7.17 Comparison between the file extensions of parents of PDF files in the test set and
train sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.18 Accuracy of type prediction when training with Biogenic amines part II . . . . 65
7.19 The composition of extensions of parents in Biogenic amines part I and II. . . . . 69
7.20 Relevance of dataset size for training . . . . . . . . . . . . . . . . . . . . . . . . . 71
8.1 The login screen for CAS (Ghent University). . . . . . . . . . . . . . . . . . . . . 74
A.1 An example of a CSV output . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
List of Tables
6.1 Difference in prediction accuracy for 50 and 300 trees in random forest. . . . . . 41
7.1 The random model success rate per project for the parent selection algorithm,
achieved by always assuming the parent is the project node. . . . . . . . . . . . . 54
7.2 Parent node success rate using only file extensions as features. Only the first
choice is considered a correct result. . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.3 Parent node success rate using only file extensions as features. Both the first and
the second choice count toward a correct result. . . . . . . . . . . . . . . . . . . . 57
7.4 The difference in success rate when using both the first and the second choice,
when compared to only using the first choice when selecting a parent node. . . . 59
7.5 The success rate in selecting the parent of a node when using a feature vector
consisting of file extensions and filedepth. Only the first choice for the parent is
considered. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.6 The success rate in selecting the parent of a node when using a feature vector
consisting of file extensions and filedepth. Both first and second choices for the
parent are considered. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7.7 The difference in success rate for first choice only parent selection when using
only file extensions as features versus when using file extension and filedepth
as features. Green values indicate a better performance by the feature vector
including filedepth, red values indicate the contrary. . . . . . . . . . . . . . . . . 61
7.8 The success rate in selecting a file’s parents when the feature vector consists of
file extensions and the content of PDF files. Only the first choice for a parent
node is considered here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.9 The success rate in selecting a file’s parents when the feature vector consists of
file extensions and the content of PDF files. Both the first and the second choices
for a parent node are considered here. . . . . . . . . . . . . . . . . . . . . . . . . 62
7.10 The difference in success rate with and without including PDF content in the
feature vector. A green number indicates that including PDF content performed
better, a red number indicates it performed worse. . . . . . . . . . . . . . . . . . 62
7.11 Accuracy rate of the random model. . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.12 Accuracy rate of SVM type prediction with file extension as features. . . . . . . . 67
7.13 Accuracy rate of SVM type prediction with file extension and parent extension
as features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.14 Difference in accuracy between extension and extension together with parent ex-
tension. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.15 Accuracy rate of SVM type prediction with file extension, parent extension and
pdf content as features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.16 Difference in accuracy between extension and extension together with parent ex-
tension and pdf content. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A.1 Accuracy rate of logistic regression type prediction with file extension as feature. 80
A.2 Difference in accuracy between SVM and logistic regression when training on extension only . . 80
A.3 Accuracy rate of logistic regression type prediction with file extension and parent
extension as features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
A.4 Difference in accuracy between SVM and logistic regression when training on
extension and parent extension . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
A.5 Accuracy rate of logistic regression type prediction with file extension, parent
extension and pdf content as features. . . . . . . . . . . . . . . . . . . . . . . . . 82
A.6 Difference in accuracy between SVM and logistic regression when training on
extension, parent extension and pdf content . . . . . . . . . . . . . . . . . . . . 82
A.7 Accuracy rate of random forest type prediction with file extension as feature. . . 83
A.8 Difference in accuracy between SVM and random forest when training on extension only . . 83
A.9 Accuracy rate of random forest type prediction with file extension and parent
extension as features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
A.10 Difference in accuracy between SVM and random forest when training on extension
and parent extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
A.11 Accuracy rate of random forest type prediction with file extension, parent exten-
sion and pdf content as features. . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
A.12 Difference in accuracy between SVM and random forest when training on extension,
parent extension and pdf content . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Chapter 1
Introduction
1.1 Problem statement
Document management is an increasingly difficult task in university research groups. The
amount of data grows continuously, and staff turnover is high. This combination leads to large
amounts of often unstructured data that is difficult to process. There is a need for a clear and
organized way to deal with this data, so that it can be incorporated in future research.
A possible answer to these problems comes in the form of a content management system.
When it comes to these systems, there is one requirement that is significantly more important
than the others: usability. If a new system is more difficult to use than an already existing one,
the new system is never going to be introduced.
Additionally, there are certain problems that arise when multiple users share a content
management system. These problems originate in the fact that different people have different
habits when it comes to data management, for example: a different folder structure, programmes
and file types, document annotations, etc.
1.2 Goal
In order to provide some solutions for the aforementioned problems, we introduce a content
management system with automated document tagging features in this master’s thesis. The
aim of this system is to ensure the fluid continuity of projects in research groups, with minimal
time wasted retrieving information and results gathered in past projects. To achieve this, we
want to minimize manual input and automate things wherever possible. In particular, there are
two things we want to automate:
• The structure of the documents in a project. The system will automatically connect related
documents to one another, avoiding the problem of disorganized and differently structured
data.
• Document annotation. Each document will be given a tag that puts it in a distinct class.
This leads to a database that is easily queryable.
By automating these functions in a standardized way, we hope to achieve a database for use
in research groups which is clearly structured and easy to query. It is possible to retrieve all
metadata belonging to a certain project, or to retrieve all documents with a certain annotation.
All the end user has to do to work with the programme is create a project in the form of a
project node and start uploading data.
1.3 Thesis overview
First, chapter 2 discusses some interviews we conducted with members of the faculty of Bio-
Engineering at Ghent University, in order to get a better understanding of the requirements for
the system.
Chapter 3 contains a literature study on content management systems, specifically on back-
end and database technologies used in these systems.
Chapter 4 is a literature study on automated tagging. It describes a few general methods,
and then takes a look at more specific algorithms developed in recent literature.
Chapter 5 presents an overview of the back-end and database structure of the system.
Chapter 6 explains the approaches we selected to solve the learning problems in this thesis.
Chapter 7 gives an overview of the data used to test the system. Additionally, it shows
and explains the achieved results in solving the learning problems.
Chapter 8 outlines potential future improvements and features, especially with respect to
the requirements obtained from the interviews.
Finally, chapter 9 presents the conclusions made about the tested methods.
Chapter 2
Requirement interviews
In order to know what requirements members of the faculty of Bio-engineering of Ghent Uni-
versity would want in an automated content management system, we conducted some interviews.
This chapter will describe the interview process and the conclusions drawn from these interviews.
For a full text version of the interviews, refer to appendix B.
2.1 Interview process
The interviews were meant to answer questions about potential interest in a content management
system, as well as to gather information about project specifics for later use in the learning
component of the system. For this purpose, we created the following set of questions:
• General explanation of the envisioned system and its functionality.
• Would you be interested in such a system?
– If yes, what features are crucial to you? Are there any potential improvements you
can think of?
– If no, what features would we have to implement to make you interested?
• What are the first steps you undertake when starting a new project or paper?
• What kinds of files do you generate during your research? What are the specific contents
and extensions of these files? Are there any specific relations between these files? What
is the order of magnitude of the amount of files created or used in a typical project?
• Do you make use of a specific directory structure during a project and if so, what does it
look like?
• If you were to make use of this system, would you use it to upload individual files during
your research, or to upload the finished project as a reference book?
• How do you manage your references when writing a paper?
2.2 Interview conclusions
The following is a list of conclusions drawn from the interviews:
• There is a certain demand for a content management system within the faculty, as data
management is currently performed in an individual manner. This can make it hard to
find older research.
• Collecting and saving data can be a messy process. Adding some form of automation to
make it more standardized and structured would be welcomed, as long as the system is
easy to use.
• Version control of files is important.
• Some projects use confidential data; as such, there should be some form of authorization
and authentication to work with this data.
• Both proposed use cases (using the system dynamically during research and using it as
a reference book afterwards) were requested. The former because it eases the burden of
keeping up with a folder hierarchy (given enough automation and user friendliness), the
latter because it makes it easier to find information about older projects.
• Both the types of files generated during a project and the directory structure vary widely.
• All people interviewed use EndNote for managing references; thus, integrating this software
would greatly improve user friendliness and usability.
Chapter 3
Content management systems:
literature study
This chapter will make a comparison of content management techniques, especially with regard
to possible database technologies. The benefits and downsides of multiple database types are
considered and applied to the context of a content management system. Some existing imple-
mentations will also be discussed and compared to our system, in order to clarify the decisions
made in our adaptation of a content management system.
3.1 Background information
A content management system (CMS) is a digital application that allows for creation, adaptation
and retrieval of digital content. CMSs are often equated to web content management systems
(WCMS), specifically designed for managing web content. This is not the purpose of this thesis,
however: our system is based on a digital asset management system (DAMS). In this type of
CMS, the focus lies on the optimal use, reuse, and repurposing of its assets or data [1]. Still,
WCMSs will also be taken into account in this chapter, as their architecture is often very similar
to the architecture of DAMSs.
For this master’s thesis, we will not discuss a graphical user interface or other front-end
components, as this is out of scope for our system. We will instead focus on the back-end
structure and database technology.
In most content management systems, the content itself is simply stored on a server. In order
to access this content, metadata about each file is stored in a database [2] [3]. The metadata
can be used to query the database, as well as locate the relevant files on the server.
3.2 Database technologies
This section will provide a quick overview of the different types of databases and will briefly
examine the strengths and weaknesses of each type. Additionally, it will present a more in-depth
view on graph database technologies, as this is the database type that the system ultimately
employed.
3.2.1 Relational databases
Relational databases have been the backbone of most data-oriented projects for a long time.
They excel at keeping a large amount of data persistent while using a standardized relational
model and query language, namely SQL (Structured Query Language). Relational databases
provide concurrent access through the use of ACID (Atomicity, Consistency, Isolation, Durabil-
ity) transactions [4]. This means the following:
• Atomicity: every part of a transaction must succeed, otherwise the transaction is aborted.
• Consistency: every transaction must end in a valid state for the database, i.e., consistent
with all relevant constraints.
• Isolation: concurrency control of the flow of transactions, i.e., the resulting system state
is the same as if all transactions had been performed sequentially.
• Durability: once a transaction has completed, the results of this transaction are stored
permanently.
Error handling is done by performing a roll-back of these transactions.
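The roll-back mechanism can be illustrated with a small, self-contained sketch. The example below uses SQLite as a stand-in for a full relational database; the account table and the transfer scenario are hypothetical, chosen only to show atomicity and consistency at work:

```python
import sqlite3

# Atomicity in practice: either every statement of the transfer succeeds,
# or a roll-back restores the last valid state (illustrative account table).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (name TEXT PRIMARY KEY, "
            "balance INTEGER NOT NULL CHECK (balance >= 0))")
con.executemany("INSERT INTO account VALUES (?, ?)",
                [("alice", 100), ("bob", 50)])
con.commit()

try:
    # This transfer would leave alice with a negative balance, violating the
    # CHECK constraint (consistency), so the transaction cannot complete.
    con.execute("UPDATE account SET balance = balance - 200 WHERE name = 'alice'")
    con.execute("UPDATE account SET balance = balance + 200 WHERE name = 'bob'")
    con.commit()
except sqlite3.IntegrityError:
    con.rollback()  # undo every part of the failed transaction

balances = dict(con.execute("SELECT name, balance FROM account"))
print(balances)  # both balances are unchanged: {'alice': 100, 'bob': 50}
```

After the roll-back, no trace of the partial transfer remains, which is exactly the behaviour the atomicity and consistency properties guarantee.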
The largest downside of a relational database in the context of content management systems
is the object-relational impedance mismatch between the standardized relational data model
and the in-memory data structures [4]. Figure 3.1 shows the four layers of object-relational
impedance mismatch.
Another important downside of relational databases is that they do not run well on clusters
of computers [4]. This is relevant for big data applications, as scaling out (i.e., lots of smaller
machines) is less expensive than scaling up (i.e., machines with better hardware).
Figure 3.1: The four layers of impedance mismatch in relational database management systems
[4].
NewSQL
NewSQL is a relatively new category of data stores, providing solutions for better horizontal
scaling while retaining the relational database model. The technologies in this category usually
offer a relational view of the data, although the internal data representation can differ from
this[5]. This type of data store is ideal for applications with strong consistency needs (i.e., some
time-crucial applications) where the structure of the data is known and not likely to change.
The reward for working with a predefined data structure is very efficient and flexible querying
[6].
We have decided against using a relational database technology for this work, however, as
the structure of the data is not known beforehand. Additionally, the structure of projects will
often vary and can change over time (when more documents are added to an existing project).
3.2.2 NoSQL
NoSQL (Not Only SQL) is an umbrella term for a large number of very diverse database
technologies that do not use the relational database model. The technologies that belong to
this class of data stores vary widely, but most share the following characteristics [5]:
• Flexible data models: NoSQL data stores offer more flexible schemas and are sometimes
completely without schema.
• Enables large-scale data: NoSQL databases are designed to handle large amounts of data
in a distributed manner.
• Provides high availability: given the distributed nature of many NoSQL database
systems, they achieve high availability through partition tolerance rather than through
consistency (as is the case for relational databases).
• Does not usually rely on ACID transactions; instead, the transactions follow the BASE
(Basically Available, Soft state, Eventually consistent) principles. This means that the
database as a whole remains available even if parts of it are not, and that it can tolerate
inconsistency for a certain amount of time before eventually settling into a consistent state.
There are four main architectural patterns in NoSQL databases.
Key-value store
A key-value store is a database that stores simple key-value pairs [4]. There is no query language;
instead, get-requests are performed by inputting a key and retrieving the corresponding value.
There are no restrictions on data types in this type of database. Figure 3.2 shows a comparison
of the architecture of a key-value store and a relational database. Key-value stores are often
used as a simple REST API with PUT, GET and DELETE inputs.
The benefits of a key-value store are its simplicity and scalability; most use cases of this type
of database require simple data patterns. Within the context of content management systems,
this might not be an optimal choice, as interrelations within the data can be numerous.
Figure 3.2: A comparison of the architecture of a key-value store and the architecture of a
traditional relational database [4].
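The PUT/GET/DELETE interface described above can be sketched in a few lines. This is a hypothetical in-memory version: real key-value stores add persistence and distribution, and the key and value below are made up for illustration:

```python
# A minimal in-memory key-value store mirroring the PUT/GET/DELETE interface
# of the REST APIs described above (illustrative sketch, not a real product).
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value  # no schema: any value type is allowed

    def get(self, key):
        return self._data.get(key)  # a key lookup is the only form of query

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("doc:42", {"filename": "report.pdf", "tags": ["draft"]})
print(store.get("doc:42")["filename"])  # prints: report.pdf
```

The simplicity is the point: there is no way to ask "which documents have tag 'draft'?" without scanning every value, which is why interrelated data fits this model poorly.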
Document store
Document store databases have a tree-like structure with a root, branches and leaves, as shown
in Figure 3.3. Each node in a document store maps to a unique path expression, consisting
of all the previous nodes up to the root. Each document store has a query language or API,
which is used to traverse these path expressions [4]. Document stores are very flexible, with a
wide variety of possible use cases. They are often used in content management systems where
interrelations within the data are simple.
Figure 3.3: The architectural pattern of a document store database [4].
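The path-expression idea can be made concrete with a small sketch. The document layout, field names, and query syntax below are illustrative, not taken from any particular document store:

```python
# A document as a tree: every node is reachable through a unique path
# expression from the root (project name and file entries are made up).
document = {
    "project": {
        "name": "Shrimp",
        "files": [
            {"path": "data/results.csv", "annotation": "raw data"},
            {"path": "paper/draft.pdf", "annotation": "publication"},
        ],
    }
}

def resolve(doc, path):
    """Traverse a '/'-separated path expression, e.g. 'project/files/0/path'."""
    node = doc
    for part in path.split("/"):
        # list nodes are indexed numerically, dict nodes by field name
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node

print(resolve(document, "project/files/1/annotation"))  # prints: publication
```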
Column family store
Column family stores are a form of key-value pairs, in which the key is a series of columns rather
than a singular value [4]. An example of a simple column store is shown in Figure 3.4, where
the key consists of two columns. Columns can be grouped together in column families. This
way, not all tables have to be read to respond to a given query, only the relevant ones. Column
family stores are made to support sparse data and work best on a cluster of servers. They are
extremely scalable, but are only worth using when working with very large sets of data.
Column family stores are a valid option for content management systems. They could even be
used to model graphs, as they are made to support sparse data, but only when working with
extremely large amounts of data.
Figure 3.4: An example of a very simple column family store with two columns as key [4].
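The row-key/column-family/column layout can be sketched as nested mappings. This is a toy in-memory model with invented names, not a real wide-column store, but it shows why a query only has to touch the relevant family:

```python
# Rows are keyed once; columns are grouped into families so that reading one
# family never touches the others (illustrative metadata/content split).
store = {
    "doc:42": {
        "meta": {"extension": "pdf", "depth": 3},
        "content": {"text": "Propagation of data fusion ..."},
    }
}

def read(row_key, family, column):
    # Only the requested column family is consulted; other families
    # (here, the potentially large "content" family) are never read.
    return store[row_key][family][column]

print(read("doc:42", "meta", "extension"))  # prints: pdf
```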
Graph store
The architectural pattern of a graph store database consists of three unique data fields: nodes,
relationships and properties. Nodes can be linked by a relationship, while both nodes and
relationships can have properties. This pattern is visualized in Figure 3.5.
The benefits of a graph store lie in the relationships between nodes. It is easy and efficient
to analyse these relationships and to traverse through nodes that are linked to each other. As
such, graph store databases are a solid choice for handling highly interconnected data [5].
This focus on relations between data leads to a disadvantage too, however. In contrast
with other NoSQL architectures, the structure of graph stores is highly changeable, which leads
to volatile keys. Additionally, the data is queried by capitalizing on the relations between
nodes instead of lookups. This means that partitioning graph databases is significantly more
challenging than partitioning other NoSQL databases [7].
In this thesis, we decided to use a graph store database to store metadata about the used
content. This choice was made because in our system, the emphasis lies on the relationships
between the data files, and the dataset at our disposal is not large enough to warrant the use of
a column family store database. Additionally, graph stores are generally efficient when used in
recommendation-based systems [8]. This is relevant in our work, as Chapter 6 will explain that
we generate relations between graphs in a manner akin to recommendation systems.
Figure 3.5: The architectural pattern of a graph store database [4].
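The node/relationship/property pattern of Figure 3.5 can be sketched in a few lines of Python. This is an illustrative toy model with invented labels and properties; the actual system uses Neo4j rather than anything like this:

```python
# Tiny property graph: nodes and relationships both carry properties, and
# traversal is a local operation on each node, which is what makes
# neighbourhood queries cheap in a graph store.
class Node:
    def __init__(self, **properties):
        self.properties = properties
        self.edges = []  # outgoing relationships: (type, target, properties)

    def link(self, rel_type, target, **properties):
        self.edges.append((rel_type, target, properties))

    def neighbours(self, rel_type):
        return [n for t, n, _ in self.edges if t == rel_type]

project = Node(name="Shrimp", kind="project")
report = Node(name="results.pdf", kind="file")
raw = Node(name="data.csv", kind="file")
report.link("PARENT", project)
raw.link("PARENT", report, created_by="upload")  # relationship property

print([n.properties["name"] for n in report.neighbours("PARENT")])
```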
3.3 Graph database technologies
This section will go over some existing implementations of a graph store, as this is the architec-
tural pattern that we use in this master’s thesis. At the time of writing, there are three database
technologies that are more popular than their counterparts:
• Neo4j: currently the most prevalent graph database [9] [10].
• Titan: a scalable graph database, optimized for modeling extremely large amounts of data
[11].
• OrientDB: a multi-model database, combining elements of a graph store and a document
store [12].
In 2013, Jouili and Vansteenberghe published a paper comparing the performance of these
graph database implementations [13]. They used the Blueprints interface to test this in a
completely implementation-agnostic way. Their system tested the performance of each database
under three
types of workloads:
• Load workload: in this type of workload, vertices and edges are progressively added to the
graph.
• Traversal workload: this type of workload consists of shortest path and neighbourhood
exploration assignments.
• Intensive workload: multiple users concurrently send basic requests to the database.
When testing the load workload, Neo4j was found to perform better than both Titan and
OrientDB for graphs with under three million vertices. Neo4j also achieved the best results in
both the shortest path search and the neighbourhood exploration tests, regardless of the maximum
hop size used. Under an intensive workload, OrientDB outperformed both Neo4j and Titan,
although the difference in throughput was marginal.
Because of the aforementioned results and our prior experience with the technology, we
have opted to use Neo4j for our system.
3.4 Back-end technologies
Our system uses Java Spring as back-end technology. There are two major reasons for this
choice:
1. Spring contains a Spring Data Neo4j library. This library provides access to a Neo4j
database, including object mapping, transaction handling, etc. The interaction with the
database happens through Cypher, Neo4j’s query language [9].
2. We have previous experience working with Java Spring. As such, this choice leaves us
more time to work on the learning components of the system.
Chapter 4
Automated document tagging:
literature study
This chapter will inform the reader about the different possible techniques available in the field
of automated document tagging. Additionally, it will compare these techniques and provide
information as to which ones are most applicable to the system at hand.
4.1 Background information
In computer science, document tagging is the problem of automatically attaching descriptive
words, or tags, to a document. These tags should make it possible for the reader to gauge at a
glance what the document is about, without investing any significant amount of time.
Additionally, this makes it easier to classify documents and data, as heterogeneous datasets can
easily be queried based on these tags.
Document tagging can be done individually, but the biggest benefits come from using tags in
a multi-user context. Some problems arise in these situations, however. Golder and Sun cited
four different types of problems when it comes to document tagging in a setting with multiple
users [14][15]. These problems are the following:
• Polysemy: the same word or tag can have multiple meanings, depending on the context
and the author.
• Synonymy: multiple words or tags can have the same meaning.
• Level variation: tagging can happen on multiple abstraction levels, meaning tags are not
always comparable.
• Spelling: there can be differences in spelling, be it through spelling mistakes or regional
differences in language, e.g., between British and American English.
These issues showcase the need for standardized tagging rules. One solution, in the field of
computer science, is to automatically generate or suggest certain tags. In doing so, the problems
mentioned above are mostly resolved.
4.2 Technical overview
In automated document tagging, machine learning techniques are used to label documents with
certain tags. Generally, this is done in two steps: the first is feature extraction and selection,
and the second is the actual document classification.
Additionally, relevant tags can also be extracted from similar documents or other documents
from the same user. In this approach, collaborative filtering is used to recommend tags. Note
that in this approach, new tags cannot be generated; only tags that have already been used in
the data set can be recommended.
4.2.1 Feature extraction and selection
Feature extraction and selection in the field of document tagging can be divided into three main
categories:
• Content-based methods
• Structure-based methods
• Graph-based methods
In content-based methods, the actual content of a document is used to extract features. This
usually translates to extracting word frequencies or semantic relations from the body of text.
This method is mostly used in text-heavy document types [16][17].
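As a minimal illustration of the word-frequency extraction described above, the following stdlib-only Python sketch counts term frequencies; the sample sentence and the simple tokenization rules are our own illustrative assumptions, not taken from the cited papers:

```python
from collections import Counter

def term_frequencies(text):
    """A minimal bag-of-words feature extractor based on word frequencies."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return Counter(w for w in words if w)

doc = "Graph databases store data as graphs; graph queries traverse relations."
features = term_frequencies(doc)
print(features["graph"])  # 2 ("Graph" and "graph"; "graphs" is a separate token)
```

In practice, such raw counts are usually normalized (for instance with tf-idf) before being fed to a classifier.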
Structure-based methods utilize the structure of a document to compose a feature vector.
Naturally, this method is commonly used to extract features from document types with an
emphasis on structure, like XML or HTML [18].
Graph-based methods aim to make a graph in which the elements are any combination of
users, words, documents and tags. In such a graph, the relationships between the multiple nodes
are used to select features for classification. Algorithms in which user behaviour is considered
are called personalized methods. Figure 4.1 shows an example of a graph consisting of users, tags
and documents as nodes.
Figure 4.1: An example of a tripartite graph used in graph-based methods. The nodes here
consist of users, tags and documents [19].
4.2.2 Document classification
After extracting relevant features (mostly in content-based and structure-based approaches), the
documents still have to be classified. The type of model used for classification largely depends
on the problem (and data) at hand: if the data is mostly linearly separable, a simple (linear)
model will be enough. If not, a more complex model is necessary. Figure 4.2 shows an example
of linearly separable and non-linearly separable data. In the literature, the three most common
classifier types we encountered are support vector machines (SVMs), decision trees (or random
forests built from decision trees) and neural networks [16][17][20]. In the following subsections,
the three classifier types used in this work (SVM, logistic regression and random forest) will be
explained more thoroughly.
Support vector machines
Support Vector Machines (SVMs) will be explained based on the paper by Tong et al. [21]. In
the following section, we assume a binary classification problem; the extension to more than two
classes is straightforward. For the mathematical principles, we refer to the paper itself.
In their simplest form, SVMs are hyperplanes that separate the training data by a maximal
margin, as seen in Figure 4.3.
Figure 4.2: Linearly versus non-linearly separable data. (a) Linearly separable data; (b) non-linearly separable data.
Figure 4.3: A linear support vector machine.
All points lying on one side of the margin belong together and
are part of the same class. The functionality of SVMs is based on support vectors: the points
that lie closest to the hyperplane. Instead of calculating the hyperplane over all data points, it
is calculated over the support vectors only. SVMs allow the training data to be transformed into
a higher-dimensional space, in which a separating hyperplane can always be found. This
transformation can be done in several ways with the use of kernels. An example can be seen in
Figure 4.4.
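The kernel idea above can be made concrete with an explicit feature map. The following stdlib-only sketch uses toy data and the quadratic map as illustrative assumptions (they are not taken from [21]): two classes that no line can separate in 2-D become separable by a plane after the mapping.

```python
import math

# Two classes that are not linearly separable in 2-D: an inner circle
# (radius 0.5) and an outer circle (radius 2.5) around the origin.
inner = [(0.5 * math.cos(a / 8), 0.5 * math.sin(a / 8)) for a in range(50)]
outer = [(2.5 * math.cos(a / 8), 2.5 * math.sin(a / 8)) for a in range(50)]

def feature_map(x1, x2):
    """Explicit feature map of the quadratic kernel:
    (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return (x1 ** 2, math.sqrt(2) * x1 * x2, x2 ** 2)

# In the mapped 3-D space, the plane z1 + z3 = 2 (i.e. x1^2 + x2^2 = 2)
# separates the classes, so a linear hyperplane now suffices.
inner_ok = all(z1 + z3 < 2 for z1, _, z3 in (feature_map(*p) for p in inner))
outer_ok = all(z1 + z3 > 2 for z1, _, z3 in (feature_map(*p) for p in outer))
print(inner_ok, outer_ok)  # True True
```

In practice, kernels compute the inner products in the mapped space directly, so the map itself never has to be evaluated explicitly.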
Random forests
According to the definition of Leo Breiman: “A random forest is a classifier consisting of a
collection of tree-structured classifiers {h(x, Θ_k), k = 1, ...} where the Θ_k are independent,
identically distributed random vectors and each tree casts a unit vote for the most popular class
at input x” [22]. In other words, each tree is grown from a random vector that is generated
independently of all other trees, but with the same distribution. After all trees have been grown,
they take a majority vote for the most popular class. An important characteristic is that, for a
large number of trees, the result converges. This follows from the Strong Law of Large Numbers [22].
Figure 4.4: A non-linear support vector machine.
First, let us elaborate on the individual decision trees. Two types of trees can be
distinguished: classification trees and regression trees. As we are only interested in classifying
data, we will focus solely on the first type. Trees can be built according to several
algorithms: CHAID (CHi-squared Automatic Interaction Detector), CART (Classification And
Regression Trees), MARS (Multivariate Adaptive Regression Splines), etc. In this dissertation,
trees are built following the CART principle. CART analysis is a form of binary recursive
partitioning: every tree can be seen as a series of nodes, where each node represents a binary
decision. The first node is called the root node, while nodes that do not split any further are
called leaves [23]. Figure 4.5 shows a binary tree with four leaves.
Figure 4.5: An example of a binary tree.
Growing decision trees is a greedy top-down procedure. The procedure starts at the root
node and progressively splits the data into smaller and smaller subsets. Splitting is done on a
chosen attribute (feature). This attribute is chosen so that the split makes the data as pure as
possible (in its purest form, we can assign one class to the subset of data). To make the optimal
choice for the attribute, we can measure the misclassification impurity i(t):

i(t) = 1 − max_j P(C_j | N)    (4.1)

where P(C_j | N) is the fraction of the training data in category C_j that ends up in node N.
Several functions can be used for i(t); here, we will only discuss the Gini impurity, which uses

i(t) = Σ_{i≠j} P(C_i | N) P(C_j | N)    (4.2)

This gives the expected error rate at node N if the class label is selected randomly, and can be
rewritten as

(1/2) [1 − Σ_j P(C_j | N)²]    (4.3)

As a result, the Gini algorithm will search for the largest class and isolate it from the rest of
the data (the purest split). It has been shown that Gini works well with noisy data [24].
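Equations 4.1 to 4.3 translate directly into a few lines of Python; the class fractions used below are illustrative:

```python
def misclassification_impurity(class_fractions):
    """i(t) = 1 - max_j P(Cj | N)  (Eq. 4.1)."""
    return 1 - max(class_fractions)

def gini_impurity(class_fractions):
    """i(t) = sum_{i != j} P(Ci|N) P(Cj|N) = 1/2 * (1 - sum_j P(Cj|N)^2),
    following Eqs. 4.2 and 4.3."""
    return 0.5 * (1 - sum(p ** 2 for p in class_fractions))

# A pure node (all data in one class) has impurity 0; an even 50/50
# split is maximally impure for two classes.
print(gini_impurity([1.0, 0.0]))               # 0.0
print(gini_impurity([0.5, 0.5]))               # 0.25
print(misclassification_impurity([0.5, 0.5]))  # 0.5
```

During tree growing, the attribute chosen at each node is the one whose split yields the largest decrease in impurity over the child nodes.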
Fully growing a tree can lead to overfitting: the lower splits are decided based on less and
less data and may therefore be poor splits, especially when the data is noisy. To counter this,
the tree can be pruned. There are two options: pre-pruning, where growing stops when there is
an insufficient amount of data to make a split, and post-pruning, where the tree is first fully
grown and sub-trees are then removed. In CART, post-pruning is chosen because this often
leads to the best results. We will not discuss these methods any further, as random forest
classification uses fully grown trees.
Before going further with random forests, we will briefly go over the time complexity of a
decision tree. If we denote the number of data points by N and the dimensionality (the number
of features) by D, the runtime complexity is O(log2 N) and the training complexity is
O(D N² log2 N) [24]. It is clear that training a tree is an expensive task, but running a test is not.
Random forests create an ensemble of many unpruned trees. According to Breiman,
randomness is very important [22]: by injecting the right amount of randomness for each tree,
one can minimize the correlation between trees while maintaining their strength. This is done
by randomly selecting a feature or feature combination at each node to make the split. This
procedure has the following advantages:
1. Its accuracy is as good as AdaBoost’s, and sometimes better.
2. It is relatively robust to outliers and noise.
3. It is faster than bagging or boosting.
4. It gives useful internal estimates of error, strength, correlation and variable importance.
5. It is simple and easily parallelized.
By using bagging in combination with random feature selection, one can minimize this
correlation. Each new training set is drawn, with replacement, from the original training set,
and a tree is grown on it using random feature selection. Importantly, the trees are not pruned;
pruning is not necessary because of the bagging and the weak correlation between trees. In
general, approximately one-third of the original training set is left out of each new training set.
This unused one-third (the out-of-bag data) can be used for validation. Bagging is used because
it improves accuracy and because it can give an estimate of the generalization error, the
strength and the correlation of the combined ensemble [22].
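The one-third figure follows from drawing with replacement: each example is missed by a bootstrap sample with probability (1 − 1/N)^N ≈ e⁻¹ ≈ 0.368. A small stdlib-only sketch (the sample size is an arbitrary choice of ours):

```python
import random

random.seed(42)
n = 10_000

# Bagging: draw a bootstrap sample (with replacement) of the same size
# as the original training set.
bootstrap = [random.randrange(n) for _ in range(n)]

# Examples never drawn form the out-of-bag (OOB) data; in expectation this
# is (1 - 1/n)^n, roughly e^-1 ~ 36.8% of the set, i.e. about one third.
oob_fraction = 1 - len(set(bootstrap)) / n
print(0.30 < oob_fraction < 0.43)  # True
```

Because the OOB examples played no role in growing a given tree, evaluating that tree on them gives an unbiased error estimate without a separate validation set.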
Trees are grown according to the CART methodology explained earlier, with the split at each
node made over a random group of features. The trees are fully grown and not pruned. Because
each split only considers a random subset of the features, training is faster, O(D log2 N / M)
with M the number of input variables, and the inter-tree dependencies are minimized.
Logistic regression
Logistic regression is a mathematical model used to predict the outcome of an event based
on one or more features. It arises from the desire to model the posterior probabilities
of the K classes via linear functions in x, while at the same time ensuring that they sum to one
and remain in [0, 1] [25]. Classification is done by minimizing a cross-entropy function. For a
more complete view of the mathematics behind logistic regression, we refer to Hastie et al. [25].
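As a sketch of the binary case, assuming the standard sigmoid formulation and purely illustrative (not fitted) weights:

```python
import math

def sigmoid(z):
    """Maps a linear score to a posterior probability in [0, 1]."""
    return 1 / (1 + math.exp(-z))

def cross_entropy(y_true, p_pred):
    """The entropy-style loss that is minimized when fitting the model."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)

# Binary case: P(y=1|x) = sigmoid(w*x + b); the two class probabilities
# sum to one by construction. Weights here are illustrative, not fitted.
p = sigmoid(0.8 * 2.0 - 0.5)
print(0.0 < p < 1.0)                          # True
print(cross_entropy([1, 0], [0.9, 0.1]) > 0)  # True
```

For K > 2 classes, the sigmoid is replaced by a softmax over K linear scores, which again guarantees probabilities that sum to one.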
4.2.3 Collaborative filtering
Recommender systems are filtering systems that attempt to predict the value or rating that
a certain user would give a certain item. In recommender systems, collaborative filtering is
a technique in which available information about similar users or items is used to make this
recommendation [26].
Collaborative filtering can either be user-based or item-based. In the former, the system
searches for users who share the same rating patterns as the user at hand and then uses
information about these users to calculate a prediction for the active user. In the latter, a similar
approach is taken with items: first similar items are found, then information about these items
is used to make a prediction. Similarity between two items can be calculated in different ways
depending on the application [27].
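One common choice for the application-dependent item similarity mentioned above is the cosine of the rating vectors; this stdlib-only sketch uses hypothetical ratings:

```python
import math

def cosine_similarity(a, b):
    """One common item-item similarity: the cosine of two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical ratings of three users for three items (0 = not rated).
item_a = [5, 3, 0]
item_b = [4, 3, 1]
item_c = [0, 1, 5]

# item_b is rated much like item_a, so it is the better source for a
# prediction about item_a.
print(cosine_similarity(item_a, item_b) > cosine_similarity(item_a, item_c))  # True
```

A prediction for an unseen item is then typically a similarity-weighted average of the ratings the user gave to the most similar items.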
When one replaces the numeric value or rating with a tag, collaborative filtering can also be
used for automated tag recommendations. There is a certain amount of background information
necessary about either the user or the document (or both) for this technique; as such, this
technique is most applicable in graph-based methods, where such information is kept in the
form of graph relations. It is worth noting that no new tags can be recommended using this
approach, as all recommendations are based on tags that have been added to the system in the
past.
4.3 Solutions in literature
In this section, we will give an overview of specific methods found in recent literature. First,
in subsection 4.3.1, single-label classification will be discussed. In subsection 4.3.2 we will then
discuss some solutions for adding multiple labels to a single document.
4.3.1 Single-label learning
Content-based
Choudhary presented a content-based method for clustering text using the Universal Networking
Language (UNL) to capture semantic relations between words. By using this semantic
representation, Choudhary achieves better clustering results than a bag-of-words representation
using word frequencies [17]. In the UNL representation, the relations between
words in a sentence are visualized in a graph structure, as depicted in Figure 4.6. In this
method, classification is done by using a neural network based technique called Self Organizing
Maps (SOM), in which similar documents are mapped to the same nodes.
Structure-based
Garboni et al. published a data-mining method for clustering XML documents using only
structure-related information [18].
Figure 4.6: An example of a sentence in the UNL representation [17].
This data-mining algorithm searches documents for structural pattern similarities, which the
authors coined frequent sub-trees. An example of such a tree is shown in Figure 4.7. In this
method, classification happens in three steps:
1. Remove irrelevant XML tags.
2. Define the classes by extracting multiple frequent sub-trees from the training set.
3. Classify new documents in one of the defined classes by using a custom distance measuring
algorithm.
Figure 4.7: An example of a frequent sub-tree in two XML documents [18].
Graph-based and collaborative filtering
Lipczak proposed a tag recommendation system for folksonomies catered to individual users
[28]. This graph-based method classifies documents in three steps. In a first step, basic tags are
extracted from the title of a document. These tags are chosen based on a score for each word in
the title, equal to the number of times the word has been chosen as a tag divided by its total
number of occurrences. The second step serves to make a lexicon of all the possibly related
tags for the document at hand. This lexicon can be built in two ways:
1. Add other tags from the same resource to the lexicon.
2. Add tags closely related to the tags extracted from the title.
In the third step, the list of tags extracted in step 2 is narrowed down. The system uses
personomy-based filtering to accomplish this: tags closely related to the user (or similar to tags
the user has used before) get a higher priority.
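The title-word score from the first step can be sketched in a few lines; all the counts below are hypothetical, not taken from [28]:

```python
def title_tag_scores(title_words, times_tagged, occurrences):
    """Score each title word: (# times chosen as a tag) / (# occurrences)."""
    return {w: times_tagged.get(w, 0) / occurrences[w] for w in title_words}

# Hypothetical counts over a folksonomy corpus.
occurrences = {"neural": 200, "networks": 350, "introduction": 500}
times_tagged = {"neural": 150, "networks": 210, "introduction": 25}

scores = title_tag_scores(["neural", "networks", "introduction"],
                          times_tagged, occurrences)
best = max(scores, key=scores.get)
print(best, scores[best])  # neural 0.75
```

The ratio naturally demotes words that occur often but are rarely chosen as tags, such as the generic "introduction" here.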
Jäschke et al. created another graph-based tagging algorithm for folksonomies called FolkRank
[29]. In this method, a graph is constructed with users, tags and documents representing nodes.
When recommending a tag, the available tags are given a weight according to a ranking algo-
rithm like PageRank, based on the relations in the created graph. This algorithm suffers from
a problem, however, as the PageRank algorithm will always jump to the most globally popular
tags given enough iterations. Liu et al. proposed a method to fix this issue in the FolkRank
method, called FolkDiffusion [16]. This algorithm uses a ranking algorithm based on the physics
of heat diffusion rather than PageRank, overcoming the problem of topic drift.
Mishne created a tag recommender for weblog posts called AutoTag [30]. This algorithm
uses a collaborative filtering approach for tag recommendation, in which the blog posts them-
selves take on the role the user would have in a standard recommender system. The tags are
considered to be the items that can be recommended to the users. The system then follows a
traditional recommender system approach, which means tags assigned to blog posts similar to
the active blog post are recommended. The information flow is depicted in Figure 4.8. Similarity
is measured by querying an information retrieval system that has access to a large collection
of blog posts. Queries are generated from the active post in several ways, the most effective of
which was searching based on the most distinctive terms in the post.
Figure 4.8: The flow of information that results in a recommendation in the AutoTag algorithm
[30].
Tatu et al. also proposed a graph-based solution to the tag recommendation problem, more
specifically on bookmark data [31]. In this method, a certain document is identified by a
triple including the bookmarked content, the user, and tags given to the bookmarked content.
Additionally, relevant metadata is also added to each bookmark. Natural language processing
tools like WordNet are used to create well-functioning feature vectors by stemming concepts
and linking synonyms. Similar tags are then grouped together into conflated tags, avoiding the
polysemy, synonymy and spelling problems mentioned in the introduction of this chapter. Both
existing tags and new tags can be recommended using this algorithm: the former by distance
measuring between similar users or documents and the latter by measuring weights of words
used in the content of the bookmark.
4.3.2 Multi-label learning
In multi-label learning problems, each document can have multiple labels or tags. The possibility
for multiple labels introduces a new challenge, as there might be some dependencies in the set
of possible tags or labels. Figure 4.9 shows an example of such a problem, wher