DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2016

A graph database management system for a logistics-related service

MARCUS WALLDÉN

AYLIN ÖZKAN

KTH SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

A graph database management system for a logistics-related service

Marcus Walldén & Aylin Özkan

Bachelor of Information Technology Thesis

Information Technology

School of Information and Communication Technology

KTH Royal Institute of Technology

Stockholm, Sweden

15 September 2016

Examiner: Anne Hakansson

Supervisor: Alf Thomas Sjoland

© Marcus Walldén & Aylin Özkan, 15 September 2016

Abstract

Higher demands on database systems have led to an increased popularity of certain database system types in some niche areas. One such niche area is graph networks, such as social networks or logistics networks. Analyses of such networks often focus on complex relational patterns that sometimes cannot be handled efficiently by traditional relational databases, which has led to the introduction of some specialized non-relational database systems. Among the database systems that have seen a surge in popularity in this area are graph database systems.

This thesis presents a prototype of a logistics network-related service using a graph database management system called Neo4j, which is currently the most popular graph database management system in use. The logistics network covered by the service is based on existing data from PostNord, Sweden’s biggest provider of logistics solutions, and primarily focuses on customer support and business to business.

By creating a prototype of the service, this thesis strives to indicate some of the positive and negative aspects of a graph database system, as well as give an indication of how a service using a graph database system could be created.

The results indicate that Neo4j is very intuitive and easy to use, which would make it optimal for prototyping and smaller systems. However, due to the evaluation method used, more research in this area would need to be carried out in order to confirm these conclusions.

Keywords. Graph database, Relational database, Prototype, Logistics, Graph analysis, NoSQL

Abstract

Higher demands on database systems have led to an increased popularity of certain database system types in some niche areas. One such niche area is graph networks, such as social networks or logistics networks. Analyses of graph networks often focus on complex relational patterns that sometimes cannot be solved efficiently by traditional relational database systems, which has led to certain specialized non-relational database systems becoming popular alternatives. Many of the popular database systems in this area are graph database systems.

This work presents a prototype of a logistics network-related service that uses a graph database management system called Neo4j, which is the most widely used graph database management system. The logistics network covered by the service is based on existing data from PostNord, Sweden’s leading provider of logistics solutions, and focuses primarily on customer support and business-related analysis.

By creating a prototype of the service, this work strives to show some of the positive and negative aspects of a graph database management system, as well as to show how a service can be created using a graph database management system.

The results indicate that Neo4j is very intuitive and easy to use, which would make it optimal for prototyping and smaller systems, but due to the evaluation method used, more research in this area needs to be carried out before these conclusions can be confirmed.

Keywords. Graph database, Relational database, Prototype, Logistics, Graph analysis, NoSQL

Acknowledgements

We would like to thank our supervisor Thomas Sjoland and our examiner Anne Hakansson, who helped us with a wide variety of issues during the course of this project. We also appreciate the assistance of Petter Edlund and Torbjorn Sjogren, acting as supervisors at PostNord, who gave us great insight into the company and the various systems that we used.

Stockholm, September 27, 2016

Marcus Walldén & Aylin Özkan

Contents

1 Introduction
    1.1 Background
    1.2 Problem definition
    1.3 Purpose
    1.4 Goals
    1.5 Benefits, Ethics and Sustainability
    1.6 Methodology
    1.7 Stakeholders
    1.8 Delimitations
    1.9 Structure of the thesis

2 Database Systems
    2.1 Database Terms
        2.1.1 Database
        2.1.2 Database System
        2.1.3 Database Management System
    2.2 DBMS Types
        2.2.1 Relational Database Management System
        2.2.2 Graph Database Management System
    2.3 Graph DBMS Products
        2.3.1 Neo4j
        2.3.2 Titan
    2.4 Related work
        2.4.1 A Comparison of a Graph Database and a Relational Database
        2.4.2 Using Neo4j for mining protein graphs
    2.5 Summary

3 Methodology
    3.1 Research Paradigm
    3.2 Research Approach
    3.3 Research Design
    3.4 Data Collection
    3.5 Data Analysis
    3.6 Quality Assurance

4 System Development Methodology
    4.1 Waterfall Model
    4.2 Software Prototyping
    4.3 Spiral Model
    4.4 Summary

5 Modelling of Software Development Processes
    5.1 Development Model
    5.2 Consumer Profiling System
    5.3 Graph Data Model
    5.4 Data Migration
    5.5 Analysis
    5.6 Experimental Design
        5.6.1 Test Environment
        5.6.2 Hardware Setup
        5.6.3 Java
        5.6.4 Trello
    5.7 Assessing Reliability and Validity
        5.7.1 Reliability
        5.7.2 Validity
    5.8 Evaluation Framework

6 Implementation
    6.1 Model
    6.2 Data Migration
        6.2.1 Data Retrieval
        6.2.2 Parsing
        6.2.3 Database Creation
    6.3 Analysis
        6.3.1 Customer Support
            6.3.1.1 Find Shipment by Kolli-id
            6.3.1.2 Find Shipments for a Consignee
            6.3.1.3 Find Shipments for a Consignee from a Consignor
            6.3.1.4 Find Top 10 Suitable Delivery Points for a Consignee
        6.3.2 Business To Business
            6.3.2.1 Find all Shipments Sent to Organisations
            6.3.2.2 Find Biggest Organisations in an Area
            6.3.2.3 Find Consignees’ Consignees

7 Graph Database Management System for a Logistics-related Service
    7.1 Prototype Results
        7.1.1 Model
        7.1.2 Analysis
            7.1.2.1 Find Shipment by Kolli-id
            7.1.2.2 Find Shipments for a Consignee
            7.1.2.3 Find Shipments for a Consignee from a Consignor
            7.1.2.4 Find Top 10 Suitable Delivery Points for a Consignee
            7.1.2.5 Find all Shipments Sent to Organisations
            7.1.2.6 Find Biggest Organisations in an Area
            7.1.2.7 Find Consignees’ Consignees
    7.2 Interview Results
        7.2.1 Graph Data Model
        7.2.2 Analysis
            7.2.2.1 Customer Support
            7.2.2.2 B2B
        7.2.3 Consumer Profiling System
    7.3 Reliability Analysis
    7.4 Validity Analysis
    7.5 Discussion
        7.5.1 Model
        7.5.2 Analysis
            7.5.2.1 Customer Support
            7.5.2.2 B2B
        7.5.3 Consumer Profiling System

8 Conclusions and Future work
    8.1 Discussion
    8.2 Future work

9 Summary

References

A Database Creation

B Customer Support
    B.1 Find Shipment by Kolli-id
    B.2 Find Shipments for a Consignee
    B.3 Find Shipments for a Consignee from a Consignor
    B.4 Find Top 10 Suitable Delivery Points for a Consignee

C Business to Business
    C.1 Find all Shipments Sent to Organisations
    C.2 Find Biggest Organisations in an Area
    C.3 Find Consignees’ Consignees

List of Figures

2.1 A visual illustration of a graph with two nodes and one edge.
2.2 A graph data model with nodes containing properties
2.3 A query written in Cypher

4.1 A graphical representation of the waterfall model.
4.2 A graphical representation of the six stages of software prototyping.
4.3 A visual representation of the spiral model, including the four reoccurring steps in a clockwise spiral.

5.1 A visual representation of the development model.
5.2 The iterative phases of creating the graph data model.
5.3 The model for data migration.
5.4 The model used to create queries.

6.1 The basic components implemented for the graph model.
6.2 The advanced graph model.
6.3 The complete graph model that will be used to create the graph DBMS.

7.1 A Shipment node with its relations and properties.
7.2 All Shipments sent to a Consignee.
7.3 Two Shipments sent from a Consignor to a Consignee that has a Party.
7.4 Top 10 Delivery Points by delivered Shipments for a Consignee.
7.5 The top 10 Organisations that an Organisation has sent Shipments to, ordered by the amount of Shipments.
7.6 Organisations that received the highest amount of Shipments in an Area.
7.7 The Consignees of an Organisation’s Consignees, ordered by the amount of Shipments.

List of Tables

2.1 Two visual examples of representation in an RDBMS.

5.1 Terminology used for the specification of requirements.
5.2 Specification of Requirements - Customer Support.
5.3 Specification of Requirements - Business To Business.
5.4 Hardware setup used for Neo4j environment.

6.1 Visual example of two files containing a list of Person or Pet with corresponding properties.
6.2 Visual representation of how a relationship between a Person and a Pet might look.
6.3 All CSV files with all the migrated data in nodes and relationships.

7.1 Node types and their corresponding properties.

List of Acronyms and Abbreviations

This thesis requires readers to be familiar with multiple terms related to databases, logistics and other areas. As such, the most important terms are specified here, prior to their use in the coming chapters.

DBMS Database Management System

RDBMS Relational Database Management System

SQL Structured Query Language

JSON JavaScript Object Notation

CSV Comma Separated Values

B2B Business To Business

Chapter 1

Introduction

1.1 Background

Digitally stored data can be stored and accessed in many different ways, resulting in many different types of database management systems (DBMSs). DBMSs can manage and store data, update it and perform analyses [1]. This kind of data storage has been popular ever since its inception, allowing companies and other groups to store customer information, product specifications and many other things.

Many DBMS types exist, but they are generally split into two groups: relational and non-relational [2]. Relational database management systems (RDBMSs) have been in use for decades and are the most popular type of DBMS [2]. An RDBMS is a specific type of DBMS, whereas the term non-relational database management system refers to all types of DBMSs that are not relational, i.e. not an RDBMS [2]. As such, this term covers many different kinds of DBMSs, all of which have their own underlying principles and uses. Many non-relational DBMSs are specialized for a specific niche market [2], potentially allowing such systems to offer more customized solutions in a specific area.

1.2 Problem definition

When DBMSs were first put into use, the amount of data and the analysis complexity tended to be quite low. Furthermore, the stored data was often static. As time went on, systems started to grow more complex as more features and functionality were added, resulting in larger amounts of data, higher analysis complexity and a big increase in data ingestion [3].

Relational DBMSs are commonly used in many situations, even though they have multiple limitations, such as being quite static and poor at handling large amounts of unstructured data [3]. As some systems evolved and became larger and more complex, some of these limitations became more apparent. As such, there was a need to find alternatives to RDBMSs in some niche areas where RDBMSs did not perform adequately.

Graph networks, i.e. connected sets of nodes and edges, are one such area, as is evident from their generally complex data analysis and relational structures. As graph networks grow, with added functionality and features, they generally contain more complex relational properties and larger quantities of data. This in turn creates a problem for certain DBMSs, such as RDBMSs. Due to their static structure and other factors, it is hard to create an RDBMS that scales well with certain graph network systems.

As such, there is a need to find an alternative DBMS that can solve the problems related to graph networks, such as logistics networks. One type of non-relational DBMS that has specialized in the area of graph networks is the graph database management system, which aims to represent graph networks in an effective way. Graph DBMSs have become popular alternatives to relational databases with regard to graph networks. Such DBMSs are currently in use by many companies, especially for logistics-related purposes, including Fortune 500 companies such as eBay [4]. Because graph DBMSs are fairly new, more information is needed in order to show that they are valid alternatives to RDBMSs in the field of graph networks, and to establish what positive and negative properties they have. Furthermore, more information on how a graph DBMS could be created and used is needed in order to raise awareness of and interest in graph DBMSs.

1.3 Purpose

The purpose of this thesis is to develop a prototype of a service based on a graph network using the graph database management system Neo4j. The graph network portrays a logistics system used by PostNord, Sweden’s biggest provider of logistical services [5], and the data used in the service is taken from some of their existing systems. The service is to provide multiple analysis tools related to customer support and business to business, as defined by a specification of requirements provided by PostNord, and is targeted at end customers and businesses.

The main objective of the thesis is to pinpoint how such a service using a graph database management system can be developed, as well as to explore some of the advantages and disadvantages of the system.

By developing such a system and showcasing its creation and advantages, it is possible that entities facing similar graph network-related problems, or other types of problems for that matter, will become interested in exploring non-relational DBMSs. More interest in this area could potentially result in more specialized solutions being found and thus solve some of the existing problems many entities face with RDBMSs today.

1.4 Goals

The goal of this thesis and the degree project is divided into the following sub-goals:

1. Develop a prototype of a logistics-related service using a graph DBMS

2. Showcase how such a prototype could be created

3. Analyse the end product and identify some of the potential strengths and weaknesses of the graph DBMS

To summarize, the goal is to portray aspects of both the development process and the end product. By focusing on both aspects it is possible to better display the potential uses of a graph DBMS.

1.5 Benefits, Ethics and Sustainability

This thesis strives to benefit all types of entities that use DBMSs by showcasing how a graph DBMS can be created and used. The degree project builds on existing systems at PostNord and on a specification of requirements that they have created. As such, PostNord directly benefits from the results and conclusions reached by this thesis.

The degree project is based on existing data from PostNord, which means that sensitive and private information is potentially used. In order not to compromise potentially sensitive data, all data from PostNord used throughout the thesis is anonymized.

The results and conclusions of this thesis are based on the current versions of a variety of software systems. Some systems are currently under development, meaning that future versions of the software systems could potentially provide results that differ from those of this thesis. The results and conclusions offer an insight into the current state of graph DBMSs, and could provide different entities with information that could improve the sustainability and performance of their database systems.

1.6 Methodology

A research methodology includes concepts such as paradigms and theoretical methods and models. It can be viewed as a framework for the research process, i.e. all the steps and phases included in a project.

There are two research methodologies: qualitative and quantitative [6]. They differ in the way they collect, analyse and validate data, in addition to how the research strategy is designed. The quantitative research methodology often consists of measuring variables, where the results must be evaluable and measurable [6]. The qualitative research methodology, on the other hand, focuses on behaviours and perceptions, and can include results that are not measurable, but rather based on opinions.

The goal of the thesis is to display how a prototype of a graph DBMS is created and used, meaning that the results are largely based on opinions and perceptions. The project also does not use any analysis of measurable statistical data in order to reach the results, meaning that a qualitative research methodology is used.

1.7 Stakeholders

The stakeholder for this thesis and the degree project is PostNord. PostNord is the biggest provider of logistics solutions on the Swedish market, handling close to 85% of all distributed letters in 2014 [5]. Due to their field of work as well as their huge customer base, they have a lot of complex data and are interested in finding alternatives to RDBMSs. As such, they have provided the specification of requirements used for the degree project.

1.8 Delimitations

Due to the scale of the degree project, certain components were left out, such as:

• Cost Analysis

The costs of the products and services used in this degree project have not been considered and are not a part of this thesis. Furthermore, the cost of the necessary development work has not been estimated and is not taken into consideration.

• Comparison Analysis

Due to the time limitation of the degree project, no comparisons to other types of DBMSs were made. The aim of the thesis is primarily to present the development process and uses of a graph database system.

• Ethical Analysis

The prototype created in this degree project handles potentially sensitive and private information, meaning that using the data in an actual service could be unethical, depending on its uses. The ethical aspect is not taken into consideration in this thesis, but the data used throughout the project is modified so as to avoid any privacy concerns.

• Security Analysis

The handling of private and sensitive information potentially requires certain security measures with regard to the created service and the database system itself. As this is only a prototype of the service, such aspects have been left out.

1.9 Structure of the thesis

Chapter 2 presents background information about database systems and introduces all important systems that are used throughout the degree project. Chapter 3 details the different methodologies used in the thesis and the underlying degree project. Chapter 4 explains the system development method that is used for the degree project, while Chapter 5 details the different development models and processes of the degree project. Chapter 6 describes the implementation process and Chapter 7 presents and discusses the results. Lastly, Chapter 8 provides a conclusion to the thesis and Chapter 9 delivers a summary of the thesis as a whole.

Chapter 2

Database Systems

This chapter provides the theoretical background of this thesis, specifically with regard to the different terms and products related to database systems that are used. Different database management systems are explained and related works are discussed.

2.1 Database Terms

This section explains the database-related terms used throughout the thesis.

2.1.1 Database

A database stores a collection of data in a structured format [1]. The stored data can be images, text, numbers, and so forth. It is worth noting that there is no requirement that the data is stored on a computer.

2.1.2 Database System

A database system consists of one or more databases as well as software to access and process the stored data [7]. In contrast to a database, a database system can be used to store data on a large scale, thanks to its streamlined access and processing methods.

2.1.3 Database Management System

A database management system, DBMS, is a software application that creates and manages databases [8]. It offers the same functionality as a database system, i.e. accessing and processing stored data, but it also provides other features. Generally, a database management system offers tools to manipulate and analyse stored data, in addition to administrative tools such as logging and backup services.

A DBMS also includes a database model, which defines the database’s logical structure [8]. The logical structure defines how data is stored in the databases of the DBMS. Many different database models exist, many of which function in different ways. This in turn means that different DBMS types exist, since the logical structure of the stored data changes how a DBMS functions as a whole.

2.2 DBMS Types

This section provides basic information about some DBMS types. Many different types of DBMSs exist, but due to the scale of the thesis only relevant types are addressed in this section.

2.2.1 Relational Database Management System

The website db-engines.com keeps a ranking of the biggest DBMSs in the world, based on popularity. Of the top 10, seven are relational database management systems [9]. This illustrates the massive market share RDBMSs have in today’s world.

Table 2.1: Two visual examples of representation in an RDBMS.

ID   Name     Age  Pet ID
522  John     31   5
523  Charlie  25   6

ID  Type
5   Cat
6   Dog

An RDBMS stores data in tables consisting of rows and columns. Tables can relate to other tables, as illustrated by the two example tables in Table 2.1, which creates the relational connectivity needed by graph networks and many other types of systems. Each table often consists of a single entity type along with the entity’s properties. Common to all RDBMSs is that they are based on the definition of the relational model, created by E.F. Codd in 1970 [10].
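
The relation between the two tables in Table 2.1 can be sketched with a small, self-contained Python example using the standard-library sqlite3 module. This is only an illustration: the table names person and pet, the column names and the use of SQLite are assumptions made for the sketch and are not taken from any system described in this thesis.

```python
import sqlite3


def build_example_database():
    """Create an in-memory database holding the two tables of Table 2.1."""
    conn = sqlite3.connect(":memory:")
    # Table and column names are hypothetical; only the values come from Table 2.1.
    conn.execute("CREATE TABLE pet (id INTEGER PRIMARY KEY, type TEXT)")
    conn.execute(
        "CREATE TABLE person ("
        "id INTEGER PRIMARY KEY, name TEXT, age INTEGER, "
        "pet_id INTEGER REFERENCES pet(id))"
    )
    conn.executemany("INSERT INTO pet VALUES (?, ?)", [(5, "Cat"), (6, "Dog")])
    conn.executemany(
        "INSERT INTO person VALUES (?, ?, ?, ?)",
        [(522, "John", 31, 5), (523, "Charlie", 25, 6)],
    )
    return conn


def pets_by_owner(conn):
    """Join the two tables through the Pet ID column."""
    rows = conn.execute(
        "SELECT person.name, pet.type FROM person "
        "JOIN pet ON person.pet_id = pet.id ORDER BY person.id"
    )
    return rows.fetchall()


if __name__ == "__main__":
    print(pets_by_owner(build_example_database()))
```

The link between a person and a pet exists only as a matching value in two tables, so it has to be reconstructed with a join each time it is queried.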

2.2.2 Graph Database Management System

Graph DBMSs are fundamentally different from RDBMSs and many other types of DBMSs in how they store data. While an RDBMS generally stores entities of the same type together (e.g. person, dog, cat), thus creating a simple way of accessing and analyzing entities of the same type, graph DBMSs store data as nodes, edges and properties. A graph is a set of vertices connected by a set of edges (also known as lines) [11]. Throughout the thesis vertices are also referred to as nodes or points.

Figure 2.1: A visual illustration of a graph with two nodes and one edge.

Nodes generally represent entities (e.g. Person, Dog, Cat), edges usually represent the relations between nodes (e.g. Person HAS Pet) and properties represent the attributes of the nodes and edges. As such, a node is directly connected to its related entities, rather than being grouped with entities of the same type, as is common in relational DBMSs. In Figure 2.1, the relation between a Person and a Pet is visualized using a graph.
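
The node, edge and property concepts can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions rather than a description of how any graph DBMS is implemented: the Node and Edge classes, the neighbours helper and the example property values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # entity type, e.g. "Person" or "Pet"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: Node                    # node the edge starts from
    relation: str                   # relation type, e.g. "HAS"
    target: Node                    # node the edge points to
    properties: dict = field(default_factory=dict)


def neighbours(node, edges, relation):
    """Return the nodes reached from `node` by following edges of one relation type."""
    return [e.target for e in edges if e.source is node and e.relation == relation]


# Hypothetical example data mirroring the Person HAS Pet relation of Figure 2.1.
john = Node("Person", {"name": "John"})
cat = Node("Pet", {"type": "Cat"})
edges = [Edge(john, "HAS", cat)]

if __name__ == "__main__":
    print([n.properties for n in neighbours(john, edges, "HAS")])
```

Here the relation is stored as an explicit edge between two nodes, so following it does not require matching identifiers across tables.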

2.3 Graph DBMS Products

This section presents two of the most popular graph DBMS alternatives that are currently available: Neo4j and Titan. Due to the scale of the thesis, graph DBMS solutions other than Neo4j are not discussed in detail, but basic information about Titan is presented to provide a perspective on other possible solutions.

2.3.1 Neo4j

The first complete version of the graph DBMS Neo4j was released in 2010, and its ecosystem now comprises hundreds of thousands of developers, effectively making it the most popular graph DBMS in the world [12]. Today the DBMS is widely in use, including at many Fortune 500 companies such as eBay [4], which uses Neo4j for some services related to a logistics network. In this thesis a logistics network is defined as a system of operations that work together to deliver a product to the market. It includes the process of collecting shipments, transporting shipments and delivering shipments to the end customer.

The company behind Neo4j claims that it is “wicked fast” [13], which is indeed backed up by some third-party benchmarks [14], albeit not all [15]. The efficiency of the DBMS is largely dependent on the database’s graph data model, which is fairly unique compared to other types of DBMSs. Continuing the analogy of a person and their pet, Figure 2.2 illustrates how such a graph data model could look. A Person, who has a name, gender and age, has a Pet, which has a name, age and type (such as cat or dog).

Figure 2.2: A graph data model with nodes containing properties

The graph data model serves three main purposes:

1. Illustrate how a system works

The example graph data model presented in Figure 2.2 covers a very basic relationship between two types of nodes, but one could imagine a very complex system, containing hundreds of types of nodes and relationships. Such systems would be hard to understand or explain to others. The graph data model offers an intuitive way of understanding the inner workings of a system, even a more complex one, and could be a valuable component in product development.

2. Define how nodes are connected in the DBMS

The graph data model also explains how nodes are connected to one another; a Person can have a Cat, but a Cat cannot have a Person. All data stored in Neo4j has to be structured as per the graph data model definition. Relations not specified in the graph data model should as such not exist in the database, meaning that it is clear which relations can exist in the database and which cannot.

3. Define how the data of nodes and edges are stored

The graph DBMS Neo4j stores data as nodes and edges, like most other graph DBMSs. This means that the structure of the stored data is largely defined by the structure of the graph data model.
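
The second purpose, restricting which relations may exist, can be sketched as a simple check against a set of allowed relation patterns. This is a hypothetical Python illustration and not a Neo4j feature or API: the ALLOWED_RELATIONS set and the is_allowed function are names invented for the sketch, and the check is assumed to be performed by the application before data is written.

```python
# Relation patterns permitted by the example graph data model in Figure 2.2,
# written as (source label, relation type, target label). Hypothetical names.
ALLOWED_RELATIONS = {("Person", "HAS", "Pet")}


def is_allowed(source_label, relation, target_label):
    """Return True if the relation is part of the graph data model."""
    return (source_label, relation, target_label) in ALLOWED_RELATIONS


if __name__ == "__main__":
    print(is_allowed("Person", "HAS", "Pet"))   # a Person can have a Pet
    print(is_allowed("Pet", "HAS", "Person"))   # the reverse is not in the model
```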

Neo4j uses a declarative graph query language called Cypher in order to perform analysis on its database. Cypher borrows its structure from SQL [16], which is a standardized query language used by many relational databases. Two of the most important aspects of Cypher are its ease of use and intuitiveness, which can be illustrated by a simple example, shown in Figure 2.3.

MATCH (p:Person)-[:HAS]->(c:Pet)
WHERE c.type="Cat"
RETURN p.name

Figure 2.3: A query written in Cypher

The example query shown in Figure 2.3 tries to find a person, possibly more thanone, who has a cat as a pet. It then returns the names of all people who has a cat.It contains the three most fundamental pillars of Cypher:

• MATCH

The MATCH defines the pattern of the query. In the example a person, p, has a pet, c, and the query will look for all patterns in the graph database that match this description.

• WHERE

The WHERE filters the result in some way. In the example it is defined that only pets of the type Cat are of interest, so all other previous matches are removed from the result.

• RETURN

The RETURN defines what the output should be. In the example the names of the people who have cats are returned, but many other things could also be output from the result, such as the cat’s name, the person’s age, and so forth.
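To make the division of labour between the three clauses concrete, the following Python sketch mimics the query in Figure 2.3 on a tiny in-memory graph. It is not how Neo4j evaluates Cypher; the data, the dictionary layout and the function name people_with_cats are all made up for illustration.

```python
# Toy in-memory graph; all names and values are made up for illustration.
nodes = {
    1: {"label": "Person", "name": "Alice"},
    2: {"label": "Person", "name": "Bob"},
    3: {"label": "Pet", "type": "Cat"},
    4: {"label": "Pet", "type": "Dog"},
}
# Directed edges: (start node id, relationship type, end node id).
edges = [(1, "HAS", 3), (2, "HAS", 4)]


def people_with_cats():
    # MATCH (p:Person)-[:HAS]->(c:Pet): collect every pair that fits the pattern
    matches = [
        (nodes[s], nodes[e])
        for s, rel, e in edges
        if rel == "HAS"
        and nodes[s]["label"] == "Person"
        and nodes[e]["label"] == "Pet"
    ]
    # WHERE c.type = "Cat": drop matches whose pet is not a cat
    filtered = [(p, c) for p, c in matches if c["type"] == "Cat"]
    # RETURN p.name: project the remaining matches onto the person's name
    return [p["name"] for p, _ in filtered]


print(people_with_cats())
```

Each clause corresponds to one step: pattern matching produces candidate pairs, filtering removes some of them, and the projection decides what is output.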

Cypher has many more features, but due to the scope of the thesis only the fundamental aspects are covered. More information can be found in Neo4j’s Cypher manual [17].

2.3.2 Titan

Titan is the third most popular graph DBMS in the world [18] and has gained a large following since its initial release in 2012 [19]. It is open source and free to use [20].

Titan has a big focus on scalability and is optimized for distributed machine clusters [20].

2.4 Related work

Due to Neo4j being the most popular graph database in the world, a lot of studies have been done that focus on many different aspects of the product. As such this section strives to identify other studies that contain information, methods or ideas that may be usable for this thesis.

2.4.1 A Comparison of a Graph Database and a Relational Database

A study from 2010 focused on a comparison between the graph DBMS Neo4j and the RDBMS MySQL [21]. The study touches upon many interesting subjects that are relevant for this thesis.

It contains a small discussion about Cypher, Neo4j’s query language, and points out several advantages, such as ease of use. It specifically points out advantages with regard to the examined graph traversals, i.e. the act of traversing between nodes in a graph, where Neo4j generally outperformed the RDBMS it was compared to.
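As a reminder of what a graph traversal is, the following Python sketch performs a breadth-first traversal over a small adjacency list. It illustrates the general concept only and makes no claim about how Neo4j or the cited study implements traversals; the graph and the function name reachable are hypothetical.

```python
from collections import deque

# Hypothetical adjacency list: node -> directly connected nodes.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}


def reachable(start):
    """Breadth-first traversal returning nodes in the order they are visited."""
    visited = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.append(neighbour)
                queue.append(neighbour)
    return visited


print(reachable("A"))
```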

2.4.2 Using Neo4j for mining protein graphs

D. Hoksza and J. Jelínek wrote a thesis in 2015 that focused on using Neo4j to mine protein graphs [22]. Although the direction of their thesis is different from this one, several methods and ideas could still be seen as interesting.

In contrast to the study discussed in Section 2.4.1, this study mainly focuses on subgraph matching, the process of trying to match a certain pattern to a part of a graph, rather than graph traversals. In this area, for the queries handled in the study, Neo4j performed poorly, indicating that such queries will not perform as well as queries about graph traversals. Finally the study also questions Neo4j’s performance on larger graph networks, but also points out that it is competitive in smaller graphs.

2.5 Summary

For this thesis Neo4j will be used. It is the most used graph DBMS in the world [18] and provides many interesting features such as the graph data model and the declarative graph query language Cypher. Many other graph DBMSs, such as Titan, exist, but due to the limited time frame of this thesis these will not be discussed in great detail.

Sections 2.4.1 and 2.4.2 discussed two studies that included Neo4j. The first study, which focused on comparing Neo4j and the RDBMS MySQL [21], found that Neo4j outperformed MySQL with regard to graph traversal queries. This stands in contrast to the second study, in which Neo4j was used to mine protein graphs [22]. In that study Neo4j underperformed with regard to subgraph matching. Combined, these two studies provide an indication of how Neo4j performs in certain types of queries, and whether or not this remains true for the degree project will be interesting to find out.

Chapter 3

Methodology

This chapter provides a detailed overview of the methods used to carry out research, data collection and quality assurance in this thesis as well as the underlying degree project.

3.1 Research Paradigm

The research paradigm sets the point of view of the project, meaning that it defines the principles that guide how the project as well as the results are viewed. As such it is an essential part of the project.

There are four core paradigms that could be used: Positivism, Realism, Interpretivism and Criticalism [23, 24]. Depending on which paradigm is used, the project and its results are evaluated in different ways. Interpretivism is often used in projects focused on opinions and perspectives, and is often used in software development [6]. It is often inductive in nature, which is highly applicable to this project. In addition, the results of the project are largely interpreted based on opinions and perceptions, and as such interpretivism has been used for this project.

3.2 Research Approach

The research approach is used to draw conclusions and decide whether or not something is true or false [6]. There are two main research approaches: inductive and deductive [6]. The inductive research approach is often used in combination with the qualitative research methodology. In contrast to the deductive research approach, results are often based on opinions and perspectives. Furthermore the deductive research approach is generally used to try to validate hypotheses or theories, meaning that the inductive research approach would be more suitable for the project, and is therefore used.

3.3 Research Design

The research design provides the guidelines for how research should be conducted through the project. This includes aspects such as planning, organizing and conducting the actual research [6]. There are three possible research designs that could be matched with the current research approach, which are Action Research, Exploratory Research and Grounded theory [6].

Action Research is generally used to find solutions and improvements to existing problems or concerns. It is a cyclic method of taking a certain action and then observing and analyzing the outcome [6].

Grounded theory strives to create a theory that is grounded in data [6]. Grounded theory is dependent on a systematic collection and analysis of data to create the underlying theory [6].

Finally, Exploratory Research aims to identify issues and different variables of a problem, using a qualitative data collection [6]. It rarely strives to find firm solutions to a problem, rather it aims to provide information about the problem’s different aspects.

This thesis is largely based on qualitative data such as interviews. Furthermore the main goal is to identify different aspects of a system, rather than trying to solve some problem. As such the exploratory research design, which best fits this description, is used for the thesis.

3.4 Data Collection

Given the paradigm used throughout the thesis, there is only one data collection method that can be used, which is the Interviews method [6]. The method consists of either open-ended or closed-ended questions, in a structured or unstructured format. The method strives to understand and document the point of view of a group of participants [6]. This matches perfectly with the thesis, since the main goal is to identify the positive and negative aspects of a system, which can be done by asking open-ended unstructured questions to a group of participants.

3.5 Data Analysis

There are two possible data analysis methods that can be used with the data collection method used in this thesis. Those are Analytic induction and Grounded theory. Both are iterative in nature, and collect and analyse data until a hypothesis has been reached that cannot be dismissed by the existing data [6]. The main difference between the two methods is that analytic induction ends when a hypothesis has been reached, while grounded theory ends when a validated theory has been reached [6].

Once again, one of the main goals of the thesis is to identify the positive and negative aspects of a system, which means that a hypothesis needs to be reached after analyzing the existing data. This conforms perfectly with the analytic induction method, which is why it is used through the thesis.

3.6 Quality Assurance

There are three possible quality assurance methods that can be used: Dependability, Confirmability and Transferability [6]. Dependability uses auditing to confirm that conclusions are correct, Confirmability makes sure that personal assessments have not interfered with the research or the results, and Transferability creates detailed descriptions that can be used as a reference or database by others [6].

Interviews are used to gather information about the results of the underlying thesis project, which means that the Dependability method is the most appropriate for this thesis, and is as such used.

Chapter 4

System Development Methodology

A system development methodology is a framework that structures and controls the development process of a system or service [25]. More than one development methodology can be used for a project. This chapter first introduces some system development methodologies that are relevant for the degree project, after which the methodology to use is selected.

4.1 Waterfall Model

The waterfall model offers a linear framework with multiple sequential phases. The main emphasis is on planning, target dates and implementing the whole product at one time. There is also a focus on written documentation as well as formal reviews [25]. The model is divided into five main steps, as illustrated in Figure 4.1.

In the first step, Requirements, requirements for the software are gathered. In Design, the second step, a design is created that dictates the process of implementation [25]. In the third step, Implementation, the implementation of the software is carried out as specified by the design. In the fourth step, Verification, the system is verified and tested so that it conforms to the requirements. In the fifth step, Maintenance, the system is deployed and supported [25].


Figure 4.1: A graphical representation of the waterfall model.

The waterfall model offers many positive aspects. Due to the clear definition of the different stages, the progress of the development is measurable. The extensive documentation and clear development process make it easier to add new developers to a project.

Due to the linear progression it is hard to use any iteration, which might be needed for some types of products, especially prototypes. Furthermore, the model makes it hard to adapt to changes of the product or the requirements, due to limited backtracking abilities. Because of the lack of iterations and the fact that all the implementation is done before the verification stage, it might also be hard to solve certain issues or bugs in the software [25].

4.2 Software Prototyping

Prototyping is an iterative methodology that often is used as a supplement to other methodologies. Often it is utilized to handle certain parts of the development stage [25]. When using the prototyping methodology, small versions of a system are usually created, and then expanded upon until the final correct version is completed. Figure 4.2 showcases the different steps in the prototyping methodology.

In the first step, Requirements Gathering, requirements are gathered, followed by the second step, Design, in which a design is created for how the process of implementation should be carried out. In step three, Prototyping, a prototype is created that fulfills some of the requirements of the system. In step four, Evaluation, the created prototype is evaluated by customers in order to provide feedback [25]. In step five the prototype is judged. If it is seen as adequate, i.e. fulfilling all of the requirements, the process continues to step six, Deliver System, where the system is deployed. If it is inadequate, step two is once again reached [25].

Figure 4.2: A graphical representation of the six stages of software prototyping.

Due to the methodology’s iterative nature, it is easy to identify errors and bugs during the implementation phase. It is also easy to confirm that all parties have a clear understanding of the requirements of the system, due to the continuous evaluations that occur in each iteration [25]. Changes to the requirements or the product itself can be easily accommodated due to its iterative nature.

The many iterations could lead to longer development times and higher costs. Furthermore, documentation may also be neglected, which can lead to poor design choices and also limit a product’s future potential [25]. Lastly, the methodology can lead to a lack of quality, as mock-up code and design choices might persist throughout the many iterations, leading to an inferior final product.

4.3 Spiral Model

The spiral model offers a framework type that is both linear and iterative in nature. The model focuses on developing in cycles, also known as phases, meaning that there is a primarily linear development in each cycle, but that different parts of a system or service are developed iteratively; each phase adding to the part of the service that is complete [25]. Figure 4.3 illustrates how the spiral model works.

In the first step, Analysis, objectives of the phase are determined and requirements are identified. In the second step, Evaluation, risks and other elements are identified and resolved. In the third step, Development, the implementation is carried out and in step four, Planning, the next phase is planned, after which step one commences again with the next phase.

Figure 4.3: A visual representation of the spiral model, including the four recurring steps in a clockwise spiral.

Due to the structure of the model, it is easy to utilize other system development methodologies, such as waterfall or prototyping, for some phases where it might be seen as appropriate. The model also improves the control of risks [25].

How the model is used varies depending on the project, and it is as such quite complex with limited reusability. The complex nature of the model might also lead to higher development costs and longer development times.

4.4 Summary

In this chapter, three different system development methodologies have been introduced and explained to some extent. First the waterfall model was introduced, which offers a linear development framework that makes it easier to plan and document the development process. Then the software prototyping methodology was introduced, which is iterative and often used as a supplement to other methodologies. Lastly the spiral model was presented, which is both linear and iterative in nature. In certain phases of the spiral model, other methodologies could be used in order to simplify the development process.

Section 2.3.1 explained the main aspects of Neo4j, such as the graph data model and the queries. In addition, there is also a need to create the actual database in Neo4j. All of these aspects that need to be developed are separate in nature, meaning that there is no need to specifically use a methodology such as the waterfall model. Instead, an iterative methodology could be used. Software prototyping is seldom used by itself as a system development methodology, which leaves the spiral model.

The spiral model is perfect in the sense that different phases are done in a linear fashion, which fits the degree project perfectly since the different aspects are separate from each other. Furthermore the spiral model can utilize other methodologies for some of the phases, which would further improve the development process. Creating queries, as an example, is often an iterative and exploratory process, meaning that software prototyping might be usable in such a situation. Because of the many positive aspects that the spiral model provides, it is used as the system development methodology for the degree project.

Chapter 5

Modelling of Software Development Processes

This chapter details the development model of the prototype service and defines the specification of requirements provided by PostNord. Individual development processes of the different aspects of the prototype service are also introduced, followed by hardware and software used throughout the development phase. Lastly the evaluation framework of the degree project is presented in detail.

5.1 Development Model

The definition of a development model used in this thesis is that it dictates how the software development should be carried out, by utilizing the system development methodology chosen in Chapter 4.

As explained in Section 2.3.1, a graph data model needs to be created. Furthermore, queries need to be created using Cypher. In addition the data also needs to be collected, parsed and imported into Neo4j, in order to create the database. These are the aspects that need to be developed, i.e. the three phases that will exist, using the spiral model, as illustrated in Figure 5.1.


Figure 5.1: A visual representation of the development model.

There are three phases: the graph data model phase, the data migration phase and the analysis phase. The analysis phase, in which the queries are created, needs to have access to the complete database in Neo4j before the phase can be finished, meaning that the data migration phase needs to be completed before the analysis phase. Likewise, in order to create the database in Neo4j the graph data model needs to be complete. As such the first phase is to create the graph data model, the second to migrate the data into Neo4j and the third to create the queries in the analysis phase.

Both the graph data model and the query creation processes are iterative and exploratory in nature, meaning that the prototyping method would be optimal to use in those two phases. As such the prototyping method is loosely used for both phase one and three, which is defined in detail in Sections 5.3 and 5.5.
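The ordering argument for the three phases is a small dependency problem and can be sketched as a topological sort. The snippet below is only an illustration of that reasoning, using Python's standard graphlib module (Python 3.9 or later); the dictionary dependencies is a hypothetical encoding of the prerequisites described in this section.

```python
from graphlib import TopologicalSorter

# Each phase maps to the phases that must be finished before it.
dependencies = {
    "graph data model": set(),
    "data migration": {"graph data model"},
    "analysis": {"data migration"},
}

# static_order() yields the phases so that prerequisites always come first.
phase_order = list(TopologicalSorter(dependencies).static_order())
print(phase_order)
```

Since every phase has exactly one prerequisite chain, only one valid order exists, which matches the sequence of phases chosen for the development model.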

5.2 Consumer Profiling System

To accomplish the goals set by this thesis a prototype of a system would need to be created. The system would need to be an interesting and realistic real world scenario in order to truly showcase the uses of a graph DBMS. As such PostNord has provided a specification of requirements of a mock-up consumer profiling system.

In order to understand the terminology used when discussing the consumer profiling system, basic information about the most relevant terms has been specified in Table 5.1.

Table 5.1: Terminology used for the specification of requirements.

Term                             Meaning
Consignor                        Sender of a shipment
Consignee                        Receiver of a shipment
Delivery Point                   Delivery point of a shipment
Party                            Contact information of a consignee
Organisation                     Company information of a consignee/consignor
Trade                            Category/sector of an organisation
Zip Area                         Country code + zip code
Estimated Time of Arrival (ETA)  The estimated time of arrival
Original ETA (oETA)              First estimated time of arrival of a shipment
Actual Time of Arrival (ATA)     Actual time of arrival of a shipment
volume                           The volume of a shipment
weightunit                       kg, g, etc.
service                          Service of shipment
lon                              Longitude coordinates
lat                              Latitude coordinates
Kolli-id                         The ID of a shipment

Two scopes of use were provided for the specification of requirements; one that focused on customer support, which in this thesis is defined as helping end customers, and one that focused on business to business, B2B, which in this thesis is defined as providing information or analysis to companies or organisations. The specification of requirements was based on what the system should be able to do, i.e. all requirements could be seen as queries that the system should be able to correctly handle during the analysis phase. Table 5.2 contains the requirements related to customer support and Table 5.3 the requirements regarding B2B.


Table 5.2: Specification of Requirements - Customer Support.

Query                                                 Description
Find Shipment by Kolli-id                             Use a shipment’s kolli-id to find information about it
Find Shipments for a Consignee                        Find all shipments for a consignee by using identifiers such as name, phone number and email
Find Shipments for a Consignee from a Consignor       Find all shipments for a consignee from a Consignor by using identifiers such as name, phone number and email
Find Top 10 Suitable Delivery Points for a Consignee  Find the 10 most suitable delivery points for a consignee, ordered by the consignee’s favourite delivery points

Table 5.3: Specification of Requirements - Business To Business.

Query                                     Description
Find all Shipments Sent to Organisations  Finds all shipments sent from one organisation to other organisations
Find Biggest Organisations in an Area     Finds the organisations that send the highest amount of shipments to an area
Find Consignees’ Consignees               Find all consignees of an organisation’s consignees

Tables 5.2 and 5.3 present the query names and also a description of what the queries should be able to accomplish. As such, all created queries need to fulfill the definitions specified in this section.
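To give a feeling for what some of these requirements ask for, the sketch below expresses three of them as plain Python functions over a handful of made-up shipments. It is not the prototype's implementation, which uses Cypher queries against Neo4j; the sample data, the field names (kolli_id, consignor, consignee, zip_area) and the function names are assumptions chosen for this illustration.

```python
from collections import Counter

# Made-up sample shipments; field names are assumptions for this sketch.
shipments = [
    {"kolli_id": "K1", "consignor": "Org A", "consignee": "Alice", "zip_area": "SE11120"},
    {"kolli_id": "K2", "consignor": "Org A", "consignee": "Bob", "zip_area": "SE11120"},
    {"kolli_id": "K3", "consignor": "Org B", "consignee": "Alice", "zip_area": "SE11120"},
    {"kolli_id": "K4", "consignor": "Org B", "consignee": "Alice", "zip_area": "SE41103"},
]


def find_shipment_by_kolli_id(kolli_id):
    """Customer support: look up a single shipment by its kolli-id."""
    for shipment in shipments:
        if shipment["kolli_id"] == kolli_id:
            return shipment
    return None


def find_shipments_for_consignee(name, consignor=None):
    """Customer support: shipments for a consignee, optionally from one consignor."""
    return [
        s["kolli_id"]
        for s in shipments
        if s["consignee"] == name and (consignor is None or s["consignor"] == consignor)
    ]


def biggest_organisations(zip_area):
    """B2B: organisations ranked by number of shipments sent to an area."""
    counts = Counter(s["consignor"] for s in shipments if s["zip_area"] == zip_area)
    return counts.most_common()


print(find_shipments_for_consignee("Alice", consignor="Org B"))
print(biggest_organisations("SE11120"))
```

Binding a query to a specific consignee or area in this way also gives small, concrete cases whose expected result can be worked out by hand.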

5.3 Graph Data Model

In this section the process of creating the graph data model is presented. The graph data model is one of the most important components of the prototype. As such it is paramount to find a correct abstraction of the problem, i.e. to construct the model in such a way that it correctly covers all aspects of the system in all possible situations. An iterative process was created to find the abstraction, which is illustrated in Figure 5.2.

Figure 5.2: The iterative phases of creating the graph data model.

As specified in Figure 5.2 the process was divided into six steps. These steps are explained below, starting with the first step, "Read Related Documentation", and ending with the last step, "Complete".

The first step, Read Related Documentation, includes accessing relevant documentation of the system that is to be created as well as information about the source systems from which the raw data will be acquired.

The second step, Build Model, then handles the creation of an iteration of the model, which is represented by a graph containing the needed data identified in step one.

Once an iteration of the model has been completed, the third step, Consult System Administrators, is reached, where the model is presented to key personnel with a deep knowledge of the source systems or the system to be created.

The fourth step, Correct Model?, identifies whether the created model is correct, upon which either the fifth or the sixth step is reached, depending on the feedback of the system administrators in step three.

The fifth step, Identify Errors, is reached if the created model does not correctly represent the new system. The possible faults in the model are then identified by either consulting the system administrators or accessing relevant documentation. Once the faults have been identified step two is once again reached, and a new iteration of the model can be created.

The sixth and final step, Complete, is reached once the created model is correct; i.e. once it covers all aspects of the system in all possible situations according to the specification of requirements.


5.4 Data Migration

This section presents the process of how the data needed for the database will be migrated into Neo4j. The raw data input could potentially come from many different sources, but for the sake of the thesis’ scope and time restrictions, only the most basic case is handled. Most often data can be found on existing DBMSs that are in use. Four steps have been identified in order to migrate the required data from other DBMSs to Neo4j. All the steps are identified in Figure 5.3, where the relations between the steps are also shown.

Figure 5.3: The model for data migration.

The first step involves accessing and retrieving all relevant raw data in some structured format, such as JSON or CSV. Both formats would potentially require some parsing and cleaning (step two) in order to convert the data to the input structure needed for Neo4j.

Once the raw data has been converted to the correct format it can be imported into Neo4j (step three), where it can be used for analysis by using the graph query language Cypher (step four).
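The parsing and cleaning in step two can be sketched as follows. The thesis project uses a Java parser built on internal PostNord libraries; the Python snippet below is only a stand-in that shows the kind of work involved. The raw format, the column names and the functions clean_rows and to_import_csv are all hypothetical, and the resulting comma-separated text is merely one plausible import format (it could, for example, be read with Cypher's LOAD CSV clause, which is an assumption and not something stated in this section).

```python
import csv
import io

# Made-up raw export; the column names are assumptions for this sketch.
RAW = """kolli_id;consignee;zip
K1; Alice ;111 20
K2;Bob;411 03
K2;Bob;411 03
"""


def clean_rows(raw_text, country_code="SE"):
    """Step two: parse the raw data, trim whitespace and drop duplicates."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=";")
    seen = set()
    rows = []
    for row in reader:
        kolli_id = row["kolli_id"].strip()
        if kolli_id in seen:
            continue  # skip duplicate shipments
        seen.add(kolli_id)
        rows.append({
            "kolli_id": kolli_id,
            "consignee": row["consignee"].strip(),
            # Zip Area = country code + zip code (Table 5.1)
            "zip_area": country_code + row["zip"].replace(" ", ""),
        })
    return rows


def to_import_csv(rows):
    """Write the cleaned rows as comma-separated text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["kolli_id", "consignee", "zip_area"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


print(to_import_csv(clean_rows(RAW)))
```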

5.5 Analysis

This section presents the model that will be used to create queries in Neo4j for the prototype service. The prototype needed to be able to perform certain analyses, as per the specification of requirements. Figure 5.4 illustrates the method used for creating queries. In total the method was divided into six steps, which are explained below.

Figure 5.4: The model used to create queries.

The first step, Read Query Requirements, includes accessing information about the query to be implemented, as well as understanding the underlying structure in the logistics network.

The second step, Construct, involves the process of creating a query to meet the requirements specified in the specification of requirements, using the information from step one.

Then the query is run in the third step, Run Query, and different interesting examples of the query result could be found. This could include binding a query to a specific company or person, which would lead to actual real world scenarios which could be more easily verified. This is further explained in Section 6.3.

The fourth step, Expected Results?, involves checking whether or not the query functions according to the specification of requirements. This could be done in many ways depending on the scenario. As mentioned in the third step, some queries could be bound to a specific person or company, for example, in which case it would be possible to manually calculate the result of a specific query. It could also involve consulting people familiar with the logistics system. If the results appear to be correct the query is complete, in which case the sixth step is reached. If the results seem to be incorrect, the fifth step is reached.

The fifth step, Identify Errors, identifies the possible errors in the query. Once an error has been found the second step is once again reached, where a new query will be created that does not contain the error.

Once the sixth step, Complete, is reached, the query is complete. At this step there should be no questions left about whether or not the query is correctly constructed according to the specification of requirements.


5.6 Experimental Design

This section presents the test environment as well as various software programs and hardware unrelated to DBMSs that are used throughout the development phase of the underlying degree project.

5.6.1 Test Environment

All tests showcased in this thesis are run on a laptop using Neo4j’s standard browser, which is able to visualize the results in the form of a graph, in addition to providing much information such as the run times of queries [26].

5.6.2 Hardware Setup

Table 5.4 provides information about the hardware used throughout the whole thesis project, including the test phase.

Table 5.4: Hardware setup used for Neo4j environment.

Name              Microsoft Surface Pro 3
Processor         Intel Core i5 4300U
GPU               Intel HD Graphics 4400
Internal Storage  128GB SSD
RAM               4GB DDR3
Operating System  Windows 10 Pro 64-bit

5.6.3 Java

Java is an object-oriented programming language [27] and is in the underlying thesis project used to parse raw data into the correct input format that Neo4j requires. Java was specifically chosen because existing internal libraries at PostNord could decrease the development time needed to create a parser for the Neo4j input data.

5.6.4 Trello

Trello is a task management product and can be used to manage the work flow in a project [28]. In the thesis project Trello was used to coordinate the implementation of the parser and the query creation by following the development processes. Many other task management products exist and Trello is not specifically needed for a project of this nature.

5.7 Assessing Reliability and Validity

Being able to assess the reliability and validity of a created product is hugely important in order to be able to confirm results and conclusions. This section details the process of how the results are validated and of how they are assessed to be reliable.

5.7.1 Reliability

There are two main aspects of the results where the reliability needs to be assessed. These are the graph data model and the query results.

The graph data model will be assessed in interviews by asking key PostNord employees if the graph data model is reliable, i.e. if it covers every possible scenario.

The query results in Neo4j need to be consistent and give a specific output given a specific input. This is assessed by the employees at PostNord in interviews, by confirming that the outputs are the same throughout multiple test runs.
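The consistency criterion for query results, the same output for the same input across multiple test runs, can also be checked mechanically. The following Python sketch shows the idea; it is not part of the thesis' evaluation, which relies on interviews, and both is_consistent and the stand-in example_query are hypothetical names.

```python
def is_consistent(run_query, runs=5):
    """Run the same query several times and check that every output is identical."""
    results = [run_query() for _ in range(runs)]
    return all(result == results[0] for result in results)


# Stand-in for a real query call; sorting makes the output order deterministic.
def example_query():
    return sorted(["K3", "K1", "K2"])


print(is_consistent(example_query))
```

In practice run_query would wrap a call that executes a Cypher query with fixed parameters; note that a query without an explicit ordering may return the same rows in a different order, so results may need to be sorted or compared as sets.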

5.7.2 Validity

There are two aspects of the degree project that need to be validated. The first aspect is the graph data model of the logistics network in use by PostNord and the second is the results of the queries in Neo4j.

The model is valid if it can fulfill two conditions. The first condition is that the model does not stray from underlying documentation of the logistics network and the second condition is that no administrator of the network at PostNord claims that it is incorrect. These conditions are checked by conducting interviews with key PostNord employees and asking about the model’s correctness.

The validity of the output of the queries run in Neo4j is assessed by conducting interviews with employees at PostNord and asking them if the output appears to be valid, which can be confirmed by assessing the raw data or assessing existing database systems.


5.8 Evaluation Framework

The usability, strengths and weaknesses of Neo4j are not clearly defined. They depend highly on the use case and the people who use the system. In this case PostNord has provided the specification of requirements used by the thesis project, and as such they are in the best position to provide feedback on the end result. Evaluation data of the graph data model, the queries, both results and implementation, and the prototype as a whole will be gathered both throughout the development process and from the final product.

As explained in Section 3.4, the evaluation of the prototype is based on interviews. There exist many different ways to conduct interviews, and this thesis will use unstructured and informal interviews in order to collect evaluation data. This means that there will be no structured interview questions and no specific person will be bound to a specific response. The reason why structured interviews are not used in this thesis is in order to capture the true feelings of the interviewed employees at PostNord. Much of the feedback is gathered throughout the development process, which means that the interview questions would have to be modified multiple times. Furthermore, since the project is very exploratory with no firm grasp of what the results would be, limiting the interviewed employees to answering predefined questions might limit the findings of the thesis.

The service proposed in the specification of requirements touches upon many of PostNord’s existing services and systems, meaning that different people will have insight into different aspects of the prototype service that is to be implemented. In addition, some employees might have a clear bias for or against graph DBMSs, such as system administrators or employees that work with graph DBMS-related products. Due to concerns like these there have been two limitations set in place that determine whether an employee at PostNord can contribute with feedback. The interviewed employees must first and foremost be knowledgeable about the aspect of the service that they are providing feedback on and they must not have a clear bias for or against graph DBMS-related products.

Chapter 6

Implementation

This chapter explains the implementation process of the prototype and includesthe various steps and challenges that arose throughout the development process.The implementation phases closely follow the defined development model anddevelopment processes defined in Chapter 5.

6.1 Model

The iterative model illustrated in Figure 5.2 was followed, and a large focus during the first stage (Read Related Documentation) was to identify various flows in the model, i.e. parts of the model that were connected in a clear way. These flows could then easily be converted into a part of the model. At first the three most basic components were identified. These were the Consignor, Shipment and Consignee, as well as the relations between the three nodes, illustrated in Figure 6.1.


Figure 6.1: The basic components implemented for the graph model.

As illustrated in Figure 6.1, a Consignor sends a Shipment to a Consignee. This is the most basic part of the model and is in some way connected to every query in the specification of requirements. Expanding upon this, a Consignee has a Party, which contains information about that Consignee, e.g. name, phone number and email. In addition, a Shipment is picked up at a location, called a DeliveryPoint, and the DeliveryPoint, Consignee and Consignor all need to be located in some area, called a ZipArea. Adding these pieces together yields the model illustrated in Figure 6.2.


Figure 6.2: The advanced graph model.

The graph data model shown in Figure 6.2 contains most of the information needed by the queries related to customer support, but it does not cover any business aspects, meaning that it does not cover the queries related to B2B. Both a Consignor and a Consignee can be a part of an Organisation, and an Organisation can be a part of a Trade. Adding these pieces to the model illustrated in Figure 6.2 completes the process: the result is a model that correctly describes how the system works, shown in Figure 6.3.


Figure 6.3: The complete graph model that will be used to create the graph DBMS.
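The structure of the complete model can also be written down as a list of relationship types. The following is a minimal Python sketch, not part of the prototype. The directions of SENT, TO, HAS, PICKED_UP_AT, BELONGS_TO and the DeliveryPoint LOCATED_AT relation follow the Cypher queries in Section 6.3. The entries marked as assumed (the direction of LOCATED_AT for Consignor and Consignee, and the type and direction of the Organisation to Trade relation) are guesses based on the prose description, and the helper name reachable is hypothetical.

```python
# Relationship types of the graph model as (start label, type, end label).
SCHEMA = [
    ("Consignor", "SENT", "Shipment"),
    ("Shipment", "TO", "Consignee"),
    ("Consignee", "HAS", "Party"),
    ("Shipment", "PICKED_UP_AT", "DeliveryPoint"),
    ("DeliveryPoint", "LOCATED_AT", "ZipArea"),
    ("Consignor", "BELONGS_TO", "Organisation"),
    ("Consignee", "BELONGS_TO", "Organisation"),
    ("Consignor", "LOCATED_AT", "ZipArea"),  # assumed direction
    ("Consignee", "LOCATED_AT", "ZipArea"),  # assumed direction
    ("Organisation", "IN", "Trade"),         # assumed type and direction
]


def reachable(start, schema=SCHEMA):
    """Return every label reachable from `start` by following edges forwards."""
    seen, frontier = set(), [start]
    while frontier:
        label = frontier.pop()
        for src, _rel_type, dst in schema:
            if src == label and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen
```

Under these assumptions every other node type is reachable from a Consignor, which matches the observation that the Consignor, Shipment and Consignee chain is connected to every query.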

6.2 Data Migration

This section explains how the raw data related to the logistics-related service was collected and passed to Neo4j. The data was needed in order to be able to successfully run the queries in the finished prototype.


6.2.1 Data Retrieval

This section presents how the data needed for the prototype service was collected. The data required by the consumer profiling system to run the queries in the specification of requirements could be retrieved from some of PostNord’s existing RDBMSs. Although the majority of the retrieved data was in CSV format, some was in JSON format, meaning that multiple parsers had to be used to convert the input to the structure required by Neo4j.

In order to limit the amount of data used by the prototype, only data from a short period of time was used. This resulted in around 9 million shipments, 3.5 million consignees, 90,000 consignors and 60,000 zip areas, all of which were retrieved from PostNord’s existing DBMSs.

6.2.2 Parsing

Prior to passing the collected data into the Neo4j database, some data needed to be modified. This was done by using a parser, which is explained in this section. Java was used in order to create a versatile and modular parsing system. As mentioned in Section 6.2.1, two parsers were needed to convert data from both JSON and CSV format to the input structure needed by Neo4j. In addition, a substantial amount of the retrieved data was not needed by the consumer profiling system, meaning that such data needed to be removed before creating the database in Neo4j.

A lot of the retrieved data was also based on user input, i.e. information that the customer had entered without any guidelines. As such the parsing process also needed to take into account many different variations, such as names with or without capital letters, differently formatted phone numbers, and so forth.
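The kind of clean-up described above can be sketched as follows. This is a minimal Python illustration and not the Java parser used in the project, whose exact rules cannot be published. The function names and the specific rules (case folding, whitespace collapsing, keeping only digits) are assumptions.

```python
import re


def normalise_name(raw):
    """Collapse whitespace and ignore letter case in a free-text name."""
    return " ".join(raw.split()).casefold()


def normalise_phone(raw):
    """Keep digits only, so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", raw)
```

With these rules "sven  SVENSSON" and "Sven Svensson" map to the same value, as do "070-123 45 67" and "0701234567". Country prefixes are not handled in this sketch.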

In order to use the data for this thesis project it was also necessary to anonymize some of the data so that PostNord’s customers would not be affected. This was done by using a hash function combined with a type identifier. The hash function converts a sequence of characters into a number that represents the original sequence.
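As an illustration of the idea, the sketch below hashes a value and prefixes the result with a type identifier. It is written in Python rather than the Java used in the project. The choice of SHA-256, the truncation to 64 bits and the function name anonymise are assumptions, since the thesis does not state which hash function was used.

```python
import hashlib


def anonymise(value, type_id):
    """Replace `value` by a type identifier followed by a number derived from its hash."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    number = int(digest[:16], 16)  # first 64 bits of the digest as an integer
    return f"{type_id}{number}"
```

The same input always yields the same output, so relations between records survive anonymisation, while the prefix (for example N for a name or T for a type) lets a reader tell apart hash values that belong to different kinds of properties.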

Continuing with the example of a Person who has a Cat, importing such information into Neo4j requires three CSV files: one containing the relation between the Person and the Cat, and one each to represent the Person and the Cat. This is illustrated in Table 6.1, which shows an example structure of the Person and Pet nodes, and Table 6.2, which shows the relation between the two nodes.


Table 6.1: Visual example of two files containing a list of Person or Pet with corresponding properties.

ID        :LABEL    name
person1   Person    NSven

ID        :LABEL    name     type
pet1      Pet       NSnow    TCat

As illustrated in Table 6.1, Sven has the identifier N (Name) in front of his name to signify that the value is a name. Snow also has an N in front of its name, and Cat has a T (Type) in front of it. These identifiers serve no specific purpose to Neo4j, but they are useful for the user to differentiate between the different hash values created by the parser.

Table 6.2: Visual representation of how a relationship between a Person and a Pet might look.

:START_ID(Person)   :END_ID(Pet)
person1             pet1

In Table 6.2 we set a start ID to person1, Sven’s ID, and an end ID to pet1, Snow’s ID. This creates an edge from Sven to Snow.
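The three files of the Person and Pet example could be produced as sketched below. This is a Python illustration only, writing to in-memory strings instead of files. The helper name to_csv is hypothetical, and the headers are copied from Tables 6.1 and 6.2; the exact header syntax expected by a given version of the Neo4j import tool may differ, in particular for the identifier column.

```python
import csv
import io


def to_csv(header, rows):
    """Render a header and data rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# One file per node type and one file per relationship type.
persons = to_csv(["ID", ":LABEL", "name"], [["person1", "Person", "NSven"]])
pets = to_csv(["ID", ":LABEL", "name", "type"], [["pet1", "Pet", "NSnow", "TCat"]])
has = to_csv([":START_ID(Person)", ":END_ID(Pet)"], [["person1", "pet1"]])
```

Keeping one file per node type and one per relationship type mirrors the structure used for the real data, where each relationship file only refers to node identifiers defined in the node files.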

Due to privacy concerns neither the exact parsing process nor the parsed data can be included in the thesis, but one CSV file was created for every type of node and edge in the final graph data model shown in Figure 6.3. As such the files shown in Table 6.3 were created.

Table 6.3: All CSV files with all the migrated data in nodes and relationships.

Node                 Relationship
zipAreas.csv         to.csv
consignors.csv       sent.csv
consignees.csv       belongs_to.csv
shipments.csv        has.csv
deliveryPoints.csv   in.csv
party.csv            located_at.csv
organisation.csv     picked_up_at.csv
trade.csv


6.2.3 Database Creation

Once the data files had been created, they could be imported into Neo4j. This was done from the Windows command prompt using the command shown in Appendix A. In that command, relationships:X x signifies that X is what the relationships in the file x should be labeled as in Neo4j. Neo4j then proceeds to create the database, after which the analysis can begin.

6.3 Analysis

This section describes the implementation of the queries defined in the specification of requirements. The process is divided into two subsections, which describe the queries related to customer support and B2B, respectively.

6.3.1 Customer Support

Four queries related to customer support were provided in the specification of requirements. These are individually implemented in this section.

6.3.1.1 Find Shipment by Kolli-id

The goal of this query is to use a shipment’s kolli-id to find information about it. A shipment has to be sent by a consignor and sent to a consignee, so these are the first pieces that can be identified as part of the query.

MATCH (cr:Consignor)-[:SENT]->(s:Shipment)-[:TO]->(ce:Consignee)

The consignee then has a party, which can also be included in order to add more information about the consignee to the result.

MATCH (cr:Consignor)-[:SENT]->(s:Shipment)-[:TO]->(ce:Consignee)-[:HAS]->(p:Party)

The query then needs to be bound to the specific kolli-id, after which the results can be returned.

MATCH (cr:Consignor)-[:SENT]->(s:Shipment)-[:TO]->(ce:Consignee)-[:HAS]->(p:Party)
WHERE s.id = "X"
RETURN cr, s, ce, p

In the final version of the query, X represents the kolli-id. Once the query has finished, information about the consignor, shipment, consignee and party is returned.


6.3.1.2 Find Shipments for a Consignee

In this query the goal is to find all shipments for a consignee by using identifiers such as name, phone number and email. The shipments, the consignee and the party will therefore be a part of the query. It is of no interest which organisation the shipments come from, so that does not need to be included in the query.

MATCH (s:Shipment)-[:TO]->(ce:Consignee)-[:HAS]->(p:Party)

The current version of the query will find all shipments that are sent to any consignee, so there is a need to bind it to a specific consignee. As mentioned, this can be done by binding it to a consignee’s email, phone number or name. The result can then be returned.

MATCH (s:Shipment)-[:TO]->(ce:Consignee)-[:HAS]->(p:Party)
WHERE (p.email = "X" OR p.mobile = "Y") AND p.name = "Z"
RETURN s, ce, p

In the final version of the query, X represents the email, Y the phone number and Z the name. The query then returns information about the shipments, the consignee, and the consignee’s party.
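The WHERE clause requires the name to match and, in addition, at least one of the email and the phone number. The following Python sketch restates that predicate outside of Cypher; the function name and the dictionary keys are illustrative assumptions that mirror the property names used in the query.

```python
def matches_consignee(party, email, mobile, name):
    """Mirror of (p.email = X OR p.mobile = Y) AND p.name = Z."""
    return (party["email"] == email or party["mobile"] == mobile) and party["name"] == name
```

A party with the right name but neither the given email nor the given phone number is therefore not matched, and neither is a party with a matching email but a different name.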

6.3.1.3 Find Shipments for a Consignee from a Consignor

The goal of this query is to find all shipments for a consignee from a consignor by using identifiers such as name, phone number and email. This is highly similar to the query handled in Section 6.3.1.2, except that it is bound to a specific consignor. As such the first iteration of the query looks almost the same, except that the consignors that sent the shipments are included.

MATCH (cr:Consignor)-[:SENT]->(s:Shipment)-[:TO]->(ce:Consignee)-[:HAS]->(p:Party)

Once again there is a need to bind the query to a specific consignee, by email, phone number, name and so forth. The consignor also needs to be defined, which can be done by providing the name of the consignor.

MATCH (cr:Consignor)-[:SENT]->(s:Shipment)-[:TO]->(ce:Consignee)-[:HAS]->(p:Party)
WHERE (p.email = "X" OR p.mobile = "Y") AND p.name = "Z" AND cr.name = "W"
RETURN cr, s, ce, p

In the final version of the query, X represents the email, Y the phone number, Z the name of the consignee and W the name of the consignor. Information about the consignor, shipments, consignee and the consignee’s party is then returned.

6.3.1.4 Find Top 10 Suitable Delivery Points for a Consignee

The goal of this query is to find the 10 most suitable delivery points for a consignee, ordered by the consignee’s favourite delivery points, i.e. the ones that are used most by the consignee.

To start things off, the goal is to find where all shipments are delivered, so the shipments, the consignee and the delivery points need to be included in the query.

MATCH (s:Shipment)-[:TO]->(m:Consignee),
(s:Shipment)-[:PICKED_UP_AT]->(n:DeliveryPoint)

The current query will run for all consignees, so there is a need to bind it to a specific consignee. Once it is bound, the results can be returned.

MATCH (s:Shipment)-[:TO]->(ce:Consignee),
(s:Shipment)-[:PICKED_UP_AT]->(dp:DeliveryPoint)
WHERE ce.name = "X"
RETURN dp.name, count(s) ORDER BY count(s) DESC LIMIT 10

In the final version of the query, X represents the name of the consignee. The 10 most used delivery points, counted by the number of shipments, are then returned, coupled with the number of shipments that have been delivered to each location.
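The aggregation performed by this query (filter on one consignee, count shipments per delivery point, sort in descending order, keep ten) can be sketched on in-memory data as follows. This is a Python illustration under the assumption that each shipment is given as a (consignee name, delivery point name) pair; the function name is hypothetical and the sketch is not how Neo4j evaluates the query.

```python
from collections import Counter


def top_delivery_points(deliveries, consignee, limit=10):
    """Count shipments per delivery point for one consignee, most used first."""
    counts = Counter(point for name, point in deliveries if name == consignee)
    return counts.most_common(limit)
```

For example, for the shipments [("X", "A"), ("X", "B"), ("X", "A"), ("Y", "C")] and the consignee "X", the sketch returns A with two shipments followed by B with one.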

6.3.2 Business To Business

In this section all three queries related to B2B that were provided in the specification of requirements are implemented.

6.3.2.1 Find all Shipments Sent to Organisations

The goal of this query is to find all shipments sent from one organisation to other organisations.

MATCH (o:Organisation)<-[:BELONGS_TO]-(:Consignor)-[:SENT]->(s:Shipment)
-[:TO]->(:Consignee)-[:BELONGS_TO]->(o2:Organisation)

The query then needs to be bound to the organisation that the shipments are sent from. Then the results can be returned.

MATCH (o:Organisation)<-[:BELONGS_TO]-(:Consignor)-[:SENT]->(s:Shipment)
-[:TO]->(:Consignee)-[:BELONGS_TO]->(o2:Organisation)
WHERE NOT o.name = o2.name AND o.name = "X"
RETURN count(s) AS sent, o2.name AS organisation
ORDER BY sent DESC

In the final version of the query, X represents the name of the sending organisation. The organisations that receive the shipments and the number of shipments they receive are then returned, ordered by the number of shipments.

6.3.2.2 Find Biggest Organisations in an Area

This query finds the organisations that send the highest number of shipments to a specific area. A consignor belongs to an organisation and sends a shipment that is picked up at a delivery point, which in turn is located in an area, a ZipArea.

MATCH (o:Organisation)<-[:BELONGS_TO]-(cr:Consignor)-[:SENT]->(s:Shipment)
-[:PICKED_UP_AT]->(d:DeliveryPoint)-[:LOCATED_AT]->(z:ZipArea)

The query then needs to be bound to a specific area. The organisations found also need to be sorted by the number of shipments that they send.

MATCH (o:Organisation)<-[:BELONGS_TO]-(cr:Consignor)-[:SENT]->(s:Shipment)
-[:PICKED_UP_AT]->(d:DeliveryPoint)-[:LOCATED_AT]->(z:ZipArea)
WHERE z.id = "X"
RETURN o.name, count(s) AS sum ORDER BY sum DESC

In the final version of the query, X represents the ZipArea’s id. The names of the organisations are returned, coupled with the number of shipments they send to the specific area, ordered by the number of shipments in descending order.

6.3.2.3 Find Consignees’ Consignees

The goal of this query is to find all consignees of an organisation’s consignees. As a start, all consignees of an organisation must be found.

MATCH (o1:Organisation)<-[:BELONGS_TO]-(c1:Consignor)-[:SENT]->(s1:Shipment)
-[:TO]->(ce1:Consignee)-[:BELONGS_TO]->(o:Organisation)
WHERE o1.name = "X" AND NOT o.name = "X"

Once all receiving organisations have been identified, their consignees can in turn be found by adding to the query. Once those consignees have been identified, they can be returned in descending order, ordered by the number of received shipments.

MATCH (o1:Organisation)<-[:BELONGS_TO]-(c1:Consignor)-[:SENT]->(s1:Shipment)
-[:TO]->(ce1:Consignee)-[:BELONGS_TO]->(o:Organisation)
WHERE o1.name = "X" AND NOT o.name = "X"
WITH o, o1
MATCH (o:Organisation)<-[:BELONGS_TO]-(c:Consignor)-[:SENT]->(s:Shipment)
-[:TO]->(ce:Consignee)-[:BELONGS_TO]->(o2:Organisation)
WHERE NOT o2.name = o.name
RETURN o1.name AS sending_organisation, count(s) AS sum, o2.name AS receivingConsignee
ORDER BY sum DESC

In the final version of the query, X represents the name of the sending organisation. The name of the sending organisation is returned, coupled with the consignees’ consignees and the number of shipments they receive.
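The two-step structure of this query can be sketched on in-memory data. The Python illustration below assumes that each shipment has been reduced to a (sending organisation, receiving organisation) pair, and the function name is hypothetical. One deliberate difference from the Cypher version should be noted: the sketch collects the first-hop organisations into a set, so each second-hop shipment is counted once, whereas WITH o, o1 without DISTINCT carries one row per matched first-hop path into the second MATCH.

```python
from collections import Counter


def consignees_of_consignees(shipments, sender):
    """Count shipments sent onwards by the organisations that `sender` ships to."""
    # First hop: organisations other than the sender that receive its shipments.
    first_hop = {dst for src, dst in shipments if src == sender and dst != sender}
    # Second hop: shipments from those organisations to any other organisation.
    counts = Counter(dst for src, dst in shipments if src in first_hop and dst != src)
    return counts.most_common()
```

As in the query, the sending organisation itself may appear in the result if one of its consignees ships back to it.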

Chapter 7

Graph Database Management System for a Logistics-related Service

In this chapter the results of the thesis are presented. The chapter is divided into five parts: the results of the finished prototype service, the feedback on the different aspects of the prototype, an analysis of reliability, an analysis of validity, and finally a discussion of the results presented in this chapter.

7.1 Prototype Results

This section presents the final version of the prototype and provides a detailed insight into every major aspect of the prototype service. Example outputs are provided in order to give a better picture of how the implemented queries might work in a normal situation.

7.1.1 Model

In this section the final version of the model is presented. The final version is shown in Figure 6.3. It represents the different aspects of the system as well as their relations to each other. The nodes’ names have previously been explained in Section 5.2, whereas the relations’ labels describe the relations’ types. Table 7.1 details the different properties of the nodes specified in Figure 6.3. The more difficult terms are explained in Table 5.1.


Table 7.1: Node types and their corresponding properties.

Trade Organisation Consignor ShipmentID ID ID ID

name name weightsource

weightunitvolumeserviceoETAATA

Consignee Party DeliveryPoint ZipAreaID ID ID ID

name name name lonemail zip lat

number typecountry

7.1.2 Analysis

This section provides the finished queries as well as example query results. Note that names and other important parameters have been edited in order to protect private information.

7.1.2.1 Find Shipment by Kolli-id

Appendix B.1 contains the final version of the query. Figure 7.1 showcases an example output of the query, illustrating the connection from the Consignor, to a Shipment, to a Consignee and finally to the Consignee’s Party. As a time reference, the example query took around 16 seconds to run.


Figure 7.1: A Shipment node with its relations and properties.

7.1.2.2 Find Shipments for a Consignee

Appendix B.2 contains the final version of the query. Figure 7.2 showcases an example output of the query, showing the Consignee, its Party, as well as all its Shipments. As a time reference, the example query took around 8 seconds to run.

Figure 7.2: All Shipments sent to a Consignee.



7.1.2.3 Find Shipments for a Consignee from a Consignor

Appendix B.3 contains the final version of the query. Figure 7.3 showcases an example output of the query, showing all shipments between a Consignor and a Consignee. Lastly, the Consignee’s Party is also shown. As a time reference, the example query took around 0.5 seconds to run.

Figure 7.3: Two Shipments sent from a Consignor to a Consignee that has a Party.

7.1.2.4 Find Top 10 Suitable Delivery Points for a Consignee

Appendix B.4 contains the final version of the query. Figure 7.4 showcases an example output of the query, showing the 10 most suitable delivery points, under the DeliveryPoint name in the figure, based on previous deliveries, seen under Deliveries in the figure. As a time reference, the example query took around 0.2 seconds to run.


Figure 7.4: Top 10 Delivery Points by delivered Shipments for a Consignee.

7.1.2.5 Find all Shipments Sent to Organisations

Appendix C.1 contains the final version of the query. Figure 7.5 showcases an example output of the query, showing all shipments sent from one Organisation to others, named under organisation in the figure. The number of shipments sent to a specific organisation is listed under the variable sent in the figure. As a time reference, the example query took around 1 second to run.



Figure 7.5: The top 10 Organisations that an Organisation has sent Shipments to, ordered by the number of Shipments.

7.1.2.6 Find Biggest Organisations in an Area

Appendix C.2 contains the final version of the query. Figure 7.6 showcases an example output of the query, showing the biggest Organisations in a specific area (ZipArea). The organisations are listed under the Organisation variable and the numbers of shipments are listed under the shipments variable. As a time reference, the example query took around 0.5 seconds to run.


Figure 7.6: Organisations that sent the highest number of Shipments to an Area.

7.1.2.7 Find Consignees’ Consignees

Appendix C.3 contains the final version of the query. Figure 7.7 showcases an example output of the query, showing the Consignees’ Consignees of a specific Organisation. In the figure, the sending organisation is listed under the sending_organisation variable, the number of shipments under the sum variable and the receiving organisation under the receivingConsignee variable.

As a time reference, the example query took around 10 seconds to run.



Figure 7.7: The Consignees of an Organisation’s Consignees, ordered by the number of Shipments.

7.2 Interview Results

This section details the results of the unstructured interviews with PostNord employees related to the various aspects of the prototype service. All major aspects of the prototype are evaluated, including the final version of the prototype service as a whole. The feedback is presented in a bullet-point style.

7.2.1 Graph Data Model

In this section all the feedback from the interviews related to the graph data model is presented.

• The model is very intuitive and it is easy to understand how the different aspects of the system are connected

• The model could be used to teach new employees and the like how different systems and networks function

• By looking at the model it could be easier to invent new and interesting queries

• By looking at the model it could be easier to create more efficient queries

7.2.2 Analysis

In this section feedback on the queries related to both customer support and B2B is presented. The feedback touches not only upon the finished queries and their output, but also upon the query creation process.

7.2.2.1 Customer Support

This section presents all the feedback from the interviews related to customer support queries.

• Simple implementations, fast runtimes

• Easy to build a service around Neo4j in regards to customer support

• Graph interface intuitive; easy for both customer and employee to understand

• Seemingly easier to implement and use than in existing systems

• Certain queries take too long, such as searching only by kolli-id

7.2.2.2 B2B

This section presents all the feedback from the interviews related to B2B queries.

• Simple queries seem to have fast run times and provide good results

• Still interesting, but not as interesting as in regards to customer support

• The query about Consignees’ Consignees indicates that more complex analysis on a larger scale might have poor performance

7.2.3 Consumer Profiling System

In this section all feedback from the interviews related to the prototype system as a whole is presented.

• Using Neo4j to create this system rather than an RDBMS could make it easier and faster to run certain queries related to customer support issues

• RDBMSs have many positive aspects, but implementing these kinds of queries would be a lot harder than in Neo4j

• I can do all of this using an RDBMS

• The main focus is its intuitiveness; specifically the graph data model is very powerful

• The query language and the graph DBMS make it very easy to add to or change the system, which could be harder to do if using an RDBMS

• Due to the simple setup process and its intuitiveness it would be a perfect tool for prototyping and the like

7.3 Reliability Analysis

This section analyses the reliability of the prototype service. In Section 5.7.1 two aspects were identified that needed to be assessed. These were the graph data model and the query results. The interviewed employees with knowledge of the logistics network confirmed that the graph data model created in the degree project correctly reflects their logistics network and covers every possible scenario. Furthermore, the interviewed employees also agreed that the queries were seemingly reliable, based on the fact that the same output was given for every specific input.

7.4 Validity Analysis

In this section the validity of the prototype service is analysed. In Section 5.7.2 two aspects of the degree project were identified that needed to be validated. These were the graph data model of the logistics network and the query results created in Cypher.

With regard to the graph data model, key employees at PostNord with insight into the logistics network confirmed that the model correctly portrays the logistics network and that it does not stray from the documentation of the network.

As for the query results, the interviewed employees confirmed that the output of the queries is correct, which was validated by assessing the relations of the raw data used in Neo4j, and that the queries function as per the definitions that were provided.


7.5 Discussion

In this section the results of the interviews are discussed. Because the evaluation method of this thesis is based on opinions rather than empirical data, the results depend heavily on the situation and the interviewed employees, meaning that the reaction to the prototype might vary greatly depending on who is interviewed. As such the analysis of Neo4j in this section should only be seen as an indication of how things might be, rather than as fact.

7.5.1 Model

The model turned out to be one of the most interesting aspects of Neo4j, according to the interviewed employees at PostNord. The interviews showed that the feedback was overwhelmingly positive, with some even suggesting that the model itself could be used in other parts of their organisation.

The key aspect of the model seems to be its intuitiveness and structure, possibly making it easier to craft new queries and come up with new ideas.

7.5.2 Analysis

This section discusses the results of the interviews with regard to the analysis process, covering both queries related to customer support and queries related to B2B.

7.5.2.1 Customer Support

The customer support queries were mostly quite simple in nature and turned out to perform adequately, according to the interviewed employees. Many were also impressed by the intuitiveness of the query language, pointing out that it seems to be quite dynamic and easy to adapt to the situation. The common positive thread seemed to be that it was easy to set up simple graph traversal queries using Neo4j, meaning that the development time could potentially be decreased for some types of projects.

7.5.2.2 B2B

Queries related to B2B also aroused some interest, but there was some doubt about the performance of Neo4j in more advanced queries. The queries implemented for the customer support aspect were simple in nature and were as such very intuitive and simple to implement. On the other hand, larger and more advanced queries related to B2B showed that the query creation process quickly becomes somewhat ambiguous and hard to decipher.

7.5.3 Consumer Profiling System

For the system as a whole, many interviewed employees were impressed by the uses of Neo4j, specifically with regard to its intuitiveness and the simplicity of creating simple queries. Not all comments were positive though; some argued that the system as a whole is redundant, as everything that can be implemented in Neo4j can also be created in a traditional RDBMS.

Many viewed it positively as a system to use for prototyping and for creating smaller and simpler systems such as customer support services. At the same time, many had doubts as to its performance and intuitiveness in more advanced situations, such as when implementing B2B tools and the like.

An aspect that most of the interviewed employees seemed to agree on was that Neo4j seemed to be more dynamic than an RDBMS, potentially making it easier to use in systems where the data model changes often, or in cases where a system regularly needs to upgrade or change its functionality in some way.

Chapter 8

Conclusions and Future work

Of the three goals presented in this thesis, all three were reached. A prototype of a logistics-related service was created using the graph DBMS Neo4j, and the creation process has been explained in detail throughout the thesis. Furthermore, many positive and negative aspects of Neo4j and insights into graph DBMSs have been presented and analysed.

To create the prototype service of the logistics network the graph DBMS Neo4j was used. Relevant information was taken from existing databases at PostNord and was used to create the database, after which the seven queries defined in the specification of requirements were created, four related to customer support and three related to B2B.

After the prototype service had been finalized, interviews with PostNord employees and further analysis provided some insights as well as positive and negative aspects of Neo4j and graph DBMSs as a whole.

Neo4j was identified as an intuitive and easy-to-use DBMS which could primarily be useful for prototyping and smaller systems. Neo4j, and potentially other graph DBMSs, is as such a potential alternative to RDBMSs in the area of graph networks, but because the evaluation method used in this thesis is based on opinions, more research would be needed to confirm this.

One of the most important insights gained from this work is the need to evaluate the graph DBMS in a variety of situations. The prototype mainly included queries related to graph traversal, and the queries that required more complex analysis patterns indicated that Neo4j does not perform equally well in such situations.


8.1 Discussion

Given the context of the created prototype, using the graph DBMS Neo4j seemingly provided many positive aspects, but also some negative ones. The graph data model was viewed as one of the most positive aspects. Its intuitiveness was identified as its main strength: some even argued that the model could be used for other aspects of their organisation in order to ease the learning curve for new employees.

Not all aspects were positive, though, as Neo4j did not perform adequately for all types of queries. Neo4j’s main strength is clearly graph traversal, but other types of analyses generally see a drop in performance, leading to long query run times.

A general consensus among the interviewed employees was that Neo4j could be used to great success for prototyping or when implementing smaller and simpler services, due to its flexibility and intuitiveness.

Due to the scope of the project the evaluation of the finished product is solely based on opinions from a handful of people. This means that the end result and the insights gained from this thesis are highly subjective and are not backed up by any kind of statistical data or analysis. Furthermore, the logistics network that the service is based on is a graph problem, and most queries are focused on graph traversals, an area highly connected to graphs. As such the thesis mainly addresses a situation that highly suits this type of DBMS, which limits the findings, since many systems could require queries that traditionally would not fit a graph DBMS like Neo4j.

8.2 Future work

For future work, a comparison analysis would be recommended. Although Neo4j proved to be a flexible and usable DBMS for the consumer profiling system, it is hard to evaluate it in detail without a thorough comparison to other DBMSs. The next step would therefore be to compare the prototype from the project to other systems, such as RDBMSs, with the same functionality. Based on performance evaluations and other aspects, clear distinctions could be made, which would make it easier to draw conclusions as to the positive and negative aspects of Neo4j.

It would be interesting to see an analysis of different types of queries and how well Neo4j handles them. As noted in this thesis and the two related works specified in Section 2.4, Neo4j’s performance varies greatly depending on the type of query, and it would be interesting to see if the problems with some queries in Neo4j are consistent for all graph DBMSs. Neo4j generally seems to perform well with regard to graph traversals, but most systems also include other types of queries, meaning that Neo4j’s performance in other types of situations is hugely important.

Due to the limited time frame of the degree project many things could not be included in the thesis and the underlying degree project. The three most important aspects that were left out were the cost analysis, the comparison analysis and the security analysis. The cost of using the graph DBMS Neo4j could prove to be a very important factor for many companies and organisations that consider using Neo4j. Furthermore, comparing Neo4j to other graph DBMSs and other types of DBMSs is also hugely important in order to see its performance relative to other DBMSs. Lastly, having a safe and secure system is also of the utmost importance. A security analysis would as such be needed in order to verify that the system is safe to use.

Chapter 9

Summary

In this thesis a prototype of a logistics-related service using the graph database management system Neo4j is created. The thesis had three goals, which were: 1) develop a prototype of a logistics-related service using a graph database management system, 2) showcase how such a prototype could be created, and 3) analyse the end product and identify some of the potential strengths and weaknesses of the graph database management system. By creating a service using a graph database management system and showcasing its creation process, as well as pinpointing some of the positive and negative aspects of the system, it is possible that some companies and other entities facing similar graph-related issues would become interested in exploring different graph database management systems.

A qualitative research methodology was used, coupled with interpretivism as the research paradigm and an inductive research approach. An exploratory research design was chosen, and data was collected in the form of unstructured interviews. The data was then analysed using the analytic induction method, followed by the dependability method for quality assurance.

By analysing the results obtained in the unstructured interviews, multiple positive and negative aspects were identified. The data model was viewed overwhelmingly favourably, mainly due to its intuitiveness. Furthermore, Neo4j was seen as a great platform for simple graph-related services and general prototyping, but some of those interviewed were sceptical as to whether it would be as useful in large and complex systems or services.


References

[1] Margaret Rouse, ‘What is database?’, Tech Target. Apr-2006 [Online]. Available: http://searchsqlserver.techtarget.com/definition/database. [Accessed: 12-Jun-2016]

[2] Christof Strauch, ‘NoSQL databases’, Stuttgart Media University, 2011 [Online]. Available: http://webpages.uncc.edu/xwu/5160/nosqldbs.pdf. [Accessed: 20-Jul-2016]

[3] Charles H. Silver, ‘Tables Are Dead: How to Overcome Relational Model Limitations’, eWeek. 20-Apr-2010 [Online]. Available: http://www.eweek.com/c/a/Database/Tables-are-Dead-How-to-Overcome-Relational-Model-Limitations. [Accessed: 06-Apr-2016]

[4] Neo Technology, Inc., ‘Neo4j Customer Success Stories and Case Studies’, 2016. [Online]. Available: https://neo4j.com/customers/. [Accessed: 05-Apr-2016]

[5] Olof Bjuro, Lars Forslund, Anders Hildingsson, Joakim Levin, Par Lindberg, Emma Maraschin, and Gabriel Rhawi, ‘The Swedish Postal Services Market 2015’, PTS-ER-2015:3, Apr. 2015 [Online]. Available: https://www.pts.se/upload/Rapporter/Post/2015/The%20Swedish%20Postal%20Services%20Market%202015.pdf. [Accessed: 06-Apr-2016]

[6] Anne Hakansson, ‘Portal of research methods and methodologies for research projects and degree projects’, in Proceedings of the International Conference on Frontiers in Education: Computer Science and Computer Engineering (FECS), 2013, p. 1 [Online]. Available: https://www.kth.se/social/files/55563b9df27654705999e3d6/Research%20Methods%20-%20Methodologies%281%29.pdf. [Accessed: 16-May-2016]

[7] Reference, ‘What is the meaning of “database system”?’, Reference. 2016 [Online]. Available: https://www.reference.com/technology/meaning-database-system-b6d11196cfd749b8#. [Accessed: 18-Jun-2016]


[8] Margaret Rouse, ‘What is database management system (DBMS)?’, Tech Target. Jan-2015 [Online]. Available: http://searchsqlserver.techtarget.com/definition/database-management-system. [Accessed: 29-Jun-2016]

[9] solid IT gmbh, ‘DB-Engines Ranking - popularity ranking of database management systems’, DB-Engines. May-2016 [Online]. Available: http://db-engines.com/en/ranking. [Accessed: 03-May-2016]

[10] Edgar Frank ‘Ted’ Codd, ‘A relational model of data for large shared data banks’, Communications of the ACM, vol. 13, no. 6, pp. 377–387, 01-Jun-1970.

[11] Richard J. Trudeau, Introduction to Graph Theory, 2nd ed. New York: Dover Publications, Inc., 1993.

[12] Neo Technology, Inc., ‘Company - Neo4j Graph Database’, 2016. [Online]. Available: https://neo4j.com/company/. [Accessed: 12-Apr-2016]

[13] Neo Technology, Inc., ‘Neo4j Graph Database: Unlock the Value of Data Relationships’, 2016. [Online]. Available: https://neo4j.com/product/. [Accessed: 12-Apr-2016]

[14] Marko Rodriguez, ‘MySQL vs. Neo4j on a Large-Scale Graph Traversal - DZone Database’, Database Zone. 05-Dec-2011 [Online]. Available: https://dzone.com/articles/mysql-vs-neo4j-large-scale. [Accessed: 12-Apr-2016]

[15] Curtis Mosters, ‘OrientDB vs Neo4j - Comparison of query/speed/functionality’, 28-Jan-2015 [Online]. Available: http://www.slideshare.net/kwoxer/orientdb-vs-neo4j-comparison-of-queryspeedfunctionality. [Accessed: 14-Apr-2016]

[16] Neo Technology, ‘8.1. What is Cypher? - The Neo4j Manual v2.3.3’. [Online]. Available: http://neo4j.com/docs/stable/cypher-introduction.html. [Accessed: 11-May-2016]

[17] Neo Technology, ‘Chapter 3. Cypher query language - The Neo4j Developer Manual v3.0’, 2016. [Online]. Available: http://neo4j.com/docs/developer-manual/current/cypher/. [Accessed: 11-May-2016]

[18] solid IT gmbh, ‘DB-Engines Ranking - popularity ranking of graph DBMS’, DB-Engines. Jul-2016 [Online]. Available: http://db-engines.com/en/ranking/graph+dbms. [Accessed: 21-Jul-2016]


[19] solid IT gmbh, ‘Titan System Properties’, DB-Engines. Jul-2016 [Online]. Available: http://db-engines.com/en/system/Titan. [Accessed: 21-Jul-2016]

[20] Dan LaRocque, Daniel Kuppitz, Matthias Broecheler, Marko A. Rodriguez, Stephen Mallette, and Vadas Gintautas, Titan: Distributed Graph Database. Aurelius, 2015 [Online]. Available: http://titan.thinkaurelius.com/. [Accessed: 20-Jul-2016]

[21] C. Vicknair, M. Macias, Z. Zhao, X. Nan, Y. Chen, and D. Wilkins, ‘A Comparison of a Graph Database and a Relational Database: A Data Provenance Perspective’, in Proceedings of the 48th Annual Southeast Regional Conference, New York, NY, USA, 2010, p. 42:1–42:6 [Online]. Available: http://doi.acm.org/10.1145/1900008.1900067

[22] David Hoksza and Jelynek, ‘Using Neo4j for Mining Protein Graphs: A Case Study’, presented at the 2015 26th International Workshop on Database and Expert Systems Applications (DEXA), 2015 [Online]. Available: http://ieeexplore.ieee.org/document/7406298/.

[23] Michael D. Myers, Qualitative Research in Business & Management, 2nd ed. SAGE, 2013 [Online]. Available: https://www.google.com/books?hl=sv&lr=&id=XZARAgAAQBAJ&oi=fnd&pg=PP2&dq=Michael+Myers+Qualitative+research+in+business+and+management&ots=C9QJpp-e9c&sig=NeeUEWIUciO3iiMsQ67M0ob0vdk. [Accessed: 16-May-2016]

[24] Neil J. Salkind, Exploring Research, 8th ed. Pearson, 2011.

[25] Office of Information Services, ‘Selecting a Development Approach’. Centers for Medicare & Medicaid Services, 27-Mar-2008 [Online]. Available: https://www.cms.gov/research-statistics-data-and-systems/cms-information-technology. [Accessed: 30-Aug-2016]

[26] Neo Technology, Inc., ‘The Neo4j Browser: A User Interface Guide for Beginners’, neo4j. 2016 [Online]. Available: https://neo4j.com/developer/guide-neo4j-browser/. [Accessed: 29-Jul-2016]

[27] James Gosling, Bill Joy, Guy L. Steele, Gilad Bracha, and Alex Buckley, ‘The Java Language Specification’. Oracle America, Inc, 13-Feb-2015 [Online]. Available: https://docs.oracle.com/javase/specs/jls/se8/jls8.pdf. [Accessed: 29-Jul-2016]

[28] Trello, Inc., ‘Trello’, 2016. [Online]. Available: https://trello.com/. [Accessed: 29-Jul-2016]

Appendix A

Database Creation

neo4j-import --into logistics.db --id-type string \
    --nodes zipAreas.csv \
    --nodes consignors.csv \
    --nodes consignees.csv \
    --nodes shipments.csv \
    --nodes deliveryPoints.csv \
    --nodes party.csv \
    --nodes organisation.csv \
    --nodes trade.csv \
    --relationships:TO to.csv \
    --relationships:SENT sent.csv \
    --relationships:BELONGS_TO belongsto.csv \
    --relationships:HAS has.csv \
    --relationships:IN in.csv \
    --relationships:LOCATED_AT locatedat.csv \
    --relationships:PICKED_UP_AT pickedupat.csv


Appendix B

Customer Support

B.1 Find Shipment by Kolli-id


WHERE s.id="X"
RETURN cr, s, ce, p

B.2 Find Shipments for a Consignee


WHERE (p.email="X" OR p.mobile="Y") AND p.name="Z"
RETURN s, ce, p

B.3 Find Shipments for a Consignee from a Consignor


WHERE (p.email="X" OR p.mobile="Y") AND p.name="Z"
  AND cr.name="W"
RETURN cr, s, ce, p


B.4 Find Top 10 Suitable Delivery Points for a Consignee

MATCH (s:Shipment)-[:TO]->(ce:Consignee),
      (s:Shipment)-[:PICKED_UP_AT]->(dp:DeliveryPoint)
WHERE ce.name="X"
RETURN dp.name, count(s) ORDER BY count(s) DESC LIMIT 10
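To make the aggregation in B.4 concrete, the following is a minimal Python sketch of the same counting logic, run over in-memory records rather than against Neo4j. The record layout (shipment id, consignee name, delivery point name), the sample data and the function name top_delivery_points are assumptions made for illustration only and are not part of the prototype.

```python
from collections import Counter


def top_delivery_points(shipments, consignee_name, limit=10):
    """Count shipments per delivery point for one consignee, most used first."""
    counts = Counter(
        delivery_point
        for _shipment_id, consignee, delivery_point in shipments
        if consignee == consignee_name  # mirrors WHERE ce.name="X"
    )
    # most_common sorts by count in descending order and truncates,
    # mirroring ORDER BY count(s) DESC LIMIT 10
    return counts.most_common(limit)


# Hypothetical sample data: (shipment id, consignee name, delivery point name)
sample = [
    ("s1", "X", "Point A"),
    ("s2", "X", "Point B"),
    ("s3", "X", "Point A"),
    ("s4", "Y", "Point B"),
]

print(top_delivery_points(sample, "X"))
```

In the graph version, the two MATCH patterns play the role of the tuple layout here: each shipment is tied to its consignee by a TO relationship and to its delivery point by a PICKED_UP_AT relationship.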

Appendix C

Business to Business

C.1 Find all Shipments Sent to Organisations


WHERE NOT o.name=o2.name AND o.name="X"
RETURN count(s) AS sent, o2.name AS organisation
ORDER BY sent DESC

C.2 Find Biggest Organisations in an Area


WHERE z.id="X"
RETURN o.name, count(s) AS sum ORDER BY sum DESC


C.3 Find Consignees’ Consignees


WHERE o1.name="X" AND NOT o.name="X"
WITH o, o1
MATCH (o:Organisation)<-[:BELONGS_TO]-(c:Consignor)
      -[:SENT]->(s:Shipment)-[:TO]-(ce:Consignee)
      -[:BELONGS_TO]->(o2:Organisation)
WHERE NOT o2.name=o.name
RETURN o1.name AS sendingorganisation, count(s) AS sum, o2.name AS receivingConsignee
ORDER BY sum DESC

TRITA TRITA-ICT-EX-2016:156

www.kth.se
