Master Thesis: Progress Report 3 - WordPress.com · 2011. 2. 16. · profilingin social networks....

Master Thesis: Progress Report 3 Laurens De Vocht Master in Computer Science Engineering Master in de ingenieurswetenschappen: computerwetenschappen Subject: Scientific Profiling based on Semantic Analysis in Social Networks Supervisors: Dr. Martin Ebner Prof. Dr. Erik Duval Promotors: Prof. Dr. Erik Duval Prof. Dr. Nick Scerbackov Academic year 2010 – 2011




© Copyright K.U.Leuven

Without written permission of the promotors and the authors it is forbidden to reproduce or adapt in any form or by any means any part of this publication. Requests for obtaining the right to reproduce or utilize parts of this publication should be addressed to the Departement Computerwetenschappen, Celestijnenlaan 200A bus 2402, B-3001 Heverlee, +32-16-327700 or via e-mail [email protected].

Written permission of the promotor is also required to use the methods, products, schematics and programs described in this work for industrial or commercial use, and for submitting this publication in scientific contests.



Contents

Abstract
Samenvatting (Summary)
List of Figures and Tables

1 Introduction
1.1 Background
1.2 Problem statement
1.3 Purpose
1.4 Scope

2 Literature Study
2.1 A network of linked data
Where it all started, How the social web can be interlinked, Which layers the semantic web consists of, What semantic profiling is about
2.2 Social networks in this decade
Where the object centered sociality went, How online communities can be interlinked
2.3 A story told in triples
How semantic microblogging with Twitter could work, What a semantic microblogging architecture should look like, Another case of data transformation, How mining microblogs using semantic technologies can be done, Semantic Web Pipes for Semantic Mash-Ups
2.4 Conclusion

3 Software Architecture
3.1 Design specifications
3.2 Extraction Layer
3.3 Interlinking Layer
3.4 Analysis Layer
3.5 Web Service Layer
3.6 Conclusion

4 Implementation Details
4.1 Extraction Layer
4.2 Interlinking Layer
4.3 Analysis Layer
4.4 Issues

5 Project Plan
5.1 Overview
5.2 Previous iterations
Iteration 1, Iteration 2, Iteration 3, Iteration 4 and 5, Iteration 6, Iteration 7
5.3 Schedule
Notes, Changes
5.4 Conclusion

6 Conclusion

Bibliography


Abstract

This is the third report of a Master Thesis project in Computer Science at Graz University of Technology (TUGraz) and the Katholieke Universiteit Leuven (KULeuven). It gives an overview of the first part of the research. It provides background to situate this report and discusses the problem statement. An in-depth view of the software architecture is given. Some noteworthy implementation details are revealed. Finally, an updated project plan is motivated.


Samenvatting (Summary)

This is the third report written as part of a master's thesis in engineering science: computer science at the Katholieke Universiteit Leuven (KULeuven) and Graz University of Technology (TUGraz). We first sketch the background for this report and carefully restate the problem statement. A detailed overview of the architecture of the framework follows. We reveal some notable implementation details. Finally, we motivate the current version of the project plan.


List of Figures and Tables

List of Figures

1.1 Main use case diagram
2.1 Walls between social networks as presented by Tim Berners-Lee
2.2 Three layers of the Semantic Web by Peter Mika
2.3 A triple by Peter Morville
2.4 Exit to the semantic web
3.1 The semantic profiling framework design updated
3.2 The extraction layer represented as a package
3.3 Detail of the interlinking module
3.4 The analysis layer represented as a package
3.5 Demo of the matching between two twitter users based on hashtag similarities
3.6 User Profile JSON Response
3.7 People Discovery based on specific User Profile only JSON response
4.1 W3C validation of a twitter user RDF/XML produced in the extraction layer
4.2 SPARQL Example Query
4.3 SPARQL Query Result
4.4 Not optimal URI, it does not refer to an RDF resource
4.5 Better URI, refers to a local URI with RDF representation

List of Tables

5.1 The research schedule


Chapter 1

Introduction

This introduction gives a background overview for this Master Thesis as well as a definition of the problem. The purpose and the scope of this report are outlined.

1.1 Background

One of the most visible trends on the internet is the emergence of “Social Web” sites. Current online community sites are isolated from one another. The main reason for this lack of interoperability is that common standards for data interchange have yet to emerge. The Semantic Web technology stack is well defined, and applying frameworks such as SIOC (Semantically Interlinked Online Communities) [2] and FOAF (Friend-Of-A-Friend) [1] can lead to an interlinked and semantically rich knowledge source. This knowledge source will be built on user profiles and the content users produce on various social networks. To achieve this, one process recurs in many projects. It is realized in three steps, each with its own specific tools that aid in the implementation. In the first step, referred to as “triplification” or “rdfization” [10][17], data is extracted and annotated with the help of domain vocabularies and ontologies. In the second step, the resulting triples are stored and made accessible as linked datasets. The third and final step is the publication of the data URIs in various RDF formats or as a SPARQL endpoint. More information is contained in the first thesis report [6].

We propose a framework to address an important issue in the context of the ongoing adoption of “Web 2.0” in science and research, often referred to as “Science 2.0” or “Research 2.0”. A growing number of people are linked via acquaintances, and online social networks such as Twitter allow indirect access to a huge amount of ideas. These ideas are contained in a massive human information flow. Many studies show that users of these networks produce relevant data. The problem, however, lies in discovering and verifying items in such a stream of unstructured data. A related problem is locating an expert who could provide an answer to a very specific research question [4].

At Graz University of Technology a tool called Grabeeter [13] has been implemented for storing and caching social data from Twitter. This report describes an implementation of the triplification process for Grabeeter. This project takes part in an effort to provide a scientific architecture paradigm for building semantic applications that rely on social data [19]. Furthermore, the report describes the architecture of the framework that we are building on this semantic social data layer [18]. It aims to gain more knowledge and mine usable data out of the social context of microblogs. Selver Softic is developing a linked data repository called Colinda. This repository will be used to link author tags to scientific conferences. The implemented use case to test this application will be the end-user usability of semantically enriched researcher profiles. Based on a semantic analysis of researchers' tweets, research fields are derived and connected with their workplaces (institutions or companies) as well as conferences. Since we are mining a social data source where scientific significance is essential, we can call this analysis “Scientific Profiling”.

The results of social data triplification have not yet been proven accessible to researchers without the help of an expert with extensive information mining skills. Semantic profiling could help researchers understand the scientific relevance and importance of these results to their specific needs, because the volume of social network data is massive and individual researchers are likely to be interested only in specific parts of the overall knowledge, depending on their area of specialization. Online social connections can be built around common entities that the users link to. At the same time, semantic profiling will create new opportunities to correlate existing Research 2.0 integration efforts and applications.

1.2 Problem statement

The goal is to build a semantic profiling framework that can support applications and services that aim to improve the way researchers connect.

The main use case and application that the framework has to support is illustrated by what could be called “the conference case”, as shown in Figure 1.1. Scientists and researchers are interested in very specific topics; this is best verified by the conferences they attend. Another trend is that they blog and tweet about these events [16][8]. This creates huge opportunities for profiling. The attendees tweet about what they notice and what they find interesting for their own projects. What if we could connect these users using this information? We could call an application that does just that “Scientific Profiling”. This approach comes from the concept that the data produced in social networks can have true value if properly annotated and interlinked [3]. A second requirement is to create a suitable context in which this information can acquire meaning. This is very important for identifying which ontologies should be used.

Turning the concept around leads to another use case: “the looking-for-interesting-people case”. Suppose a scientist wants to find interesting events (which many people in their field are attending), people (based on matching interests or events), or new challenges (companies, organizations and topics related to the events and topics this scientist is interested in). This application is also “Scientific Profiling”, only now approached from the user perspective rather than from a data perspective.
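To make the idea of connecting researchers through what they tweet more concrete, the following minimal sketch ranks users by the overlap of their hashtags. It is an illustration only: the Jaccard measure, the function names and the sample timelines are assumptions made for this example and are not taken from the framework described in this report.

```python
import re
from typing import Dict, List, Set, Tuple

# Simplified hashtag pattern; real tweets may need more careful tokenisation.
HASHTAG = re.compile(r"#(\w+)")


def hashtags(tweets: List[str]) -> Set[str]:
    """Collect the lower-cased hashtags used in a list of tweets."""
    tags: Set[str] = set()
    for text in tweets:
        tags.update(tag.lower() for tag in HASHTAG.findall(text))
    return tags


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: shared tags divided by all tags (0.0 if both empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_matches(user: str, timelines: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Rank all other users by hashtag similarity to the given user."""
    own = hashtags(timelines[user])
    scores = [
        (other, jaccard(own, hashtags(tweets)))
        for other, tweets in timelines.items()
        if other != user
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data, invented for this example.
timelines = {
    "alice": ["Great keynote at #iswc2010 on #linkeddata", "Slides for #eswc are online"],
    "bob": ["Heading to #ISWC2010 tomorrow", "Reading about #linkeddata"],
    "carol": ["Enjoying the #worldcup"],
}
ranking = rank_matches("alice", timelines)
```

In this toy data, the user who shares two of the three conference-related tags ranks first. A real profiler would presumably also weight tags by how specific they are, which this sketch does not attempt.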


FIGURE 1.1: Main use case diagram (elements shown: Researchers, Researcher (User), Scientific Profiling, Profiler/Analyzer, User Model, Event Model, Scientific Conferences Resource)

1.3 Purpose

This report is primarily intended for the supervisors and promotors of this Master thesis. Anyone interested in the semantic web, microblogging and profiling might also find parts of this report relevant. The next chapters give an overview of the current architecture and provide more insight into some important implementation details. Changes to the previous architecture are explained and motivated.

1.4 Scope

Note that the software architecture does not describe how the framework could support a broad set of domains. It remains focused on the integration of user data from Twitter and domain knowledge about scientific conferences. As stated in the previous report [6], it is being designed only with the problem statement in mind. At this time it is not part of the research to find out how this could be extended to other resources (besides Twitter and Colinda) or targets (e.g. mobile applications). This report is limited to the development and implementation in the second four weeks of the project. It also includes some important implementation details.


Chapter 2

Literature Study

This chapter gives an overview of the most important articles. The articles are presented in the following order: the first articles deal with the semantic web in general, then some cases of microblogging combined with semantics are discussed, and finally the chapter presents a commented summary of some ideas that directly support this specific case of scientific profiling in social networks.

2.1 A network of linked data

The semantic web represents a network of linked data. This data can be of any kind. It all started as a vision by World Wide Web guru Tim Berners-Lee. Since it was first introduced in 2001 the discussions have never stopped. Some claim it will disappear as slowly as it became popular; others are convinced it will creep into all web services known today. Ultimately the entire world wide web could form a huge semantic web. However interesting a study of this holistic view and of the development of its reputation might be, it is not relevant for this project. It is of more interest to look at what is out there and which semantic web projects and tools can support the framework for the semantic profiling application.

2.1.1 Where it all started

Every study about the semantic web should include the original paper by Berners-Lee et al., published in May 2001 in Scientific American [11]. In the article they presented the semantic web as a new form of web content that is meaningful to computers. They believed, and still do today, that it will unleash a revolution of new possibilities. The authors started with an example of two busy people scheduling an appointment. Both used the help of their software agents. Those agents were able to identify events, times and locations in their messages and link them to both their schedules. The authors called this concept the Semantic Web.

The Semantic Web differs from the World Wide Web in the sense that it will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users. According to the authors the Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. Like the Internet, the Semantic Web will be as decentralized as possible.

FIGURE 2.1: Walls between social networks as presented by Tim Berners-Lee.

2.1.2 How the social web can be interlinked

Semantics in Twitter feeds and the profile of a user will be analyzed. An article by Bojars et al., “Interlinking the social web with semantics” [3], gives more insight into the relation between the current semantic and social web.

Bojars et al. discussed one of the most visible trends on the Web: the emergence of Social Web sites, which help people create and gather knowledge by simplifying user contributions via blogs, tagging and folksonomies, wikis, podcasts, and online social networks. They noted that current online-community sites are isolated from one another (see Figure 2.1), like islands in a sea. The main reason for this lack of interoperation is that for the most part in the Social Web, common standards still do not exist for knowledge and information exchange. During the last couple of years, a lot of effort has gone into defining standards for data interchange and interoperation. The Semantic Web technology stack is well defined, enabling the creation of metadata and associated vocabularies. The Semantic Web effort is in an ideal position to make Social Web sites interoperable. Applying Semantic Web frameworks such as SIOC (Semantically Interlinked Online Communities) [2] and FOAF (Friend-Of-A-Friend) [1] to the Social Web can lead to a Social Semantic Web, creating a network of interlinked and semantically rich knowledge.

2.1.3 Which layers the semantic web consists of

A comment by Steffen Staab [20] in the column “Trends and Controversies” of the magazine “IEEE Intelligent Systems” led to an interesting paper by Peter Mika [20]. It supports the conviction that the integration of social network data from different sources is very important. The information produced in social networks has true value since it contains an extensive amount of knowledge. This knowledge is communicated between people who are members of a specific research group or community.

There are however some issues to be considered. Two in particular stick out from the thick proceedings volumes: ontology learning and ontology mapping. Ontology learning or extraction is the attempt to recreate a conceptual model from existing knowledge sources, in particular natural text. Ontology mapping (also known as merging, alignment, and so on) refers to finding and reconciling the relations between two or more conceptual models and creating a single model that captures their intentions and the relationships between them. Both are explained very clearly in this article.

FIGURE 2.2: Three layers of the Semantic Web by Peter Mika

Staab stated:

Social networks have interesting properties. They influence our lives enormously without us being aware of the implications they raise: How does a kind of fashion become en vogue? How does a virus spread and infect people? How does a research topic become a hot topic? Why are some companies successful and others are not? All these questions affect us, and understanding them by building and investigating computational models might give us a powerful tool to improve our health system, increase individual and general wealth, or just increase awareness about how the people around us actually influence our opinions, which we frequently believe that we shape.

Peter Mika considered a particular form of influence: the way that people agree on terminology and the phenomenon's implications for the way we build ontologies and the Semantic Web. In a nutshell, he reasoned that the Semantic Web will either include social networks' influence in its architecture or wither away. The change of conceptualizations as communities evolve poses another challenge. This challenge is of course the “Ontology Mapping” he referred to earlier in his article. The more unstable knowledge is, the more difficulty we can expect in formalizing and sharing it on a large scale. Mika included an illustration, reproduced in Figure 2.2, that shows how communities, ontologies, and content make up the three layers of the Semantic Web.

2.1.4 What semantic profiling is about

In an interesting document [12], Dave McComb, President of “Semantic Arts”, explained from his experience how one could conceptualize semantic profiling. He stated:

Semantic profiling is a technique using semantic-based tools and ontologies in order to gain a deeper understanding of the information being stored and manipulated in an existing system. This approach leads to a more systematic and rigorous approach to the problem and creates a result that can be correlated with profiling efforts in other applications.

There is no better way to express this concept. Applied to the scientific profiling project: the semantic analysis of Twitter users' profiles should help in gaining a deeper understanding of their scientific relevance. It will also create more opportunities to correlate “Research 2.0” applications.

2.2 Social networks in this decade

In the past few years the impact of social networks has kept increasing. Because of this significance, a study of several properties of social networks is useful. A number of articles highlight specific properties that are of interest to this project.

2.2.1 Where the object centered sociality went

A five-year-old blog post by Jyri Engestrom [9], co-founder of Jaiku, reads as if the problem is still current. Engestrom notes that in present social networks a very important part is often left out: the part that describes what connects people, whether it is another person, a job, an event or a common interest. Many social networks make it difficult to disconnect from someone who is no longer known or whose connection has an unknown origin. If social networks became object centered, as they are in real life, one would not have to deal with this issue. Online social connections would simply be built around the objects that connect people.

2.2.2 How online communities can be interlinked

In an article [4], Breslin et al. presented different types of online communities and the tools that were used at that time to build and support them. Those communities are islands that are not interlinked. The authors presented the SIOC ontology, whose goal is to interconnect these online communities.

In the first section they presented the SIOC ontology. The ontology consists of two major parts: first, it contains classes and properties that describe discussion forums and posts in online community sites. Second, it includes mappings that relate SIOC to existing vocabularies such as FOAF and RSS. Breslin et al. elaborated on how the exchange, both importing and exporting data, can be executed. The core use of SIOC will be in the exchange of instance data between sites. Wrappers will make it possible to export instances of community site concepts such as forums or posts in RDF format. They can also make it possible to import SIOC instances into other non-SIOC systems. In the final section Breslin et al. discussed using SIOC data. Given the ontology, the mappings, and the wrappers, they were able to pose queries and add data to individual SIOC sites. They highlighted three aspects: browsing, querying and locating related information. The authors concluded that, to tackle the challenge of adoption, they have provided an upgrade path that allows a gradual migration from existing systems to semantically-enabled sites. For combination with other ontologies they have presented mappings to and from SIOC.

FIGURE 2.3: A triple by Peter Morville

2.3 A story told in triples

A triple is a structure that connects a subject node with an object node by a predicate link (see Figure 2.3). Data generated in social networks cannot easily be converted into triples. Those triples then have to be made available to other users. Ongoing research points out several such challenges and issues. A few important ones are outlined in this section.
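The subject, predicate and object structure can be sketched in a few lines of Python. This is a minimal illustration rather than a real triple store: the `ex:` identifiers and the `ex:attends` predicate are hypothetical names invented for this example, and `match` is a simple pattern lookup in which `None` acts as a wildcard.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One statement: a subject node linked to an object node by a predicate."""
    subject: str
    predicate: str
    obj: str


def match(store: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return all triples that fit the pattern; None matches anything."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Hypothetical statements about two researchers and one conference.
store = [
    Triple("ex:alice", "foaf:knows", "ex:bob"),
    Triple("ex:alice", "ex:attends", "ex:iswc2010"),
    Triple("ex:bob", "ex:attends", "ex:iswc2010"),
]

# "Who attends ex:iswc2010?" expressed as a triple pattern.
attendees = [t.subject for t in match(store, predicate="ex:attends", obj="ex:iswc2010")]
```

The pattern lookup mirrors, in a very reduced form, what a basic graph pattern does in a query language such as SPARQL.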

2.3.1 How semantic microblogging with Twitter could work

This project's framework will have to deal with short messages of no more than 140 characters. Publishing such messages is called microblogging. Joshua Shinavier wrote a summarizing paper [17] on how this can be achieved. He introduced a semantic data aggregator which brings together a collection of compact formats for structured microblog content with Semantic Web vocabularies and best practices in order to augment the Semantic Web with real-time, user-driven data. This is the direction the research in this project follows.

Shinavier's paper takes the approach of harvesting semantic data embedded in the content of microblog posts, or of doing for microblogs what microformats do for Web pages. This is complementary to the “Semantic wikis” and “Microformats” communities, which aim to bridge this gap by enabling users to add small amounts of semantic data to their content. A number of compact formats have been proposed to allow users to express structured content or issue service-specific commands in microblog posts. So-called triple tags even allow the expression of something like an RDF triple. Microformats are subject to a tradeoff between simplicity and expressivity which heavily impacts community uptake. Shinavier gave the examples of Twitter Data, Micro Turtle, Smesher and Twitlogic.
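As an illustration of how a triple tag could be harvested from a post, the sketch below assumes the common `namespace:predicate=value` machine-tag form, prefixed with `#`. The exact syntaxes of the formats surveyed by Shinavier may differ, and the tag names, the post identifier and the sample text are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed triple-tag form: #namespace:predicate=value
# The value runs up to the next whitespace, so trailing punctuation is not stripped.
TRIPLE_TAG = re.compile(r"#(\w+):(\w+)=(\S+)")


def parse_triple_tags(post_uri: str, text: str) -> List[Tuple[str, str, str]]:
    """Turn every triple tag in a post into a (subject, predicate, object) tuple.

    The post itself is used as the subject of each statement.
    """
    return [
        (post_uri, f"{namespace}:{predicate}", value)
        for namespace, predicate, value in TRIPLE_TAG.findall(text)
    ]


# Hypothetical microblog post carrying three triple tags.
text = "Arrived at the venue #geo:lat=47.07 #geo:long=15.44 #conf:name=ISWC2010 see you there"
triples = parse_triple_tags("ex:post/1", text)
```

Plain hashtags such as `#iswc2010` do not match the pattern and are ignored, which reflects the tradeoff mentioned above: the richer form is more expressive but asks more of the user.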

2.3.2 What a semantic microblogging architecture should look like

“SMOB” (Semantic MicrOBlogging) is an interesting system because its architecture is similar to the kind of architecture needed to realize the scientific profiling application. SMOB has been described in an article [14] about microblogging by Passant et al. The article also described the implementation of an initial prototype of this concept that provides ways to leverage microblogging with the Linked Data Web guidelines. At the time of writing, microblogging services were (and still are today) centralised and confined. Efforts still have to be made to let microblogging be part of the Social Semantic Web.


FIGURE 2.4: Exit to the semantic web.

The authors introduced classical microblogging and some of the issues it raises. They examined how the Semantic Web can help resolve these issues and what it can offer that traditional services could not achieve. Passant et al. then gave an overview of microblogging, described why we should consider it and highlighted current issues. In the article they stated their belief that the Semantic Web is an elegant solution for opening these data from proprietary data silos. It is a solution for providing machine-processable data and metadata to microblogging as well as for delivering an open and distributed environment for microblogging. They also described the architecture of a semantic microblogging service. In order to model the metadata of a microblogging service, they relied on two widely used ontologies of the Social Semantic Web: FOAF and SIOC.
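A rough idea of what modelling a microblog post with FOAF and SIOC terms can look like is sketched below. The terms `sioc:Post`, `sioc:content`, `sioc:has_creator`, `sioc:UserAccount` and `foaf:accountName` come from the published vocabularies, but the selection of terms, the `http://example.org/` base URI and the sample post are assumptions for illustration. This is not the actual SMOB data model.

```python
from typing import Dict, List

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
SIOC = "http://rdfs.org/sioc/ns#"
BASE = "http://example.org/"  # hypothetical base URI for local resources


def uri(value: str) -> str:
    """Format a URI as an N-Triples term."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Format a plain string literal, escaping the characters N-Triples requires."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return f'"{escaped}"'


def post_to_ntriples(post: Dict[str, str]) -> List[str]:
    """Describe one post and its author account as N-Triples lines."""
    account = uri(f"{BASE}account/{post['author']}")
    item = uri(f"{BASE}post/{post['id']}")
    triples = [
        (account, uri(RDF_TYPE), uri(SIOC + "UserAccount")),
        (account, uri(FOAF + "accountName"), literal(post["author"])),
        (item, uri(RDF_TYPE), uri(SIOC + "Post")),
        (item, uri(SIOC + "has_creator"), account),
        (item, uri(SIOC + "content"), literal(post["text"])),
    ]
    return [f"{s} {p} {o} ." for s, p, o in triples]


# Hypothetical post used only to exercise the function.
post = {"id": "1", "author": "alice", "text": 'Talk on "linked data" at #iswc2010'}
lines = post_to_ntriples(post)
```

Each returned line is one statement; writing the lines to a file would give a small N-Triples dump that generic RDF tools should be able to load.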

To summarize this paper: it introduced the architecture and a first implementation ofa distributed semantic microblogging platform. While existing approaches to convertmicroblogging services to RDF already exist for Twitter, their approach relies on a completeopen and distributed view, using some standards of the Social Semantic Web. Moreover,some parts of their work, as the hash tag processing could be adopted to services such asTwitter to enable some semantics in existing tools.

2.3.3 Another case of data transformation

“SCOVO” (Statistical COre VOcabulary) is a vocabulary that supports systems where statistical data is processed and linked to the semantic web. The paper of Hausenblas et al. [10] explained this process and the use of SCOVO. Their workflow is similar to the one being implemented in this project.

There are three important steps, and every step has specific tools that aid in the implementation. RDFication: with the help of a domain vocabulary, build RDF triples from the original data. Interlinking: this step results in linked data sets. Publication: here the URIs of the RDF and (X)HTML representations are published over HTTP. The metadata can be deployed as SPARQL endpoints + RDF dumps, RDF/XML or XHTML + RDFa.
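The three steps can be illustrated with a minimal Python sketch (the project itself uses PHP). It is not taken from Hausenblas et al.: the record, the `example.org` URIs and the helper names `rdfize`, `interlink` and `publish_ntriples` are assumptions made for illustration, and the use of `scv:Item` with `rdf:value` is only one plausible way to model a statistical value.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SCV = "http://purl.org/NET/scovo#"          # SCOVO namespace (assumed here)
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def rdfize(record, base):
    """RDFication: turn one source record into (subject, predicate, object) triples."""
    subject = base + record["id"]
    return [
        (subject, RDF + "type", SCV + "Item"),
        (subject, RDF + "value", str(record["value"])),
    ]


def interlink(triples, links):
    """Interlinking: add owl:sameAs links to resources in other data sets."""
    subjects = sorted({s for s, _, _ in triples})
    extra = [(s, OWL_SAME_AS, links[s]) for s in subjects if s in links]
    return triples + extra


def publish_ntriples(triples):
    """Publication: serialise the triples as N-Triples text (one dump format)."""
    lines = []
    for s, p, o in triples:
        if o.startswith("http://"):
            obj = f"<{o}>"                   # object is a resource
        else:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'             # object is a plain literal
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


# Hypothetical record and link target, for illustration only.
record = {"id": "item-1", "value": 42}
links = {"http://example.org/stat/item-1": "http://example.com/other/item-1"}
document = publish_ntriples(interlink(rdfize(record, "http://example.org/stat/"), links))
```

In a real deployment the serialised document would be served over HTTP or loaded into a store behind a SPARQL endpoint; the sketch stops at producing the text.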

The authors compared this approach with two others, D2R Eurostat and 2000 U.S. Census, in an overview table. It is important to note that all approaches have their limitations. One can select an approach depending on what dataset is being dealt with and what target system is involved.


2.3.4 How mining microblogs using semantic technologies can be done

The framework for the semantic profiling tool fits like a puzzle piece in a bigger system that is being developed in the research group “Social Networked Learning” at Graz University of Technology. Selver Softic of Infonova GmbH and Ebner et al. of the “Social Learning” department at TU Graz recently wrote a paper [18] about their ongoing research efforts aiming at knowledge discovery. They aim to provide a scientific architecture paradigm for building semantic applications that rely on social data.

For example, they worked out an approach for interlinking and RDFising social e-Learning Web 2.0 platforms like ELGG based on semantic tagging and Linked Data principles [19]. A special module called “SID” (Semantically Interlinked Data) was developed to give existing tagged and published user-generated content an easy entrance into the Web of Data and, at the same time, to enrich it semantically.

At the moment Softic et al. are focussing on data from Twitter. For this purpose they have implemented a tool, “Grabeeter” [13], for storing and caching social data. In this paper they outlined the architecture for a system that can extract, structure and link the data grabbed from Twitter by Grabeeter. They introduced the interesting aspects of microblogs and how far these correspond with ideas from other research areas like the Semantic Web or Linked Data. They also tried to answer how far those two areas can be combined to gain more knowledge and mine usable data out of the social context of microblogs. Finally they presented an architectural paradigm that answers the specified research issue. This architectural paradigm is the basis for the software architecture described in chapter 3.

2.3.5 Semantic Web Pipes for Semantic Mash-Ups

Something very promising is the concept of “SWP” (Semantic Web Pipes), similar to “Yahoo Pipes”. At the DERI institute Le-Phuoc et al. have developed and tested an SWP system: “DERI Pipes” [7]. They presented the pipe concept [15] as a good basis for semantic web applications using RDF. The authors said that the use of RDF data published on the Web for applications is still a cumbersome and resource-intensive task due to the limited software support and the lack of standard programming paradigms to deal with everyday problems such as combination of RDF data from different sources, object identifier consolidation, ontology alignment and mediation, or plain querying and filtering tasks. Architectural styles have been around for several decades and have been the subject of intensive research in other domains such as software engineering and databases. They based their work on the classical pipe abstraction and extended it to meet the requirements of Semantic Web applications using RDF.

Le-Phuoc et al. found that, despite the existence of standards and de facto standards for publishing RDF, key problems in systems processing RDF remain:

The data is fragmented; may be incomplete, incorrect or contradicting; partly follows ontologies, often with ontologies used wrongly or inconsistently, to name a few, and thus needs to be “sanitized” before it can be processed. A specifically cumbersome problem is the use of different identifiers denoting the same object which need to be unified.

Web pipes are “live”: they are computed on demand when requested via an HTTP invocation, and thus reflect an up-to-date state of the system (which can also be detrimental in scenarios where caching would be applicable). The authors then continued with an example to motivate the use of semantic web pipes and gave a concise overview of how they work. They sketched the main functionalities and gave an overview of all the important operators. They also discussed the system design and implementation of their version of SWP. Finally they evaluated the system by means of a case study. The authors made some general remarks about performance issues and commented on the evaluation methodology (cognitive dimensions of notations). This is an interesting concept that could greatly support the semantic profiling framework. At the time of possible use, in a later development phase, semantic web pipes should be investigated in more detail.
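The identifier problem named in the quote above, different identifiers denoting the same object, can be made concrete with a small Python sketch. It is not the DERI Pipes implementation: the function name `consolidate`, the `ex:` identifiers and the choice of a union-find structure are assumptions for illustration. Given pairs of identifiers known to be equivalent, every triple is rewritten to one canonical identifier per object.

```python
def consolidate(triples, same_as):
    """Rewrite triples so that equivalent identifiers share one canonical name.

    triples: iterable of (subject, predicate, object) strings
    same_as: iterable of (identifier_a, identifier_b) equivalence pairs
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_as:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexicographically smallest identifier as canonical.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # Predicates are left untouched; duplicates collapse in the set.
    return sorted({(find(s), p, find(o)) for s, p, o in triples})


# Hypothetical data: ex:a and ex:b denote the same person.
merged = consolidate(
    [("ex:a", "name", "Ann"), ("ex:b", "name", "Ann"), ("ex:b", "knows", "ex:c")],
    [("ex:b", "ex:a")],
)
```

After consolidation the two `name` statements collapse into one, which is the kind of sanitizing step a pipe operator would perform before further processing.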

2.4 Conclusion

This chapter focused on some aspects of the semantic and the social web. The semantic web was presented as a network of linked data. Some challenges in interlinking the social web were outlined. Finally, ongoing research projects showed that it is possible to translate social web data into triples. However, the result of this process is still not accessible to casual users, and the information has to be linked more accurately to ontologies to create more relevant RDF data sources.


Chapter 3

Software Architecture

This chapter outlines the changes and improvements to the software architecture. It frequently refers to the first report [6].

3.1 Design specifications

The framework has to support at least the scientific profiling application that meets the requirements of the use case presented in chapter 1. Agile development suggests working towards use cases: features are added and implemented only if a use case needs them, so the features implemented in the framework are limited accordingly.

Based on the research work at TU Graz [19] the design consists of three layers: a data extraction layer, an interlinking layer and an analysis layer. In addition, a programming interface to this framework must be provided. At this point the main focus of the research is on the specification of the extraction layer, which is marked green in the diagram in Figure 3.1.

The extraction layer is modeled as a bottom-up only system, because there is no real interaction with the layers above. The only request it has to handle is: "give me all data about a person". The other layers will be looked into as soon as the first layer is being implemented. Before one layer is finished, the development of the next layer must start, and so on. An iterative development plan supports this method; it is explained in chapter 5. The interlinking layer is built directly on top of the RDF store and no longer queries the extraction layer directly. The extraction layer only loads triples into the RDF store on user request (“add me to the system”) or in a cron job (“update users”). This quite fundamental change in communication is made clear in the new design diagram in Figure 3.1.

The semantic profiling framework has to support a Scientific Profiling application, as was explained in the problem statement in chapter 1. The framework architecture still consists of three layers:

1. Extraction layer: Extracts data from various resources and annotates it using relevant ontologies for that specific data context.

2. Interlinking layer: Is fed with annotated data (triples) and creates a SPARQL endpoint for it. It is responsible for requesting more data if needed for a certain information query. It parses high level queries and translates them to SPARQL queries. The results are then returned.

3. Analysis layer: Here a user’s information needs are translated into high level queries that the interlinking layer understands. It also contains some metrics to rank and evaluate the returned results.

FIGURE 3.1: The semantic profiling framework design updated. (Diagram components: Scientific Profiling Application, Programming Interface, Analysis Layer, Interlinking Layer, Extraction Layer, RDF Store, Grabeeter, Twitter; connections labelled High Level Queries, SPARQL Queries, Triplification, SQL Queries and Twitter API.)

3.2 Extraction Layer

FIGURE 3.2: The extraction layer represented as a package. (Package contents: User Microblogs Model, User Profile Model, Annotator, Various Ontologies, Triplifier; input from Grabeeter via SQL Queries and from Twitter via the Twitter API; output to Interlinking.)

The extraction layer grabs data from Twitter or Grabeeter, depending on whether a user exists in the Grabeeter database. This data is then annotated, stored in an RDF store and made available for interlinking. On a load request it annotates for every Twitter user: the profile as a SIOC [2] UserAccount, the timeline as a SIOC(Types) Microblog, and all the tweets as SIOC(Types) MicroblogPost. It grabs the tweets of a user from Grabeeter if the user has registered there. If not, they are retrieved with the Twitter API. Other ontologies that are used are Dublin Core [?], FOAF [1] and GeoNames [?].
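The annotation described above can be sketched in a few lines of Python (the project itself uses PHP classes). This is a minimal illustration, not the project's annotator: the function name `annotate_user`, the `example.org` URI layout and the tweet dictionaries are assumptions, and only the SIOC typing plus a few linking properties are shown.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SIOC = "http://rdfs.org/sioc/ns#"
SIOCT = "http://rdfs.org/sioc/types#"


def annotate_user(screen_name, tweets, base="http://example.org/"):
    """Return triples typing a profile, its timeline and its tweets with SIOC."""
    account = f"{base}user/{screen_name}"      # hypothetical URI layout
    timeline = f"{account}/timeline"
    triples = [
        (account, RDF_TYPE, SIOC + "UserAccount"),
        (timeline, RDF_TYPE, SIOCT + "Microblog"),
        (timeline, SIOC + "has_owner", account),
    ]
    for tweet in tweets:
        post = f"{base}post/{tweet['id']}"
        triples += [
            (post, RDF_TYPE, SIOCT + "MicroblogPost"),
            (post, SIOC + "has_container", timeline),   # timeline links user and posts
            (post, SIOC + "has_creator", account),
            (post, SIOC + "content", tweet["text"]),
        ]
    return triples


# Hypothetical input data.
sample = annotate_user("alice", [{"id": 1, "text": "Hello #iswc"}, {"id": 2, "text": "Slides online"}])
```

Dublin Core, FOAF and GeoNames properties (dates, names, locations) would be added in the same way; they are left out to keep the sketch short.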

3.3 Interlinking Layer

It is impossible to create a generic framework that supports all data contexts, but we can create a system that supports a broad range of data contexts. For now we are focussing on three data contexts:

1. User: Social Microblogs, annotated data from Twitter users (SIOC, FOAF, DC, GeoNames); Purpose: since we are doing profiling, data from the user is an absolute must.


2. Domain: Scientific Conferences, annotated data of scientific conferences; Purpose: to enable the framework to recognize and link to conferences.

3. General: OpenCalais Linked Data, CommonTag Ontology (with links to DBPedia and other Linked Open Data); Purpose: to give a meaning to topics and tags from a user.

This is implemented in the framework as a set of modules in the form of PHP classes; see Figure 3.3. The user data context is realised by the module “SMNUser”. This module can handle simple requests such as “give me tags for a user”, “describe a user” and “give me the friends of a user”. The domain knowledge module “SMNDomain” will collect tags and will be able to identify which tags are scientific conferences. Finally, the general knowledge is added by linking tags or entities that occur in the result set to Linked Open Data; this happens in the SMNLinkedData module.
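How a module such as “SMNUser” might turn its simple requests into SPARQL can be sketched as follows. This is a Python illustration, not the PHP module itself: the request names, the query templates and the function `build_query` are hypothetical, and the choice of `sioc:topic` for tags and `sioc:follows` for friends is an assumption about how the data is annotated.

```python
from string import Template

PREFIXES = "PREFIX sioc: <http://rdfs.org/sioc/ns#>\n"

# Hypothetical mapping from simple requests to SPARQL templates.
TEMPLATES = {
    "tags": Template(
        "SELECT DISTINCT ?tag WHERE { ?post sioc:has_creator <$user> ; sioc:topic ?tag . }"
    ),
    "friends": Template("SELECT ?friend WHERE { <$user> sioc:follows ?friend . }"),
    "describe": Template("DESCRIBE <$user>"),
}


def build_query(request, user_uri):
    """Translate a simple request about one user into a SPARQL query string."""
    if request not in TEMPLATES:
        raise ValueError(f"unsupported request: {request}")
    # Reject characters that are not allowed inside an IRI reference,
    # so the URI cannot break out of the <...> brackets.
    if any(ch in user_uri for ch in '<>"{}|\\^` '):
        raise ValueError("user_uri is not a valid IRI")
    return PREFIXES + TEMPLATES[request].substitute(user=user_uri)


query = build_query("tags", "http://example.org/user/alice")
```

Keeping the templates in one table means that a new data context only needs new entries, which matches the goal of extending the layer without rewriting it.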

Extending the framework with more domain knowledge could quickly increase the number of applications the framework supports. In its current form the “semantic profiling framework” will be able to support most “Research 2.0” use cases. They are very similar to the two use cases presented here: they are about discovering new resources created by researchers and, of course, the researchers and scientific events themselves.

In this way, if more domain knowledge is fed to the interlinking layer, there is no need to completely rewrite the layer. Some additional modules to handle domain-specific queries that come from the analysis layer should suffice. At the same time the analysis layer should have “no knowledge” of the data context behind it. It should merely focus on profiling concepts that a researcher is interested in: “persons”, “organizations/companies/institutions”, “tags/real world entities”, “topics”, “events”, “locations” and maybe others. The interlinking layer will translate queries concerning these concepts to the various contexts it knows.

3.4 Analysis Layer

The analysis layer contains modules in the form of PHP classes that perform algorithms on the graph. This graph is created during the extraction and interlinking phases. Figure 3.4 shows the package and its four main parts: tags, entities, filters and finders. An abstraction of the triples is made, so that the higher level API and web services don’t have to care about the underlying semantic graph structure of the data.

People and events are represented by three types of entity classes: “SMNProfile”, “SMNEvent” and “SMNPerson”. Tags, dates, locations and other entities are used as properties of these classes.

Tags can be filtered by a class named “SMNTagFilter”. Improvements for this class could be “Google Did You Mean” for spelling correction or a module to detect similar tags based on “WordNet” [?] synonyms. Currently a Porter stemming algorithm is used to reduce tags to their base form. Tags such as “blogger” and “blog” are treated as the same: “blog”.

FIGURE 3.3: Detail of the interlinking module. (Diagram components: SMNStore, SMNUserQ, SMNTagQ, SMNOpenDataLinker, SMNLocation, SMNConference; data sources: RDF Store, SPARQL Endpoint, LOD, DBPedia, GeoNames, Colinda; neighbouring layers: Extraction, Profiling - Analysis Layer, Application - Service Layer; concepts: User, Events, Interests; connections labelled SPARQL Queries and SPARQL Insert/Update Queries.)

Other filters are a location filter and a date filter. The location filter allows filtering users and events based on their location. The date filter allows considering only events in a certain time frame, or including only tweets from a specific time frame.

A class named “SMNBalance” is used to compare and give a weight to users, events and tags. This class serves two “Finder” modules. These modules support the functionality to find users and events based on several query parameters such as the user or event itself and date and location options.

FIGURE 3.4: The analysis layer represented as a package. (Package contents: Tags with SMNTagFilter, PorterStemmer, Google DYM Module and WordNet SynonymFinder; Entities with SMNPerson, SMNEvent and SMNProfile; Filters with LocationFilter, EventDateFilter and TweetDateFilter; Finders with SMNEventFinder, SMNUserFinder and SMNBalance; neighbouring layers: Interlinking and Web Service Layer - API.)

FIGURE 3.5: Demo of the matching between two twitter users based on hashtag similarities

A demonstration shows how two Twitter users can be compared based on the similar hashtags they use. The evaluation rating is currently simply the cosine similarity between the two sets of hashtags. The next step in the analysis is to identify conferences among their shared hashtags. This will be done in two ways: direct and indirect. The direct identification just takes the hashtag and checks whether it matches a conference name or abbreviation. The indirect way is to check whether any of the hashtags also occur in tags or keywords extracted from the conference description. A new metric that gives weights to these conference matches could then be calculated. This can be particularly interesting for informing users of a future conference that matches their interests. The current version of the demo is available online [?]. Figure 3.5 shows a screenshot of the demo.
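The rating and the two identification strategies can be sketched in Python (the demo itself is a PHP application). The function names, the sample hashtags and the conference keyword sets are hypothetical; the sketch only shows the cosine similarity over hashtag counts and the direct and indirect matching idea.

```python
import math
from collections import Counter


def cosine_similarity(tags_a, tags_b):
    """Cosine similarity between two hashtag lists, using occurrence counts."""
    a, b = Counter(tags_a), Counter(tags_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def direct_matches(hashtags, conferences):
    """Hashtags that equal a conference abbreviation or its name without spaces."""
    known = set()
    for abbreviation, name in conferences.items():
        known.add(abbreviation.lower())
        known.add(name.lower().replace(" ", ""))
    return sorted(t for t in set(hashtags) if t.lower() in known)


def indirect_matches(hashtags, conference_keywords):
    """Per conference, the hashtags that also occur among its description keywords."""
    tags = {t.lower() for t in hashtags}
    result = {}
    for abbreviation, keywords in conference_keywords.items():
        overlap = tags & {k.lower() for k in keywords}
        if overlap:
            result[abbreviation] = sorted(overlap)
    return result


# Hypothetical users and conference data.
alice = ["iswc", "semweb", "elearning"]
bob = ["iswc", "elearning", "php"]
shared = set(alice) & set(bob)
score = cosine_similarity(alice, bob)
direct = direct_matches(shared, {"ISWC": "International Semantic Web Conference"})
indirect = indirect_matches(shared, {"EDMEDIA": {"elearning", "multimedia"}})
```

A weighted conference metric could combine both results, for instance by counting a direct match more heavily than an indirect one; that weighting is left open here, as it is in the text.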

3.5 Web Service Layer

The web service layer is made available in the form of an API. Calls are made as REST requests that return a JSON object. See Figure 3.6 and Figure 3.7.

FIGURE 3.6: User Profile JSON Response

FIGURE 3.7: People Discovery based on specific User Profile only JSON response

USER PROFILE http://api.semanticprofiling.net/profile.php?user=screenname

DISCOVER PEOPLE AND EVENTS http://api.semanticprofiling.net/discovery.php?find=persons|events|popularfriends|popularmentions|popularevents&user=screenname

REGISTER NEW USERS http://api.semanticprofiling.net/register.php?user=twitteruser

EVENT DETAILS http://api.semanticprofiling.net/event.php?name=eventname
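A client could build these calls as in the following Python sketch. The helper names are hypothetical, and it is an assumption that the `find` and `user` parameters of the discovery call are joined with `&`; the sketch only constructs URLs and does not contact the service or assume a particular JSON layout.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.semanticprofiling.net/"
# The find options listed for the discovery call.
FIND_OPTIONS = ("persons", "events", "popularfriends", "popularmentions", "popularevents")


def profile_url(screen_name):
    """URL of the user profile call."""
    return BASE_URL + "profile.php?" + urlencode({"user": screen_name})


def discovery_url(find, screen_name):
    """URL of the discovery call for one of the supported find options."""
    if find not in FIND_OPTIONS:
        raise ValueError(f"unknown find option: {find}")
    return BASE_URL + "discovery.php?" + urlencode({"find": find, "user": screen_name})


def register_url(twitter_user):
    """URL of the call that registers a new Twitter user."""
    return BASE_URL + "register.php?" + urlencode({"user": twitter_user})


def event_url(event_name):
    """URL of the event details call; spaces and symbols are percent-encoded."""
    return BASE_URL + "event.php?" + urlencode({"name": event_name})
```

The returned string can be passed to any HTTP client, and the response body parsed with a JSON library.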

3.6 Conclusion

The framework that will support the scientific profiling application is organized in three layers: an extraction layer, an interlinking layer and an analysis layer. The extraction layer collects data about a user and their tweets from Grabeeter or Twitter. This data is annotated with appropriate ontologies, transformed into triples and stored in an RDF store. Then this data is interlinked with various linked open data and exposed as a SPARQL endpoint. The analysis layer abstracts from the underlying graph structure. It represents the annotated and interlinked data as Profile, Event or Person entities. The API makes use of this representation and of the analysis modules, which allow elegant querying of the graph.


Chapter 4

Implementation Details

4.1 Extraction Layer

The extraction layer is written in the PHP scripting language. The data from Twitter is retrieved with the help of a Twitter API [21]. The data from Grabeeter is accessed via MySQL, which is the fastest way. A cron job keeps the Grabeeter database up to date. If a Twitter user exists in Grabeeter, his data is retrieved from there; if not, his data is retrieved directly from Twitter. After retrieval the data is annotated using appropriate ontologies [6]. The hashtags are identified and linked to the triples. For now all hashtags refer to an unverified DBPedia resource; for some tags the DBPedia resource actually exists. At query time the hashtags will be verified and further interlinked to more Open Data sources. Verifying all hashtags at load time would significantly increase the annotation time.

In the extraction phase it is not yet necessary to format the triples as RDF. A simple PHP class TripleTree represents them. It contains a subject and a mapping to all properties and objects. This makes it a lot easier to collect object properties that link to the same subject node. There is a TripleTree for each post and a TripleTree for the user timeline. The user timeline forms the connection between the user and all of its posts. This is done inside the annotator module. Then all the TripleTrees are converted into a list of triples. This list is partitioned into manageable pieces and then inserted into the RDF store.
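The idea behind TripleTree can be sketched as follows. This is a Python illustration of the data structure described above, not the project's PHP class: the method names, the `chunked` helper and the chunk size are assumptions.

```python
class TripleTree:
    """One subject with a mapping from each property to its objects."""

    def __init__(self, subject):
        self.subject = subject
        self.properties = {}          # predicate -> list of objects

    def add(self, predicate, obj):
        """Attach another object to the subject under the given predicate."""
        self.properties.setdefault(predicate, []).append(obj)

    def triples(self):
        """Flatten the tree into (subject, predicate, object) tuples."""
        return [
            (self.subject, predicate, obj)
            for predicate, objects in self.properties.items()
            for obj in objects
        ]


def chunked(items, size):
    """Partition a list into pieces of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical timeline and post, for illustration only.
timeline = TripleTree("ex:timeline/alice")
timeline.add("sioc:has_owner", "ex:user/alice")
post = TripleTree("ex:post/1")
post.add("sioc:has_container", "ex:timeline/alice")
post.add("sioc:topic", "dbpedia:ISWC")
post.add("sioc:topic", "dbpedia:PHP")

all_triples = timeline.triples() + post.triples()
batches = chunked(all_triples, 3)     # each batch would become one store insert
```

Grouping by subject first keeps the annotator simple, and flattening at the end yields the list that is inserted into the store in batches.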

Figure 4.1 shows proof that the extraction process results in valid XML. The validation is done by the W3C RDF Validator [22]. It takes about 3 minutes to extract about 12000 triples for a user with 2000 tweets. For now there is a time limit on the script to prevent the system from crashing. For a user with more than 30000 triples it currently takes more than 5 minutes to generate the XML document. This is because the extraction algorithm is >O(n**2). There are three layers in the extraction algorithm:

for (every tweet) (for (all properties) (something >O(1)))


FIGURE 4.1: W3C validation of a twitter user RDF/XML produced in the extraction layer

The triples that are stored in the RDF store form a basis as a social data resource, consisting of people represented as their semantically annotated Twitter identities. Another repository (Colinda), currently under development, will keep track of conferences. It is possible to load any Twitter user and store the annotated triples in this RDF store.

4.2 Interlinking Layer

The first step was to set up the RDF store and SPARQL endpoint. The modules in the interlinking layer interact with the RDF store using SPARQL queries. The SPARQL endpoint can be accessed on the web [5].

The example query in Figure 4.2 selects all persons who live in “Austria”; Figure 4.3 displays the resulting screen names. There are definitely more users that actually have a reference to Austria, but to find them the interlinking between the different locations must be optimized. A query for “Austria”, for example, should also return the people that have only Vienna as their location. This will be a subject of the next iteration.
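The desired behaviour, a query for a country that also finds users located in one of its cities, can be illustrated with a small Python sketch. It does not show the project's SPARQL query: the `PARENT` table stands in for a containment hierarchy such as the one GeoNames provides, and the user data and function names are hypothetical.

```python
# Hypothetical containment hierarchy: place -> enclosing place.
PARENT = {"Vienna": "Austria", "Graz": "Austria", "Leuven": "Belgium"}


def located_in(place, region, parent=PARENT):
    """True if `place` equals `region` or lies inside it via the hierarchy."""
    seen = set()
    while place is not None and place not in seen:
        if place == region:
            return True
        seen.add(place)               # guards against cycles in the table
        place = parent.get(place)
    return False


def users_in(region, user_locations, parent=PARENT):
    """Screen names of all users whose location falls inside `region`."""
    return sorted(u for u, loc in user_locations.items() if located_in(loc, region, parent))


# Hypothetical users and their profile locations.
locations = {"alice": "Vienna", "bob": "Austria", "carol": "Leuven", "dave": "Graz"}
austrians = users_in("Austria", locations)
```

In the RDF store the same effect could be reached by linking location resources to their enclosing features and following those links in the query, which is the interlinking improvement planned for the next iteration.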

4.3 Analysis Layer

FIGURE 4.2: SPARQL Example Query

FIGURE 4.3: SPARQL Query Result

It could be interesting to enable spelling correction (for example Google DYM). Some tags might refer to the same concept. Using WordNet we could identify some of those concepts and link them together. Not all tags are equally important in the population. Tags could be weighted according to the number of references they have (DBPedia, GeoNames, Colinda etc.) or simply by the number of links to that tag. In this way we get something like a TF-IDF weight: we can count the number of occurrences of the tag for one user and then compare it to the total number of occurrences of the tag.
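The weighting idea can be sketched in Python as a simple ratio. This is an assumption about how “compare” might be made concrete, not the project's metric and not the standard TF-IDF formula: the weight of a tag for a user is the number of times that user used it divided by its total number of occurrences in the population.

```python
from collections import Counter


def tag_weights(user_tags, all_tags):
    """Weight each of a user's tags by (count for the user) / (count overall).

    user_tags: tag occurrences of one user
    all_tags:  tag occurrences of the whole population, including this user
    """
    user_counts = Counter(user_tags)
    totals = Counter(all_tags)
    return {
        tag: count / totals[tag]
        for tag, count in user_counts.items()
        if totals[tag]                # skip tags missing from the population
    }


# Hypothetical data: "iswc" is used only by this user, "web" by many.
user = ["iswc", "iswc", "web"]
population = user + ["web", "web", "web"]
weights = tag_weights(user, population)
```

A tag that is rare in the population but frequent for one user gets a weight close to 1, while a tag that everybody uses gets a low weight, which is the intuition borrowed from TF-IDF.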

4.4 Issues

Ideally we also want to get rid of the URIs that are actually REST API calls, such as the one in Figure 4.4, and rather point them to an RDF resource or a resource that exists in the local namespace.


FIGURE 4.4: Not optimal URI, it does not refer to an RDF resource

FIGURE 4.5: Better URI, refers to a local URI with RDF representation

Such referencing is already implemented for persons and tags (Figure 4.5). Furthermore, it has been made sure that these URIs have at least a basic representation.

Performance could become an issue after all. It is not the focus of this project, but it must be ensured that we can set up a relevant data source and keep it up to date in a reasonable amount of time. For now we load only the most recent 250 tweets of every user and we also limit the number of friends to 250. The extraction time for one user is 1:30 minutes. It takes about 5 minutes to interlink 250 tags. The 350 Twitter users and 32 000 tags loaded in the system result in about 850 000 triples.
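As a rough back-of-the-envelope check of these figures, the following Python sketch estimates how long a complete reload would take. It assumes that “1:30 minutes” means one and a half minutes, that users and tags are processed one after another, and that the time scales linearly; none of these assumptions is confirmed by measurements.

```python
def full_reload_minutes(users, tags, extract_min_per_user=1.5,
                        interlink_min_per_batch=5.0, batch_size=250):
    """Estimated minutes for extraction and for interlinking, processed sequentially."""
    extraction = users * extract_min_per_user
    interlinking = tags / batch_size * interlink_min_per_batch
    return extraction, interlinking


# Figures quoted in the text: 350 users, 32 000 tags, about 850 000 triples.
extraction, interlinking = full_reload_minutes(users=350, tags=32000)
total_hours = (extraction + interlinking) / 60
triples_per_user = 850000 / 350
```

Under these assumptions extraction takes 525 minutes and interlinking 640 minutes, so a full sequential reload would need roughly 19 hours, with an average of about 2400 triples per user. This suggests that incremental updates per user are preferable to reloading everything.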

An interesting way to limit the number of users is to work with several instances of Grabeeter. A user interested in a specific topic or user could track the Twitter stream or history and include all mentions and tweeters related to the search query. As more users are loaded into the system, the chance of finding a relevant match increases.

Hashtags seldom represent real-world entities. This means that in most cases they don’t refer to a real location or DBPedia resource. It might be interesting to consider an entity recognizer such as the one from OpenCalais to discover more entities talked about in tweets.

The result of the extraction is a huge graph of users. It might be interesting to use a graph-based algorithm to identify communities of people that have lots of common links. These can be centered around specific events or simply other users.


Chapter 5

Project Plan

This chapter presents the project plan. The system is briefly described, and the work that has been done so far is explained in more detail. The chapter then sheds light on the next few iterations. Finally, the schedule summarizes the entire chapter.

5.1 Overview

The choice of an iterative development plan allows agility. It ensures that every cycle evaluates the previous one and builds up to the next one. If adjustments have to be made, they are scheduled for the next iteration. Sufficient margin guarantees that all important milestones will be met. The important milestones are at the end of January, March and May.

The project started with a literature study and a global design. Based on those, an initial project plan was set up. Every two iterations the design specifications are reviewed and updated in a report such as this one. Before every milestone the implementations from the iterations that led to that milestone are integrated.

Every iteration starts with verifying a list of specifications that are not implemented yet. First, bugs and problems from the previous iteration are fixed. At about the same time, features to meet the specifications are implemented. After the bugs have been fixed and the new features implemented, they are tested. In this project only qualitative black box testing is done. The only important thing to verify is whether a module does what it is specified to do. Performance tests and in-depth code analysis are out of scope for this project.

5.2 Previous iterations

So far seven iterations of the research and development have been carried out. A quick description follows in this section. The first layers, which are very low level, have been developed. The results from the tests at the end of iteration two served as a starting point. The first two iterations allowed the exclusion of some tools and frameworks, since the early evaluation proved them not suitable for this case. In the following iterations the interlinking and analysis layers were designed and developed.

5.2.1 Iteration 1

A selection of literature informed about the current state of research. This selection serves as the base repository for the next few months. To make it easily accessible, the entire library has been put online in a Mendeley account. Some particular papers turned out to be very interesting as a starting point for this project. They were studied in more depth in the second iteration.

5.2.2 Iteration 2

The previous iteration identified some interesting papers. They formed the basis of articles for the literature study in this thesis project. The summaries and comments are described in the dedicated chapter 2. Furthermore, a blog was set up to keep track of the research and development efforts made. Some early tests on existing systems that could support the semantic profiling framework were performed.

5.2.3 Iteration 3

SPECIFICATIONS Implement the extraction layer modules in the PHP language according to the design created at the end of iteration 2.

BUG FIXES Since this was the first iteration with code implementation there were no bugs to fix.

NEW FEATURES IMPLEMENTATION The user profile model and the user tweets model were implemented. Then the annotator module and the triplifier module were coded.

TEST NEW FEATURES To test the new features, several PHP web pages used functions in the annotator and triplifier modules to see whether good results were produced. Results were displayed in simple plain text or in an HTML table.

MARK BUGS TO FIX IN THE NEXT ITERATION No bugs were found.

DEFINE SPECIFICATIONS FOR NEXT ITERATION The triples still had to be stored in the RDF store. A SPARQL endpoint still needed to be set up. A more detailed design of the interlinking layer was created.

5.2.4 Iteration 4 and 5

SPECIFICATIONS Store the triples in the RDF store. Set up a SPARQL endpoint. Implement the Interlinking modules according to the design diagram created in iteration 3.

BUG FIXES There were no bugs to fix.

NEW FEATURES IMPLEMENTATION First the Store module was implemented; it creates the connection with the RDF store. On top of that the User and Domain modules were realized. The OpenDataLinker and the QueryParser were postponed.


TEST NEW FEATURES To test the new features a demonstration application in PHP was implemented. It matches two Twitter users based on similar hashtags and uses the main functions of two important modules of the Interlinking layer.

MARK BUGS TO FIX IN THE NEXT ITERATION No bugs were found. We expect bugs in higher level code. The code at this stage is quite straightforward and errors are quickly discovered during coding.

DEFINE SPECIFICATIONS FOR NEXT ITERATION Finish the remaining two Interlinking modules, starting with the QueryParser. Design the Analysis layer with the Scientific Profiling application in mind.

5.2.5 Iteration 6

SPECIFICATIONS Design and implement an architecture that supports the viewing ofpersons (Twitter users) and events (Conferences). Furthermore it must be possible todiscover persons and events based on common links (directly or via entities). Add filteringoptions for both location and date of both the events and tweets.

BUG FIXES Some queries take time to carry out, optimize the code in the interlinkinglayer so that the number of SPARQL queries is minimized.

NEW FEATURES IMPLEMENTATION First a TagFilter was created, it allows stemming ofthe tags and Google DYM. For now the DYM module is not used because it is to slow. Thena data class for the user profile and person was implemented. The filter for locationswas implemented. Finally a class to compare tags and users formed the basis for theUserFinder class.

TEST NEW FEATURES Subject of the next iteration.

MARK BUGS TO FIX IN THE NEXT ITERATION No bugs were found. We expect bugs in higher-level code. The code is currently quite straightforward, and errors are quickly discovered during coding.

DEFINE SPECIFICATIONS FOR NEXT ITERATION Not all filters have been implemented yet.

5.2.6 Iteration 7

SPECIFICATIONS Implement the filters for event and tweet dates. Implement the EventFinder class. Implement the API and integrate it with the UserFinder. Verify that it works correctly.

BUG FIXES There were no bugs to fix.

NEW FEATURES IMPLEMENTATION Two API calls were implemented: profile and discover. There was no time to implement the EventFinder and date filters.

TEST NEW FEATURES The output of the API calls allowed the verification of the already implemented User-related analysis modules.

MARK BUGS TO FIX IN THE NEXT ITERATION Listing friends is a very expensive call; other options are being considered.

DEFINE SPECIFICATIONS FOR NEXT ITERATION Finish the remaining modules and the API. The EventFinder module is especially important. Afterwards the focus should be on the end-user application that assists scientists in discovering interesting events (conferences) and people.

5.3 Schedule

The following schedule represents the project plan and how it is being carried out. Details are in table 5.1.

5.3.1 Notes

During the first part of the plan, the semantic and the social web are researched. This is the basis for the development of a semantic profiling framework and API. This API will be the foundation for the development of an application that fits the Research 2.0 use case introduced in this project. It is worth noting that the second part allows more time to perform all tasks. This is necessary because the final thesis report must be written at the end of that part.

5.3.2 Changes

The project plan was revised only a short while ago, so no further changes were necessary, except that we had to reserve some time to write a conference paper submission. This paper describes the semantic profiling framework so far. It was written at the same time as this report.

5.4 Conclusion

No changes were necessary to the previously updated project plan. The structure of the working method in each iteration is explained. The entire schedule from the project plan is repeated in a table in this chapter.

Target/Task | From | To | Weeks (#) | Work load (est. hours) | Description
PART 1 TUGraz | 4-Oct-10 | 24-Jan-11 | | | Main objective: Framework development for semantic analysis of Twitter feeds and extended user profile synthesis
Iteration 1 | 4-Oct-10 | 17-Oct-10 | 2 | 40 | Get familiar with current research (papers)
Iteration 2 | 18-Oct-10 | 30-Oct-10 | 2 | 40 | Research and evaluate relevant aspects more in depth
Report 1 | 1-Nov-10 | 7-Nov-10 | 1 | 20 | Write first report
Milestone 1 | 8-Nov-10 | | | | 1st written report
Iteration 3 | 8-Nov-10 | 21-Nov-10 | 2 | 40 | Develop Extraction Layer
Iteration 4 | 22-Nov-10 | 5-Dec-10 | 2 | 40 | Test Extraction Layer, develop Interlinking Layer
Report 2 | 6-Dec-10 | 12-Dec-10 | 1 | 20 | Write second report & ESWC2011 paper
Milestone 2a | 13-Dec-10 | | | | ESWC DL & 2nd report
Iteration 5 (Christmas) | 14-Dec-10 | 2-Jan-11 | 2 (+ some holidays) | 40 | Margin: used for unfinished work in it4 - start it6 earlier
Milestone 2b | 21-Dec-10 | | | | 1st presentation
Iteration 6 | 3-Jan-11 | 16-Jan-11 | 2 | 40 | Test Interlinking Layer, develop Analysis Layer / API
Iteration 7 | 17-Jan-11 | 30-Jan-11 | 2 | 40 | Test, debug and refactor API
Report 3 | 31-Jan-11 | 7-Feb-11 | 1 | 20 | Integrate the 1st & 2nd reports & add report of it5 & 6 & 7
Milestone 3 | 7-Feb-11 | | | | End of work at TUGraz
TOTAL PART 1 | | | 17 | 340 |
PART 2 KULeuven | 14-Feb-11 | 30-Jun-11 | | | Main objective: Develop a user interface that fits in a scientist's 'Research 2.0' workflow
Iteration 8 | 14-Feb-11 | 27-Feb-11 | 2 | 40 | Find out more about Research 2.0 applications & challenges
Iteration 9 | 28-Feb-11 | 10-Mar-11 | 1.5 | 30 | (a)
Iteration 10 | 11-Mar-11 | 20-Mar-11 | 1.5 | 30 | (a)
Report 4 | 21-Mar-11 | 27-Mar-11 | 1 | 20 | (a)
Milestone 4 | 28-Mar-11 | | | | 3rd written report (a)
Iteration 11 | 29-Mar-11 | 11-Apr-11 | 2 | 40 | (a)
Milestone 5 | 12-Apr-11 | | | 8 | Second presentation (a)
Iteration 12 (Easter) | 13-Apr-11 | 1-May-11 | 2 (+ some holidays) | 40 | Margin
Iteration 13 | 2-May-11 | 15-May-11 | 2 | 40 | Optimize implementation of the system
Report 5 | 16-May-11 | 29-May-11 | 2 | 40 | Write final report
Milestone 6 | 30-May-11 | | | 8 | Final written report. Review final report
Report 6 | 30-May-11 | 12-Jun-11 | 2 | 20 | Preparation for final presentation
Milestone 7 | End of June | | | 8 | Final presentation. Review final presentation
TOTAL PART 2 | | | 14 | 324 |
TOTAL | | | 31 | 664 | Avg work load 21. Margin 80

(a) Shared task description for these rows: In several iterations try to develop a solid user interface and implement it in an appropriate technology. Try to optimize the integration capabilities of the framework/API developed in part 1. Gather real user feedback! Evaluate the usability of the semantic analysis and profiling with this interface.

TABLE 5.1: The research schedule.

Chapter 6

Conclusion

The literature study in chapter 2 highlighted some issues and challenges in the current semantic web. It shows that a huge leap forward is still needed to make the social web a fruitful source of data. Both accessing and connecting the data are important issues. Social networks are like isolated islands. The information they contain is viewed by only a few people and then stored. After storage it is not put to further practical use.

The architecture of the framework consists of three layers: a data extraction layer, an interlinking layer and an analysis layer. An API, either a web service or a distributable package, will provide high-level support for a scientific profiling application. The design will grow more specific as the project evolves. An iterative development process will make this possible.

The architecture description in chapter 3 explained the extraction layer and the interlinking layer in more detail. Important details about the implementation of these layers were given in chapter 4.

The project plan foresees several iterations. This allows agility in the development. In every iteration the previous one is evaluated. If changes are necessary, they are scheduled for the upcoming iteration. This process continues cyclically until a major milestone is reached. There is enough margin to ensure that the major milestones can be met. The method followed in each iteration is outlined in the project plan in chapter 5.

Bibliography

[1] The friend of a friend (foaf) project. URL: http://www.foaf-project.org/.

[2] Semantically interlinked online communities project. URL: http://sioc-project.org/.

[3] U. Bojars, J. G. Breslin, V. Peristeras, G. Tummarello, and S. Decker. Interlinking thesocial web with semantics. pages 1–12, May 2008.

[4] J. G. Breslin, A. Harth, U. Bojars, and S. Decker. Towards semantically-interlinkedonline communities. The Semantic Web: Research and Applications, pages 500–514,2005.

[5] L. De Vocht. Semantic profiling network linked data sparql endpoint. URL:http://linkeddata.semanticprofiling.net/interlinking/endpoint_handler.php.

[6] L. De Vocht. Master thesis: Progress report 1, 2010.

[7] DERI. Deri pipes. URL: http://pipes.deri.org.

[8] M. Ebner, H. Mühlburger, and S. Schaffert. . . . Getting granular on twitter: Tweetsfrom a conference and their limited usefulness for non-participants. Key Competen-cies in . . . , Jan 2010.

[9] J. Engestrom. Why some social network services work and others do not - or: the casefor object-centered sociality. URL: http://www.zengestrom.com/blog/2005/04/.

[10] M. Hausenblas, W. Halb, Y. Raimond, L. Feigenbaum, and D. Ayers. Scovo: Usingstatistics on the web of data. The Semantic Web: Research and Applications, pages708–722, 2009.

[11] T. Lee, J. Hendler, and O. Lassila. . . . The semantic web. Scientific American, Jan 2001.

[12] D. McComb. Semantic profiling - an approach to understanding datain an existing system. URL: http://semanticarts.com/articles/semantics-and-ontologies/semantic-profiling, Sep 2004.

35

Page 44: Master Thesis: Progress Report 3 - WordPress.com · 2011. 2. 16. · profilingin social networks. 2.1 A network of linked data The semantic web represents a network of linked data.

BIBLIOGRAPHY

[13] H. Muhlburger, M. Ebner, and B. Taraghi. @twitter try out #grabeeter to export,archive and search your tweets. pages 1–9, Aug 2010.

[14] A. Passant, T. Hastrup, U. Bojars, and J. G. Breslin. Microblogging: A semantic anddistributed approach. Proceedings of the 4th Workshop on Scripting for the SemanticWeb, 2008.

[15] D. L. Phuoc, A. Polleres, M. Hauswirth, G. Tummarello, and C. Morbidoni. Rapidprototyping of semantic mash-ups through semantic web pipes. Proceedings of the18th international conference on World wide web, pages 581–590, 2009.

[16] W. Reinhardt, M. Ebner, G. Beham, and C. Costa. How people are using twitter duringconferences. Hornung-Prähauser, V., Luckmann, M.(Hg.): 5th EduMedia conference,Salzburg, pages 145–156, 2009.

[17] J. Shinavier. Real-time# semanticweb in<= 140 chars. Proceedings of the ThirdWorkshop on Linked Data on the Web (LDOW2010) at WWW2010, 2010.

[18] S. Softic, M. Ebner, H. Muhlburger, T. Altmann, and B. Taraghi. @twitter mining#microblogs using #semantic t echnologies. pages 1–12, Sep 2010.

[19] S. Softic, B. Taraghi, and W. Halb. Weaving social e-learning platforms into the webof linked data. pages 1–9, Jul 2009.

[20] S. Staab. Social networks applied. pages 1–14, Jan 2005.

[21] T. Verkoyen. Php twitter api with oauth. URL: http://classes.verkoyen.eu/twitter_oauth.

[22] W3C. Rdf validator. URL: http://www.w3.org/RDF/Validator/.

36