
Personalized Recommendation of Related Content Based on

Automatic Metadata Extraction

Andreas Nauerz1, Fedor Bakalov2, Birgitta König-Ries2, Martin Welsch1

1IBM Research and Development, Schönaicher Str. 220, 71032 Böblingen, Germany

{andreas.nauerz|martin.welsch}@de.ibm.com

2University of Jena, Institute of Computer Science, Ernst-Abbe-Platz 1-4, 07743 Jena, Germany

{fedor.bakalov|koenig}@informatik.uni-jena.de

Abstract

In order to efficiently use information, users often need access to additional background information. This additional information might be stored at various places, such as news websites, company directories, geographic information systems, etc. Oftentimes, in order to access these different pieces of information, the user has to launch new browser windows and direct them to appropriate resources. In today's Web 2.0, the problem of accessing background information becomes even more prominent: due to the large number of different users contributing, Web 2.0 sites grow quickly and, most often, in a more uncoordinated way than centrally controlled sites regarding, e.g., the structure and vocabulary used. In such an environment, finding relevant information can become a tedious task.

In this paper, we propose a framework allowing for automated, user-specific annotation of content in order to enable the provisioning of related information. Making use of unstructured data analysis services like UIMA or Calais, we are able to identify certain types of entities, such as locations, persons, etc. These entities are wrapped into semantic tags that contain machine-readable information about the entity type. The entity types are associated with applications able to provide background information or related content. A location, e.g., could be associated with Google Maps, whereas a person could be associated with the company's employee directory. However, it strongly depends on the individual user's interests and experience which additional information he deems relevant. We therefore tailor the information provided based on the user model, which reflects the user's interests and expertise. This allows providing the user with in-place, in-context background information on those entities he is likely to be interested in, as well as with recommendations of related content for those entities. It also relieves users from the tedious task of manually collecting relevant additional information.

Copyright © 2008 Andreas Nauerz, Fedor Bakalov, Birgitta König-Ries, Martin Welsch, IBM Deutschland Entwicklung GmbH, and University of Jena. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.

Our main concepts have been prototypically embedded within IBM's WebSphere Portal.

1 Introduction

Consider a manager using the company portal to read a news bulletin about rumors of a planned merger of her company's competitor with another company and the consequences this has for the stock market. In order to make the right decisions based on this piece of information, she needs additional information: background information on this specific


competitor, the company it might merge with, and a summary of the stock value over the last few months. Maybe she would like to talk to someone from her research department who is familiar with the technologies that will belong to one company if the merger comes through. Today, most likely, all this information is available online. In order to access it, the manager will open new browser windows, will enter the appropriate terms in a search engine (either a global one or one that searches the company's intranet), and will eventually locate this information.

Now, imagine a different scenario: when the manager moves her mouse over the name of the competitor, a new portlet containing relevant information about the competitor pops up; when moving to a mention of the technology, a new window with the related Wikipedia page opens. In a separate window, a list of experts from her department together with contact details and additional resources related to the news item is displayed. Clearly, this would be a lot more efficient than the first approach. It might even happen that related information that the manager was not aware of is brought to her attention: the popup could mention a new project run by one of the company's research centers that might be of interest in this context.

In this paper, we present our approach to making this happen. What is needed to realize this vision? First of all, we need a component that automatically analyzes the content and identifies terms that describe certain types of entities (e.g., persons, locations, companies, industry terms, etc.). Our solution uses the Unstructured Information Management Architecture (UIMA)1 and the Calais2 web service to achieve this. Second, we need to link entity types to applications that can provide background information. For instance, the location entity type could result in a call to Google Maps. Third, we need to make sure to present the user only with the background information he is interested in. Users will not be interested in background on things that are beyond their sphere of interest or that they already know everything about.

1 http://www.research.ibm.com/UIMA/
2 http://www.opencalais.com/

The remainder of this paper is organized as follows. Section 2 provides a short overview of previous research in the areas of annotation, personalization, and user modeling. Section 3 provides background information on the main concepts used in this paper, outlines the system architecture of the basic recommender system we implemented in our previous work, and describes its drawbacks. Section 4 provides information about the improved system architecture. Finally, Section 5 discusses limitations and future extensions of our system.

2 Related Work

The recommender system described in this paper is related to work in the areas of semantic annotation and tagging, personalization, and user modeling.

Early research on content annotation pursued a bottom-up approach to enrich the content of web pages with accompanying semantic annotations. The proposed systems helped to complete two tasks: first, to create ontologies, and second, to annotate web pages with ontology-derived semantic annotations. But it turned out that leaving the annotation work up to human beings is a tedious task [8].

Today, a number of tools for producing semantic annotations exist (Protégé [16], OntoAnnotate [19], Annotea [11], CREAM [9], etc.). Authoring tools help to extract metadata from the contents of web pages but, most often, force authors to first create the content and then annotate it in an additional annotation step. Obviously, these systems assume that content is created once and rarely updated. This assumption does not hold in today's Web 2.0 world, where a lot of users continuously contribute content.

To automate the task of annotating content, a lot of research has been done in the field of information extraction, whose goal is to automatically extract structured information from unstructured machine-readable documents. Although the approaches to performing the extraction often differ, most papers in this area regard information extraction as a proper way to automatically extract semantic annotations


from web content. Thus, such systems (e.g., the World Wide Web Wrapper Factory [18]) allow for the construction of wrappers by explicit definition of HTML or XML queries. They try to extract detailed structural data out of the pages and require significant training before they can be productive.

Other techniques are based on machine learning: in [21], Yang describes a machine learning approach to extracting semantic annotations from web content. Another system applying large-scale text analytics to extract semantic annotations is SemTag [7]. Similarly, Li et al. [23] propose a machine-learning-based automatic annotation approach that allows for annotating web pages at the sentence level.

In [10], Handschuh et al. describe a framework for metadata creation when web pages are generated from a database and the database owners are cooperatively participating in the Semantic Web. In order to create metadata, the framework combines the presentation layer with the data description layer.

Recently, a few novel approaches have been proposed to address the problem of personalized annotation. In [5], Chirita et al. describe a method that automatically generates personalized annotations of website content based on the information stored on the user's desktop. In other work, Ankolekar et al. [3] propose an architectural solution for website personalization that takes place on the client side and allows personal information about users to be stored locally.

An important component of personalized recommender systems is the user model. A lot of research has been done in the area of user modeling. A considerable number of user modeling approaches concentrate on the application of semantic technologies for the representation and acquisition of user interests and preferences. Pereira and Tettamanzi [17] describe an automated method for the acquisition of user models. In their approach, the initial user models are constructed using the data explicitly provided by the users at the time of registration. The models are then constantly updated based on the data collected from the users' interaction with the system. Another user modeling approach is presented by Achananuparp et al. in [1]. The authors

describe a method for building semantically enhanced user models based on the analysis of clickstreams and web usage logs.

Our system is based on our previous work on automatic metadata extraction. In [14], we presented a recommender system that identifies information pieces of certain types in text fragments and automatically wraps them into semantic tags, which we then connect to other (external) information sources in order to recommend background information and related content.

However, our previous recommender system did not take the user's interests and preferences into account. This led to the problem of producing a large number of irrelevant recommendations. The system highlighted every tagged fragment and linked it to external resources without considering the relevance of this information to the user. To rectify this problem, we introduced a user model that represents the knowledge about user interests. We use this model to select only those tagged fragments that contain information that is of interest to the user.

3 Background

In this section, we give a brief overview of our previous work that the approach proposed in this paper is based upon. We will first describe the basic concepts underlying that earlier system, then we will introduce the system architecture and, finally, we will identify the weaknesses of the previous approach. The extensions proposed in this paper are aimed at overcoming these weaknesses.

3.1 Concepts

In the following subsections, we will first describe an automatic method for extracting semantic annotations from text (or markup, respectively). We need these annotations as the basis for recommendations. We then describe the different concepts for recommending (external) information sources that provide users with background information or related content. Finally, we describe the mechanisms for tailoring Web content to the needs of individual users.


3.1.1 Annotation

Before we continue, we want to provide definitions for the terms annotation and semantic tagging: we define annotation as the general process of adding meta information about an artifact (e.g., an already semantically tagged information piece inside a text). We define semantic tagging as the process of adding meta information by injecting (embedding) tags into XHTML fragments. Thus, semantic tagging is performed entirely at the markup level.
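To make this distinction concrete, tagging at the markup level can be sketched as follows. The `<span>` format with a `data-entity-type` attribute and the entity list are our own illustrative assumptions; the paper does not prescribe a concrete tag format.

```python
import re

def semantic_tag(xhtml, entities):
    """Inject semantic tags into an XHTML fragment.

    `entities` maps an entity string to its type, e.g. {"Alice": "person"}.
    The <span data-entity-type="..."> markup is illustrative only.
    """
    for text, etype in entities.items():
        tag = f'<span class="semantic-tag" data-entity-type="{etype}">{text}</span>'
        xhtml = re.sub(r"\b" + re.escape(text) + r"\b", tag, xhtml)
    return xhtml

fragment = "<p>Alice travels to New York.</p>"
print(semantic_tag(fragment, {"Alice": "person", "New York": "location"}))
```

Because the tags are injected into the markup itself, any later processing layer can recognize the entity type without re-running the analysis.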

We distinguish three types of annotation. First, automatic annotation, where the system analyzes markup to find occurrences of identifiable information pieces of certain types. Second, semi-automatic annotation, where the user can tell the system the type of a piece of information that the system could not unambiguously identify, so that it can be wrapped into a semantic tag. Third, manual tagging, where the process of annotation is left entirely up to the user.

In our recommender system, we are mainly interested in automatic annotation. Here, the system processes the content of pages and portlets using unstructured information analysis methods in order to automatically extract entities of certain types, such as persons, locations, etc. The system then enriches the extracted entities by wrapping them into semantic tags. Afterwards, these semantic tags are, depending on the information type, associated with application logic that allows for advanced interaction with the information piece.

For example, if a portlet contains text in which the name of a person (Alice), the location New York, or the abbreviation WWW occurs, our system analyzes the entire markup, identifies the information pieces just mentioned, and wraps them into semantic tags. The additional logic associated with information pieces of type person allows looking up more detailed information about the person.

Within our solution, a click on the semantic tag wrapping a person's name results in a popup appearing that provides more detailed information about the person, retrieved from his profile stored in the company's employee directory. Similarly, a click on the semantic tag wrapping a location results in a popup appearing that provides details about this location by displaying it on a Google Map.

3.1.2 Recommending Background Information

Initially, the World Wide Web was basically a read-only medium. Authors put their static documents onto web servers, which users read via their browsers. It was up to these authors to decide which (external) sources to link to in order to provide additional background information. However, in today's Web 2.0 world, content is created by entire user communities. Social networking sites, blogs, and wikis facilitate collaboration and sharing between users. Users no longer only retrieve information; they also contribute content.

Due to the large number of different users contributing, Web 2.0 sites grow quickly and, most often, in a more uncoordinated way than centrally controlled sites. Different users use different terms to describe the same things. Some terms might be well understood by most users, some might not. Web 2.0 communities are often heterogeneous, with widely differing user expertise.

Thus, looking up terms is needed more frequently and becomes a tedious task. But when reading web sites, users want background information at their fingertips. If they do not understand what an abbreviation or a term stands for, who a certain person actually is, or where a certain city is actually located, they want to be able to retrieve this information as easily and quickly as possible. They do not want to fire up a search engine to search for another site from which they could probably get the information they want, but rather be provided with that information directly, in place.

In our Web 2.0 world, a lot of this information is available via RSS/Atom feeds, via web services, or simply as part of "traditional" web pages. What is missing is an environment that identifies information pieces users would typically want to look up, together with means to access additional (external) information sources that can provide users with the necessary background information. We want to provide such an environment, which unobtrusively enriches the information pieces this way.


3.1.3 Recommending Related Content

Analyzing occurrences of semantically tagged information pieces also allows us to recommend related content. For instance, if the term WebSphere Portal is identified in a news portlet and hence semantically tagged as a product name, our system would provide users with background information about WebSphere Portal, probably by linking to the product site.

However, within a portal system, the same term might occur in many other places, e.g., in a wiki portlet where users have posted best practices, tips, and tricks for working with this product; in a blog where users have commented on the product; and so forth. We track all occurrences and recommend an appropriate subset of them as related content as soon as the user interacts with one single occurrence. The order in which recommendations are listed depends on how often people have interacted with a certain occurrence so far.
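The interaction-based ordering just described can be sketched as follows. The occurrence identifiers and the `limit` parameter are hypothetical; this is a minimal sketch, not the system's actual implementation.

```python
from collections import Counter

interactions = Counter()  # occurrence id -> number of user interactions so far

def record_interaction(occurrence_id):
    interactions[occurrence_id] += 1

def recommend(occurrence_ids, limit=3):
    # list occurrences of the same semantic tag, most-interacted-with first
    return sorted(occurrence_ids, key=lambda o: interactions[o], reverse=True)[:limit]

for oid in ["wiki/tips", "blog/review", "wiki/tips", "news/item"]:
    record_interaction(oid)

print(recommend(["wiki/tips", "blog/review", "news/item", "forum/q1"]))
```

Because the counter is shared across all users, popular occurrences rise to the top of everyone's recommendation list.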

This can be taken even one step further. We also allow users to annotate already semantically tagged information pieces. This way, we can recommend related content not only by having identified "exactly matching" occurrences of semantically tagged information pieces, but also by having identified similarly annotated, but differently semantically tagged, information pieces. In [15], we described an algorithm that allows us to determine the relatedness between even differently annotated resources, which allows us to provide even more, and more accurate, recommendations of related content.

The power of the concepts just described is based on the fact that we make use of the entire community's collective intelligence in addition to the fully automatic semantic annotation mechanism. The entire community can not only contribute content, but also describe the relations between the available information pieces. As collective intelligence tends to outperform single users' judgments [20], we can assume that the community will be able to categorize content better than even the best author could.

Another general assumption is that annotating expresses interest in a resource. Hence, we can assume that information pieces annotated more often by a user are of higher importance to him. Even more importantly, since annotating is a collaborative process, we can also assume that resources annotated more often by all users are of higher importance to the entire community. Thus, analyzing users' annotation behavior allows us to better understand their interests and preferences and to recommend "more interesting" content more easily.

3.1.4 Web Personalization

Today, we experience not only the massive growth of information on the Internet and companies' intranets, but also the emergence of new forms of information. Users find the information they need in various systems, such as news portals, online shops, geo-information systems, blogs, wikis, and many others. All these systems are potential sources of background information and related content and hence can be used for providing recommendations on an arbitrary number of extracted entities.

On the other hand, the availability of huge amounts of data can become a serious threat to usability. A large number of recommendations might distract the user from the main content and cause frustration. Therefore, we need a mechanism that is able to elicit the user's interests and provide only the content that is relevant to these interests. Such a mechanism is called personalization.

In the area of Web Information Systems, personalization is the process of adapting Web content to the needs of individual users or groups of users based on knowledge about the users' interests. The goal of personalization is to provide users with the information they need without asking them explicitly for it. Personalization is a broad area and includes recommender systems, adaptation, and customization [13].

In our particular case, we aim to use personalization techniques to provide recommendations for background information and related content that is relevant to the user's information needs. Our method uses the knowledge about users to select, among the automatically extracted entities, the pieces of information that match the user's interests. These selected pieces are further linked to external resources


Figure 1: Layered architecture

in order to issue the personalized recommendations.

The key component of personalization is the user model, which describes who the users are, what interests they have, and how they behave. User models vary in terms of the information they store, and so do the approaches and techniques used to construct them. In our recommender system, we use a hybrid approach to collect knowledge about users.
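The selection step described above can be sketched as follows. The user-model schema (interest topics mapped to scores in [0, 1]) and the relevance threshold are illustrative assumptions; the paper's actual user model is richer than this.

```python
def personalize(entities, user_model, threshold=0.5):
    """Select only those extracted entities that match the user's interests.

    `user_model` maps interest topics to scores in [0, 1]; the schema and
    threshold are illustrative assumptions, not the system's actual model.
    """
    return [e for e in entities
            if user_model.get(e["topic"], 0.0) >= threshold]

entities = [
    {"text": "WebSphere Portal", "topic": "software"},
    {"text": "New York", "topic": "travel"},
]
manager_model = {"software": 0.9, "travel": 0.1}
print(personalize(entities, manager_model))
```

Only the entities that survive this filter are wrapped with recommendation logic, which keeps the page uncluttered for users with narrow interests.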

3.2 Basic Recommender System

In this section, we describe the architecture of our earlier system. This system is able to automatically annotate portal resources and to use these annotations to provide background information and recommend related content.

The architecture of our basic recommender system is six-layered (cp. Fig. 1): content (in the form of markup) is delivered by the Portal or, to be more precise, by portlets residing on pages that are part of the Portal. The content is then analyzed by the analysis engines that are part of our Analysis Layer. Here, so-called Annotators extract information pieces like people, locations, etc. from the markup received. At the Semantic Tagging Layer, the results of the analysis are converted into proper markup. This means that in this step we construct markup which we can wrap around identified information pieces; in other words, in this step we perform the actual semantic tagging. The Recommendation Layer determines other occurrences of the same semantic tags and occurrences of similarly user-annotated information pieces. At the Service Integration Layer, we determine the (external) services to which we could connect for each single semantic tag to provide users with additional information. For example, for a semantic tag corresponding to a person, we could connect to the company's employee directory, or to Google Maps to visualize the person's work location. At the Presentation Layer, we associate (client-side) application logic with the semantic tags that allows for the actual interaction (i.e., for the invocation of (external) services, etc.).
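The layered processing of markup can be sketched as a simple function chain. The layer stubs below are hypothetical stand-ins; in the real system each layer is a full component with its own state.

```python
def analysis(markup):
    # stand-in for the Analysis Layer: would run Annotators over the markup
    return markup

def semantic_tagging(markup):
    # stand-in for the Semantic Tagging Layer: wrap identified entities
    return markup.replace(
        "Alice", '<span data-entity-type="person">Alice</span>')

def enrich(markup, layers):
    # pass the portlet markup through each layer in order, as in Fig. 1
    for layer in layers:
        markup = layer(markup)
    return markup

print(enrich("<p>Alice posted an update.</p>", [analysis, semantic_tagging]))
```

The design point is that each layer only consumes and produces markup, so layers can be added or reordered without touching the others.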

We illustrate the details of our architecture in the following sections.

Portal Layer. Portals provide users with a central point of access to information. They aggregate information and services from various sources.

Portals are comprised of pages, which are comprised of portlets. Portlets are applications that provide specific pieces of content. They are managed by a portlet container that processes requests and generates dynamic content. A portlet is a pluggable user interface component. The content generated by a portlet is called a fragment. In contrast to a servlet, which generates complete documents, the fragments generated by a portlet are pieces of markup (e.g., HTML, XHTML). In turn, these fragments can be aggregated with other fragments to form a complete document. Usually, several portlets are aggregated into a portal page that represents, for instance, a complete HTML document.

Filters make common aspects available to portlets. Instead of implementing a certain functionality within each portlet, a filter implements the aspect centrally. Different filters can be combined via a filter chain.


Figure 2: UIMA Analysis Engine

We use filters to change the response (i.e., to "enrich" the original markup by injecting semantic tags) before it reaches the client. This means that instead of simply transmitting the markup from the Portal to the client, we pass the markup to the analysis and subsequent layers to enrich it before we deliver it to the client.

Analysis Layer. For analyzing unstructured information, such as text residing inside portlets, our original system relied on the Unstructured Information Management Architecture (UIMA). UIMA is a software architecture for developing and deploying unstructured information management (UIM) applications. Typically, such UIM applications analyze large volumes of unstructured information in order to present the knowledge relevant to the end user in a structured form.

Typical UIM applications make use of a variety of technologies, including statistical and rule-based natural language processing (NLP), information retrieval, machine learning, ontologies, and automated reasoning. UIM applications may also include structured data sources, such as databases or word lists, to help resolve the semantics of the unstructured source. UIMA gives developers the ability to create component-based UIM applications.

Our UIMA analysis engine is constructed from components called Annotators. The Annotators implement the actual analysis logic. They analyze an artifact (e.g., a text document) and create metadata about that artifact. An Analysis Engine (AE) may contain a single Annotator (Primitive AE), or it may be a composition of others and therefore contain multiple Annotators (Aggregated AE).

Fig. 2 shows an Aggregated AE that consists of three Annotators: a Tokenizer, a Regex Annotator, and a People Annotator, where the latter is itself an Aggregated AE containing a Wordlist Annotator and a Directory Annotator.

The Tokenizer splits the input into Token Annotations using simple whitespace segmentation. The result of tokenization is a set of token and sentence annotations. For example, the text "The author of this paper is Andreas Nauerz" is tokenized into eight tokens and annotated as a sentence.

The Regex Annotator retrieves the tokens from the Tokenizer's Annotation Index, compares each token against a regular expression that potentially matches a person name, and creates a PotentialPersonName Annotation. In our sample, the words Andreas and Nauerz both start with an uppercase character, which might be a good indicator of a name within an English text. After that, the first Wordlist Annotator belonging to the aggregated People Annotator compares the PotentialPersonName's first name with a word list that contains common first names. Because Andreas is a common first name, the word list includes it.

The next annotation also comes from the Regex Annotator, but it does not occur in the word list because Nauerz is not a common first name. Nevertheless, due to the minimal distance between these annotations, the Wordlist Annotator decides to create a PersonName Annotation in which it sets the so-called features firstName and lastName to Andreas and Nauerz, respectively.

The final Directory Annotator queries the employee directory to check whether a person with this name exists. For stored entries of people, it creates a Person Annotation instance that is additionally extended with other information


from the directory (e.g., date of birth, phone number, office).
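The annotator chain just described (Tokenizer, Regex Annotator, Wordlist Annotator, Directory Annotator) can be sketched as below. The word list, directory contents, and regular expression are hypothetical stand-ins; the real Annotators are UIMA components, not plain functions.

```python
import re

# Stand-ins for the first-name word list and the employee directory.
COMMON_FIRST_NAMES = {"andreas", "alice", "bob"}
EMPLOYEE_DIRECTORY = {("Andreas", "Nauerz"): {"office": "Boeblingen"}}

def tokenize(text):
    # Tokenizer: simple whitespace segmentation
    return text.split()

def candidate_indices(tokens):
    # Regex Annotator: capitalized tokens are potential name parts
    return {i for i, t in enumerate(tokens) if re.fullmatch(r"[A-Z][a-z]+", t)}

def annotate_persons(tokens):
    candidates = candidate_indices(tokens)
    persons = []
    for i in sorted(candidates):
        # Wordlist Annotator: a known first name directly followed by
        # another candidate token yields a PersonName annotation
        if tokens[i].lower() in COMMON_FIRST_NAMES and (i + 1) in candidates:
            person = {"firstName": tokens[i], "lastName": tokens[i + 1]}
            # Directory Annotator: confirm the person exists and attach
            # extra information from the directory entry
            entry = EMPLOYEE_DIRECTORY.get((person["firstName"], person["lastName"]))
            if entry:
                person.update(entry)
                persons.append(person)
    return persons

tokens = tokenize("The author of this paper is Andreas Nauerz")
print(annotate_persons(tokens))
```

Each function corresponds to one Annotator in Fig. 2; chaining them mirrors the Fixed Flow through the Aggregated AE.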

The result of an Annotator's work is a set of so-called Feature Structures, which are data structures that have a type and a set of (attribute, value) pairs. An annotation is a particular type of Feature Structure that is attached to a region of the artifact being analyzed (e.g., a span of text in a document).

All Feature Structures, including annotations, are represented in the UIMA Common Analysis Structure (CAS). The CAS is the central data structure through which all UIMA components communicate. In Fig. 2, we have illustrated how one CAS is passed from one Annotator to another. Each Annotator can make contributions to the CAS's Annotation Index.
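A minimal stand-in for the CAS can illustrate the idea of a shared text plus an annotation index; the class below is an illustrative sketch, not UIMA's actual API.

```python
class CAS:
    """Minimal stand-in for UIMA's Common Analysis Structure: the document
    text plus an Annotation Index shared by all Annotators."""

    def __init__(self, text):
        self.text = text
        self.index = []  # the Annotation Index

    def add(self, type_name, begin, end, **features):
        # a Feature Structure: a type plus (attribute, value) pairs,
        # attached to a span [begin, end) of the analyzed text
        self.index.append(
            {"type": type_name, "begin": begin, "end": end, **features})

    def select(self, type_name):
        return [a for a in self.index if a["type"] == type_name]

cas = CAS("The author of this paper is Andreas Nauerz")
cas.add("PersonName", 28, 42, firstName="Andreas", lastName="Nauerz")
print(cas.select("PersonName")[0]["lastName"])
```

Because every Annotator reads from and writes to the same index, later Annotators (like the Directory Annotator) can build on annotations created by earlier ones.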

The Annotators of an Aggregated AE are arranged in a Flow. In the example above, the AE uses the simplest possible Flow, a Fixed Flow that is included in the UIMA framework. The purpose of a Flow is to decide which Annotator is included in an analysis. Because a Flow has access to the CAS, very sophisticated Flows can be implemented by Annotator developers.
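The annotator chain described above can be sketched as follows. This is a simplified, hypothetical Python analogue of the Java-based UIMA flow: the CAS is reduced to a plain annotation index (a list), the fixed flow is a loop over annotators, and the wordlist and distance check mirror the Andreas/Nauerz example; all names and the toy wordlist are our own.

```python
import re

def regex_annotator(text, cas):
    # Mark capitalized tokens as potential person-name parts.
    for m in re.finditer(r"\b[A-Z][a-z]+\b", text):
        cas.append({"type": "PotentialPersonName",
                    "begin": m.start(), "end": m.end(),
                    "text": m.group()})

COMMON_FIRST_NAMES = {"Andreas", "Alice", "Bob"}  # toy wordlist

def wordlist_annotator(text, cas):
    # Merge a known first name with the directly following token.
    potentials = [a for a in cas if a["type"] == "PotentialPersonName"]
    for first, last in zip(potentials, potentials[1:]):
        adjacent = last["begin"] - first["end"] <= 1  # minimal distance
        if first["text"] in COMMON_FIRST_NAMES and adjacent:
            cas.append({"type": "PersonName",
                        "firstName": first["text"],
                        "lastName": last["text"]})

def analyze(text, flow):
    cas = []                 # stands in for the CAS annotation index
    for annotator in flow:   # a fixed flow: run annotators in order
        annotator(text, cas)
    return cas

cas = analyze("Andreas Nauerz joined the meeting.",
              [regex_annotator, wordlist_annotator])
persons = [a for a in cas if a["type"] == "PersonName"]
```

Each annotator only reads and extends the shared index, which is the essential property of the CAS-based communication described above.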

Semantic Tagging Layer. At this level the results of the annotation must be converted into proper markup. This layer needs to know about the annotation objects created by the analysis layer. The decision whether and how annotation objects are rendered as semantic tags is made here. This means that this layer is responsible for generating valid HTML fragments representing semantic tags, which can later be injected into the original text (or markup, respectively).
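A minimal sketch of such fragment generation might look as follows; the element, class, and attribute names are illustrative choices, not the paper's actual markup.

```python
from html import escape

def render_semantic_tag(annotation, original_text):
    """Wrap the annotated span in a <span> carrying machine-readable
    type information; class/attribute names are our own assumptions."""
    span = original_text[annotation["begin"]:annotation["end"]]
    return ('<span class="semantic-tag" data-type="%s">%s</span>'
            % (escape(annotation["type"]), escape(span)))

text = "Andreas Nauerz works in Boeblingen."
tag = render_semantic_tag({"type": "PersonName", "begin": 0, "end": 14}, text)
# Injecting the fragment back replaces the plain span with its tagged version.
enriched = tag + text[14:]
```

Escaping the span text keeps the generated fragment valid HTML even when the annotated text itself contains markup characters.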

Recommendation Layer. The Recommendation Layer constructs, for each semantically tagged information piece, a list comprised of references to the other "exactly matching" semantically tagged information pieces. This means that when a user, for example, clicks the information piece Andreas Nauerz (obviously a person name), he will be provided with a list of all other places where this information piece occurs in the Portal system.

The layer also constructs a similar list comprised of references to "similarly" annotated information pieces. This means that when a user, for example, clicks the previously introduced person Alice, he will be provided with a list of all places where related persons (in the given sample, Bob and Charly) occur in the Portal system.
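The two lists can be sketched as lookups over an occurrence index; both the occurrence log and the relatedness relation below are hypothetical stand-ins for data the Portal system would collect.

```python
from collections import defaultdict

# Hypothetical occurrence log: (entity, page) pairs collected while tagging.
occurrences = [("Alice", "pageA"), ("Alice", "pageC"),
               ("Bob", "pageB"), ("Charly", "pageC")]

# Hypothetical relatedness, e.g. derived from shared annotations.
related = {"Alice": {"Bob", "Charly"}}

index = defaultdict(set)
for entity, page in occurrences:
    index[entity].add(page)

def exact_matches(entity):
    # All places where exactly this information piece occurs.
    return sorted(index[entity])

def similar_matches(entity):
    # All places where related information pieces occur.
    pages = set()
    for other in related.get(entity, set()):
        pages |= index[other]
    return sorted(pages)
```

Clicking Alice would then yield `exact_matches("Alice")` for the exact list and `similar_matches("Alice")` for the pages where Bob and Charly occur.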

Service Integration Layer. The Service Integration Layer scans the markup that has meanwhile been enriched with semantic tags. It determines the so-called External Service Connectors that are available for each type of semantic tag and generates the necessary invocation code. The available services are determined via the Service Registry, which maps each information type to its available services. A service connector is a piece of application logic that allows connecting to an external service such as Google Maps.

Presentation Layer. The Presentation Layer scans the markup that has meanwhile been enriched with semantic tags, which have additionally been assigned the information about which recommended content and which External Service Connectors are available. At this layer we add (client-side) application logic for each semantic tag, allowing for the actual invocation of the external services or for accessing related content.

3.3 Limitations

Our system described above provides basic recommender functionalities allowing users to access related content and background information on the entities that were extracted automatically by the UIMA analysis engines. However, after an initial evaluation of the system, we have identified a number of limitations and drawbacks:

1. Irrelevant recommendations. The system provides additional information on every object that was tagged by the analysis engine without respect to the relevance of the tagged object to the user. The system does not take into account user interests and preferences. Therefore, it leads to a large number of irrelevant recommendations that harm the overall usability of the system.


2. Hardcoded binding of information types to sources of related content. In our basic recommender system, the information types are mapped to the sources of additional information by the developer. This makes it impossible for the administrators of the system to specify new sources of related content and background information.

3. Huge amount of work required to develop analysis engines. Development of unstructured information analysis engines, in general, is a technologically complicated process, which, in most cases, requires a large corpus of annotated documents for training the models and extensive vocabularies of terms and concepts in a certain domain, as well as sophisticated regular expression patterns and machine learning algorithms. Therefore, development of an analysis engine for a new information type may become unacceptably expensive.

4 Architectural Extensions

Considering the limitations and drawbacks of our basic recommender system, we propose a number of architectural extensions and improvements. After a brief overview, we will describe each of them in detail in this section.

1. Generation of user-specific recommendations. The recommender system must take into account user interests and preferences in order to provide recommendations tailored to the needs of individual users. In order to achieve this, the system obviously needs some knowledge about the user. Therefore, we have added a user model to our architecture. This user model contains static and dynamic information about the user, in particular about her interests and areas of expertise.

2. Mechanism for flexible mapping of information types to information sources. We also need a flexible mechanism to specify the rules that govern what sources should be queried for additional information given a certain concept that the user is interested in. For this purpose, we have introduced a personalization model, which allows a multidimensional representation of the factors that influence the decision on what information to deliver.

3. Harnessing external unstructured information analysis engines. In order to reduce the amount of work required for the development of unstructured information analysis engines, the system must be able to use external engines and services for automatic tagging. As an example, we provide an integration with the Calais service and show how this (currently freely available) service can be used instead of our custom-built UIMA engine.

In order to implement the above-mentioned features, we have made a number of extensions. Fig. 3 shows the system architecture of our extended recommender system. We introduced a Domain Model that defines top-level and domain-specific concepts and relations among them, as well as a Task Model that defines common and domain-specific information-gathering actions. The layered architecture of the old system (fig. 1) was extended by a Personalization Layer that resides between the Analysis Layer and the Semantic Tagging Layer. The layer includes a User Model that contains knowledge about user features and a Personalization Model that defines the rules governing what information should be delivered to the user when she encounters a certain object on a Portal page. Finally, the UIMA analysis engines, located on the Analysis Layer, were supplemented by an external unstructured text analysis service, in this case Calais. The proposed architectural improvements are described in more detail in the following sections.

4.1 Domain Model

The key element in our system that both the user model and the concepts-actions model refer to is the Domain Model. A domain model is a structure that defines concepts and relations among them in a given domain, e.g., finance, medicine, biology, etc. [2]. We have chosen the finance domain for our proof-of-concept implementation. Therefore, in our domain model we define the concepts that users from the financial realm may work with, such as stock, bank, account, etc.

Figure 3: Extended System Architecture

In our recommender system, the domain model is represented as an ontology, which defines concepts as ontological classes by specifying their properties and relations to other classes. E.g., an event Acquisition, denoting the fact of one company acquiring another, can be described as a subconcept of the FinancialTransaction concept and described by such attributes as date, acquiree, acquirer, and transactionAmount.

The domain model also contains instances of concepts. E.g., for the concept Company, the model contains such instances as "Microsoft", "IBM", "Google", etc. Class instances might be required to represent specific user interests in the user model. For example, our manager is reading a news article about the acquisition of Company A by Company B. The manager has been dealing with the former company for many years, hence she possesses expert knowledge of it. However, she has never heard anything about the latter company; therefore, she may need background information on it. In order to model such a situation in the user model, we need to have instances of Company A and Company B explicitly defined in the domain model.
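The essential structure of such a domain model (subconcept relations, attributes, and instances) can be sketched as plain dictionaries; the real model is an ontology, so this encoding is our own simplification, using only the names from the examples above.

```python
# Subclass relations from the Acquisition example.
subconcept_of = {"Acquisition": "FinancialTransaction"}

# Concept -> attribute names.
attributes = {"Acquisition": ["date", "acquiree", "acquirer",
                              "transactionAmount"]}

# Concept -> known instances.
instances = {"Company": ["Microsoft", "IBM", "Google",
                         "Company A", "Company B"]}

def is_a(concept, ancestor):
    """Walk the subconcept chain upwards, as an ontology reasoner
    would resolve a subclass relationship."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = subconcept_of.get(concept)
    return False
```

With explicit instances like "Company A" and "Company B", the user model can reference them directly when recording instance-level interests.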

4.2 Task Model

In addition to the definition of concepts and relationships between them, we need a structure that defines domain tasks, the actions that can be performed by users in a certain domain. We specify such actions in a Task Model. Here, we concentrate on the information-gathering activities that could be taken by the user on a certain object contained in the text she is reading. The model itself is represented as an ontology. Each task is defined as an ontological concept and described by a set of input and output concepts. E.g., for the task getCompanyAddress, we specify Company as an input concept and Address as an output concept. The input and output concepts are derived from the domain model.

As we have chosen the finance domain for our proof-of-concept implementation, the task model in our system defines most of the information-gathering actions performed by employees of financial organizations, e.g.: getStockQuotes, getCurrencyExchangeRates, getMarketStatistics, etc.

The information-gathering activities are used in the Personalization Model to define the information that should be provided to the user when a certain object is present in the document.
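Reduced to its essentials, a task entry pairs an action with its input and output concepts, and the applicable actions for a tagged object follow from matching its concept against the inputs. The dict encoding below is our own sketch of the ontological model; the task names come from the examples above.

```python
# Each task maps to its input and output concepts, as in the
# getCompanyAddress example from the text.
tasks = {
    "getCompanyAddress": {"input": "Company", "output": "Address"},
    "getStockQuotes":    {"input": "Company", "output": "StockQuote"},
    "getMap":            {"input": "Address", "output": "Map"},
}

def tasks_for_concept(concept):
    """All information-gathering actions applicable to a concept."""
    return sorted(name for name, t in tasks.items()
                  if t["input"] == concept)
```

For a semantic tag of type Company, this lookup yields the candidate actions the Personalization Model can then filter by user interests and expertise.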

4.3 User Model

Many recommender engines need information about the user to generate relevant recommendations. In order to provide these engines with the necessary input, we construct a User Model reflecting various user features. Our user model consists of two parts: static and dynamic. The static part represents the user's demographic characteristics, such as date of birth, gender, mother tongue, etc. The dynamic part of the model represents the user's interests and expertise.

The dynamic part is constructed as an overlay model [4]: The interests and expertise of an individual user are represented as references to the concepts defined in the domain model. We also indicate values that show the degree of the user's interest and expertise in these concepts. To specify these values, we use two sets of linguistic variables: {not interested, partially interested, interested} and {novice, medium, expert}. We associate these variables with fuzzy sets, which are defined by the corresponding membership functions [22, 12].

The membership functions are based on the concept frequency value, which denotes the importance of a certain concept for the user. To compute this value, we need to know which concepts, and how many times, the user has encountered in the past. For this purpose, we analyze the content of the pages accessed by the user and extract certain named entities using the Calais Web service and the UIMA framework. These two analysis engines extract such entities as industry term, technology, company, location, etc. Occurrences of those entities are recorded in the user log, which allows us to estimate the importance of a concept by computing the concept frequency value:

CF_{i,j} = c_{i,j} / Σ_k c_{k,j}    (1)

where c_{i,j} is the number of occurrences of concept i for user j, and the denominator is the total number of occurrences of all concepts registered for user j. The computed concept frequency is a real number that can take values from 0 to 1. We use this value to define the membership functions that represent the degree of user interest; µ_ni, µ_pi, µ_i show the degree to which the user is not interested, partially interested, and interested in the given concept (Figure 4). In the same way we define the membership functions to represent the degree of user expertise; µ_n, µ_m, µ_e show the degree of the user being novice, medium, and expert in a given concept (Figure 5).

Figure 4: Membership functions for user interests

Figure 5: Membership functions for user expertise
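A worked example of Eq. (1) and the fuzzification step might look as follows. Since the exact function shapes of Figures 4 and 5 are not reproduced here, the piecewise-linear membership functions and their breakpoints (0.1 and 0.4) are assumptions for illustration only.

```python
def concept_frequency(log, concept):
    """Eq. (1): occurrences of one concept divided by the total
    number of occurrences of all concepts for this user."""
    total = sum(log.values())
    return log.get(concept, 0) / total if total else 0.0

def interest_memberships(cf, low=0.1, high=0.4):
    """Return (µ_ni, µ_pi, µ_i) over CF in [0, 1]. Piecewise-linear
    shapes and breakpoints are illustrative assumptions, not the
    paper's actual functions."""
    if cf <= low:
        return (1 - cf / low, cf / low, 0.0)
    if cf <= high:
        return (0.0, (high - cf) / (high - low), (cf - low) / (high - low))
    return (0.0, 0.0, 1.0)

# Hypothetical user log of entity occurrences.
log = {"Banking": 6, "Insurance": 3, "Sports": 1}
cf = concept_frequency(log, "Banking")          # 6 / 10 = 0.6
mu_ni, mu_pi, mu_i = interest_memberships(cf)   # fully "interested"
```

The expertise variables {novice, medium, expert} would be fuzzified in the same way over a second set of membership functions.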


4.4 Personalization Model

In the Personalization Model, we define the personalization rules that govern what information is provided to the user. These rules basically define which information-gathering actions should be delivered to a user with certain interests and knowledge when she encounters a certain concept in the text she is reading. The personalization rules are specified in the Event-Condition-Actions [6] form as shown in Listing 1, where event denotes a situation when the user encounters a certain concept in the document she is reading, condition is a combination of user features and context descriptions, and actions are the information-gathering actions that should be provided to the user when the event occurs.

on(event)

if(condition)

then(actions)

Listing 1: Personalization Rule Formula

In order to combine different user features and context descriptors, we represent them as dimensions. In turn, the actions are specified at the intersections of these dimensions. For instance, in order to represent a personalization rule for an event when a student of a business school, interested in banking, with no knowledge in this field, is reading a news article that mentions a bank, we need to create three dimensions: User Interests, User Expertise, and Document Concept, and plot the values "Banking", "Novice", and "Bank" respectively. At the intersection of these values, we specify what information should be delivered to the user, which in this case could be the website of the bank, an encyclopedia article about the bank, and news related to this bank (Figure 6).
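The business-school-student rule can be sketched in Event-Condition-Actions form as follows; the dict/lambda encoding and the action names are our own illustration of Listing 1, not the system's actual rule syntax.

```python
# One personalization rule: event = concept encountered in the document,
# condition = predicate over user features, actions = what to deliver.
rules = [{
    "event":     {"document_concept": "Bank"},
    "condition": lambda user: ("Banking" in user["interests"]
                               and user["expertise"].get("Banking") == "novice"),
    "actions":   ["showBankWebsite", "showEncyclopediaArticle",
                  "showRelatedNews"],
}]

def actions_for(user, document_concept):
    """Fire all rules whose event matches and whose condition holds."""
    triggered = []
    for rule in rules:
        if (rule["event"]["document_concept"] == document_concept
                and rule["condition"](user)):
            triggered.extend(rule["actions"])
    return triggered

student = {"interests": {"Banking"}, "expertise": {"Banking": "novice"}}
```

Evaluating `actions_for(student, "Bank")` plays the role of looking up the intersection of the three dimensions in Figure 6.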

4.5 Service Registry

The Service Registry is a central database for storing information about internal and external sources of background information and related content. The registry maps each action defined in the Task Model to the service that "does" that action. E.g., for the action getMap, we specify the Google Maps service, which displays a given location or address on the map. Other services might be internal to the portal; e.g., there could be a service that finds related content by searching for pieces of information with the same tags as the piece of information currently considered.

Figure 6: An example of multidimensional representation of personalization rules

Every service is provided with a WSDL (Web Service Description Language³) description, which specifies the technical details required to execute the service. Invocation of the internal and external services is carried out by the Service Connector.
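The registry lookup itself is a simple action-to-service mapping, sketched below; the registry contents and the placeholder WSDL URLs are illustrative, not real endpoints.

```python
# Action -> service entry; contents are illustrative.
service_registry = {
    "getMap": {"service": "Google Maps",
               "wsdl": "http://example.org/maps.wsdl"},       # placeholder
    "getRelatedContent": {"service": "Portal tag search",
                          "wsdl": "http://example.org/tags.wsdl"},  # internal
}

def resolve(action):
    """Look up the service that 'does' a given Task Model action."""
    entry = service_registry.get(action)
    if entry is None:
        raise KeyError("no service registered for action %r" % action)
    return entry["service"]
```

The Service Connector would then use the WSDL description of the resolved entry to generate the actual invocation code.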

4.6 Calais Web Service Integra-tion

In addition to the UIMA analysis engines, we propose a component that harnesses the Calais Web service for semantic tagging of text documents. Calais is a RESTful service that can receive an HTML, XML, or plain text document as input and return a semantically annotated document in RDF format. The service extracts business-related entities such as company, currency, industry term, technology, etc. It also supports extraction of certain events and facts, such as business relation, bankruptcy, company investments, etc.

³ http://www.w3.org/TR/wsdl


Figure 7: RDF Output

For example, if we provide the Calais service with the following string as input, "IBM reported its Q3 earnings report in Armonk, N.Y.", the returned result will contain a metadata structure with four descriptions: "IBM" as a Company, "IBM reported its Q3 earnings" as a CompanyEarningsAnnouncement, "Armonk" as a City, and "N.Y." as a StateOrProvince. The analysis result is represented in RDF format as shown in fig. 7.

Integration with the Calais Web service considerably alleviates the development task. In contrast to the UIMA analysis engines, the Calais service requires neither a manually annotated corpus of documents for training the analysis engines nor regular expression patterns.
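Once the RDF response has been parsed into entity/type pairs, integrating it into the tagging pipeline amounts to grouping the annotations by type. In the sketch below, the parsing step is omitted; the tuples are transcribed by hand from the fig. 7 example, so this does not reproduce the actual Calais RDF structure or API.

```python
# Entity/type pairs as they would result from parsing the RDF
# response for "IBM reported its Q3 earnings report in Armonk, N.Y."
# (transcribed by hand from the example; the parsing step is omitted).
extracted = [
    ("IBM", "Company"),
    ("IBM reported its Q3 earnings", "CompanyEarningsAnnouncement"),
    ("Armonk", "City"),
    ("N.Y.", "StateOrProvince"),
]

def group_by_type(annotations):
    """Group extracted entities by their semantic type, the form the
    Semantic Tagging Layer consumes."""
    groups = {}
    for text, etype in annotations:
        groups.setdefault(etype, []).append(text)
    return groups

entities = group_by_type(extracted)
```

From here on, the grouped entities can be fed into the same semantic tagging and recommendation layers as the output of the custom UIMA engine.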

5 Conclusion and Future Work

In this paper we have presented our approach to automated, personalized recommendation of background information and related content in portal systems. The main building blocks of our approach are: semantic annotation of content via UIMA or other services like Calais, a concepts-actions model that links semantic tags to the possible actions available for each type of concept, a service registry that allows finding services able to perform a certain action, a user model that contains information about user interests and expertise, and a personalization engine that chooses which content and background information to recommend based on the user's interests and expertise. This approach is an extension of an earlier approach that also allowed for automatic recommendation but had several drawbacks: Recommendations were not user-specific, the linking between concepts and services was static, and we needed a custom-made annotation engine to find the semantic tags. We believe that the approach proposed in this paper overcomes these shortcomings. We are currently extending the existing implementation, which has been done within IBM's WebSphere Portal, to accommodate the extensions. Once the implementation is completed, we intend to run thorough user studies to compare the two approaches. Despite its shortcomings, already the earlier approach was positively evaluated by test users.

Acknowledgements and Trademarks

We would like to thank Thomas Fischer, a student at the Department of Business Informatics at the University of Jena, for his contribution to this paper.

IBM and WebSphere are trademarks of International Business Machines Corporation in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others.

About the Authors

Andreas Nauerz holds BSc. and MSc. degrees in Information Technology and Computer Science. He is currently working at the IBM Laboratories at Boeblingen, Germany, and is, at the same time, a PhD student acting as the head of the Minerva research group at the University of Jena. His research focus lies on the application of adaptation and recommendation technologies to Web Portals by leveraging modern Web 2.0 and Social Computing techniques. Andreas has been co-organizer, committee member, and reviewer for various conferences, workshops, and journals.

Fedor Bakalov received his M.Eng. in Information Management from the Asian Institute of Technology in Thailand in 2007. He is currently working towards his Ph.D. at the University of Jena. His research interests include adaptive hypermedia, mashups, user modeling, and the semantic web.

Birgitta König-Ries holds the Heinz-Nixdorf Endowed Chair for Practical Computer Science at the University of Jena. Her research interests include the next generation of portals, semantic web technologies and their evaluation, and, more generally, the transparent integration of distributed resources.

Martin Welsch is a senior member of the IBM WebSphere Portal development team at the IBM Laboratories in Boeblingen, Germany, and Honorary Professor for Practical Computer Science at the University of Jena. His research and teaching interests include portal web technologies, pervasive computing, and operating systems.

References

[1] Palakorn Achananuparp, Hyoil Han, Olfa Nasraoui, and Roberta Johnson. Semantically enhanced user modeling. In Yookun Cho, Roger L. Wainwright, Hisham Haddad, Sung Y. Shin, and Yong Wan Koo, editors, SAC, pages 1335–1339. ACM, 2007.

[2] A. Gómez-Pérez, M. Fernández-López, and O. Corcho. Ontological Engineering. Advanced Information and Knowledge Processing. Springer, 2003.

[3] Anupriya Ankolekar and Denny Vrandecic. Kalpana - enabling client-side web personalization. In Erik Duval, editor, Proceedings of HT'08 - Hypertext 2008, Pittsburgh, Pennsylvania, June 2008. ACM.

[4] Peter Brusilovsky and Eva Millan. User models for adaptive hypermedia and adaptive educational systems. In Peter Brusilovsky, Alfred Kobsa, and Wolfgang Nejdl, editors, The Adaptive Web: Methods and Strategies of Web Personalization, volume 4321 of Lecture Notes in Computer Science, chapter 1, pages 3–53. Springer, Berlin, Heidelberg, 2007.

[5] Paul A. Chirita, Stefania Costache, Wolfgang Nejdl, and Siegfried Handschuh. P-TAG: Large scale automatic generation of personalized annotation tags for the web. In Carey L. Williamson, Mary Ellen Zurko, Peter F. Patel-Schneider, and Prashant J. Shenoy, editors, Proceedings of the 16th International Conference on World Wide Web, pages 845–854. ACM, 2007.

[6] Corporate Act-Net Consortium. The active database management system manifesto: a rulebase of ADBMS features. SIGMOD Rec., 25(3):40–49, 1996.

[7] Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, Ramanathan V. Guha, Anant Jhingran, Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John A. Tomlin, and Jason Y. Zien. SemTag and Seeker: bootstrapping the semantic web via automated semantic annotation. In WWW, pages 178–186, 2003.

[8] M. Erdmann, A. Maedche, H. Schnurr, and S. Staab. From manual to semi-automatic semantic annotation: About ontology-based text annotation tools, 2000.

[9] Siegfried Handschuh and Steffen Staab. Authoring and annotation of web pages in CREAM. In WWW, pages 462–473, 2002.

[10] Siegfried Handschuh, Steffen Staab, and Raphael Volz. On deep annotation. In WWW, pages 431–438, 2003.

[11] Jose Kahan, Marja-Riitta Koivunen, Eric Prud'hommeaux, and Ralph R. Swick. Annotea: an open RDF infrastructure for shared web annotations. Computer Networks, 39(5):589–608, 2002.

[12] Alenka Kavcic. Fuzzy user modeling for adaptation in educational hypermedia. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 34(4):439–449, Nov. 2004.

[13] Maurice D. Mulvenna, Sarabjot S. Anand, and Alex G. Buchner. Personalization on the net using web mining: introduction. Commun. ACM, 43(8):122–125, 2000.

[14] Andreas Nauerz, Michael Juninger, and Shiwan Zhao. A recommender based on automatic metadata extraction and user-driven collaborative annotating. In ReColl'2008: International Workshop on Recommendation and Collaboration, 2008.

[15] Andreas Nauerz, Stefan Pietschmann, Rene Pietzsch, and Shiwan Zhao. A framework for tag-based adaptation in web portals. In WWW, 2008.

[16] Natalya Fridman Noy, Michael Sintek, Stefan Decker, Monica Crubezy, Ray W. Fergerson, and Mark A. Musen. Creating semantic web contents with Protege-2000. IEEE Intelligent Systems, 16(2):60–71, 2001.

[17] Celia Da Costa Pereira and Andrea Tettamanzi. An evolutionary approach to ontology-based user model acquisition. In Vito Di Gesu, Francesco Masulli, and Alfredo Petrosino, editors, WILF, volume 2955 of Lecture Notes in Computer Science, pages 25–32. Springer, 2003.

[18] Arnaud Sahuguet and Fabien Azavant. Looking at the web through XML glasses. In Conference on Cooperative Information Systems, pages 148–159, 1999.

[19] S. Staab, A. Maedche, and S. Handschuh. An annotation framework for the semantic web, 2001.

[20] James Surowiecki. The Wisdom of Crowds. American Journal of Physics, 75(2):190–192, February 2007.

[21] Hsin-Chang Yang. Bridging the WWW to the semantic web by automatic semantic tagging of web pages. In CIT, pages 238–242, 2005.

[22] L. A. Zadeh. Fuzzy sets. Information and Control, 8(3):338–353, June 1965.

[23] Lei Zhang and Yong Yu. Learning to generate CGs from domain specific sentences. In ICCS, pages 44–57, 2001.
