Molecules of Knowledge - A New Approach to Knowledge Production Management and Consumption


Alma Mater Studiorum

Università di Bologna

II Facoltà di Ingegneria

Corso di Ingegneria Informatica

Laurea Magistrale in Sistemi Distribuiti

Molecules of knowledge: a new approach to knowledge production, management and consumption

Candidate: Stefano Mariani

Supervisor: Prof. Andrea Omicini

Academic Year 2010/2011 - Session II



To Alice, because without her I would be alone,

to my parents, who gave me this opportunity,

to my brother, who would have been better off playing WoW,

to my grandparents, whom I wish were here,

to all my friends, whose providential irony

always reminds me not to take myself too seriously.



Contents

Introduction

1 Background
    1 My vision
    2 The biochemical metaphor
    3 IPTC's news standards
        3.1 NewsML
        3.2 NITF

2 Molecules of knowledge model
    1 Informal introduction to the model
        1.1 About topology
    2 Model abstractions
        2.1 Seeds
        2.2 Atoms
        2.3 Molecules
        2.4 Chemical reactions
        2.5 Catalysts/Inhibitors
    3 The spatial-temporal fabric toward self-adaptation
        3.1 Time
        3.2 Space
        3.3 Self-adaptation
    4 The formal model

3 Model behaviour examples
    1 Seeds generating atoms
    2 Diffusion, decay and positive feedback
    3 Molecules from atoms

Conclusion and further developments

Appendix - Summary in Italian

Bibliography

Acknowledgments



Introduction

Information specialists, namely journalists, are facing new and critical challenges in their knowledge production process: the increasing amount of information to mine, the pace at which it is made available, and the many different formats and paradigms used to represent and think about it, to mention just a few.

A new field is emerging to promote the process: computational journalism.

By developing techniques, methods, and user interfaces for exploring the new

landscape of information, computer scientists can help discover, verify, and

even publish new public-interest stories at lower cost. For computational-

ists and journalists to work together to create a new generation of reporting

methods, each needs an understanding of how the other views data. Jour-

nalists are in fact a special kind of information-seekers, because they look

for the unusual handful of individual items that might point toward a news

story or an emerging narrative thread.

Over the past two years, Sarah Cohen, James T. Hamilton, and Fred Turner have conducted scores of interviews with reporters, editors, computer scientists, information experts, and other domain researchers to identify collaborations and projects that could help reduce the cost and difficulty of news production and knowledge management [1]. Their conversations identified five areas of opportunity:

Combining information from varied digital sources. The capability to put into one repository material that is not easily recovered or searched through existing search engines is currently almost entirely missing: all journalists can actually do is manually mine interesting sites and take notes. This is due to the heterogeneity of the form and format in which each source of information publishes and organizes its contents.

Information extraction. Beat reporters might cover one or more counties, a

subject, an industry, or a group of agencies, hence most of the docu-

ments they obtain would benefit from entity extraction. But effective

use of these tools requires computational knowledge beyond that of

most reporters, documents already organized, recognized, and format-

ted, or an investment in commercial tools typically beyond the reach

of news outlets in non-mission-critical functions.

Document exploration and redundancy. Reporters need to notice information that is not commonly known but that could lead to news in interviews, documents, and other published sources. However, the recent explosion in blogs, aggregated news sites, and special-interest group compilations of information makes distinguishing new stories time consuming and difficult, hence the ability to group documents in interesting ways would immediately reduce the time and effort of reporting.

Audio and video indexing. Unless a third party has already transcribed, closed-captioned, or applied speech-recognition techniques to the recording, most reporters have no way to move to the portion of it that contains what may be of interest. Existing technology is probably adequate for reporters' immediate needs, but as these interviews suggest there aren't simple user interfaces to the technology that would allow unsophisticated users to test the technology on their own recordings.

Extracting data from forms and reports. Much of the information collected

by reporters arrives in two genres: original forms submitted to or cre-

ated by news agencies, often handwritten, and reports generated from

larger systems, sometimes electronically and sometimes on paper. Jour-

nalists have few choices today: retype key documents into a database,

attempt to search recognized images, or simply read them and take

notes. Extracting meaningful information from forms is among the most expensive and time-consuming jobs in large news investigations: its cost sometimes results in abandoning promising stories.



This thesis mainly focuses on the third issue, that is, document exploration and redundancy. The main objective is to provide knowledge prosumers, that is both producers and consumers of knowledge, as journalists typically are, with a brand new model that lets them think about the knowledge lifecycle from a new perspective and shape knowledge and the knowledge production process itself accordingly.

Although the work done in this thesis is tailored to the application domain of journalism, so that most of the time knowledge actually means journalistic news, most of its ingredients and ideas are easily reproducible in other areas, namely wherever a self-organising knowledge management system is needed [2]. Moreover, the model conceived here can be easily extended to deal with each of the issues highlighted above: some of them are assumed to be already addressed, and others can be covered as will be mentioned throughout the thesis.

The remainder of the thesis is organized as follows:

Chapter 1 introduces the background needed to better understand the model, namely the biochemical metaphor for distributed coordination systems and the IPTC NewsML and NITF journalistic standards, which represent news content, structure and semantics in a machine-readable format;

Chapter 2 defines the molecules of knowledge model and how it could be used to design a self-organising news management system;

Chapter 3 shows some brief experiments I have carried out to observe how the model behaves;

finally, I draw conclusions about the work done and give guidelines for further investigations.



Chapter 1

Background

The inhumanity of the computer lies in the fact that,

once it is programmed and set running,

it behaves in a perfectly honest way.

- Isaac Asimov -

First of all, I would like to describe to the reader my view of the news lifecycle and how it can be rethought from the new perspective of the biochemical metaphor recently exploited in distributed coordination systems. It then becomes necessary to describe that metaphor, which is what the second section does. Finally, the IPTC NewsML and NITF standards are briefly introduced, since they are the foundations of my molecules of knowledge model.

1 My vision

I wish to depict the overall scenario that the reader is encouraged to imagine in order to fruitfully understand what the following sections and chapters are talking about and what their purpose is. To this end, I think it is better to distinguish three phases in the news lifecycle: production, management and consumption.

Production. Journalists will gain the knowledge they need to create news from different sources of information. Such sources could be either i) external to the self-organising system, such as RSS feed aggregators, broadcasts from news agencies (such as the Italian ANSA), digital articles from online papers, or even the set of posts that belong to the same thread in a blog; or ii) internal, such as the system prosumers' own articles, news and comments/annotations to existing knowledge.

Whatever the nature of a source of knowledge, I will assume that either i) it is already structured or ii) there exists a proper entity, within the system or outside it, able to structure it (for instance an interface agent at the border between the self-organising system and the external sources). By structured information I mean information that has been built, organized and distributed according to some standard, either a general-purpose knowledge representation language such as OWL 2 [3] from the W3C or a more domain-specific one such as the IPTC's standards NewsML and NITF. I will consider the second approach (the two standards mentioned will be described properly in the following).

These structured information sources will be either i) reified within the self-organising system as seeds or ii) managed, again, by a proper entity (namely another interface agent). In both cases I assume that these sources continuously and autonomously inject into the system some atoms of knowledge, which for the moment can be interpreted as autonomous and independent living pieces of knowledge (actually they are single NewsML/NITF tags, as will be described).

The fundamental point is that this injection is not a one-shot operation: it is continuous in time, and its rate can be changed according to the system's state and its desired behaviour. For instance, recently published news could be injected at a higher rate, hence more often in a given interval of time, than older news; or, if the system is overloaded, every source could be slowed down to give the system time to dispose of the backlog, while if it is experiencing a scarcity of new atoms, existing sources could be excited to increase their injection rate.

Moreover, every single injection does not add to the self-organising

system a single atom, but a variable number of identical copies of an

atom, namely its concentration.

This quantitative information models the atom's relevance and usefulness: the higher it is, the higher the importance implicitly attributed to that atom within the system by the system itself, hence the more it is able to influence the system's behaviour. Concentration may be given either by i) the prosumers, if they extract the atoms by themselves (this manual mode is allowed too, for instance when prosumers inject their own articles into the system); or ii) the injector component, according to some well-defined criteria. For instance, a higher concentration could be given i) to atoms extracted from the title or the summary of a news item rather than to those taken from its body, ii) to atoms appearing more times inside the same news source, or even iii) to the most recent news, as done for the injection rate.

Note that when concentration is given manually, the injection rate should be given too (possibly to be self-adjusted later by the system autonomously).
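The production phase can be illustrated with a minimal Python sketch. It is only an assumption about how a seed might behave, not the model's definition: the names Seed, inject and adjust_rate, the atom labels, the capacity threshold and the halving/doubling policy are all hypothetical.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Seed:
    """A structured news source (all names here are illustrative assumptions)."""
    atoms: dict   # atom -> concentration added by each injection
    rate: float   # probability of injecting at each time step


def inject(seed, solution, rng):
    """With probability seed.rate, add every atom with its concentration."""
    if rng.random() >= seed.rate:
        return False
    for atom, concentration in seed.atoms.items():
        solution[atom] += concentration
    return True


def adjust_rate(seed, solution, capacity):
    """Halve the rate when the system is overloaded, double it when atoms are scarce."""
    total = sum(solution.values())
    if total > capacity:
        seed.rate = max(0.05, seed.rate / 2)
    elif total < capacity / 4:
        seed.rate = min(1.0, seed.rate * 2)


rng = random.Random(0)
solution = Counter()
# Atoms taken from a title get a higher concentration than atoms from the body.
seed = Seed(atoms={"title:election": 3, "body:rome": 1}, rate=1.0)
inject(seed, solution, rng)
adjust_rate(seed, solution, capacity=2)   # 4 atoms exceed the capacity, so the seed slows down
```

Here every injection adds several identical copies of each atom, and the rate is re-tuned from the overall system load, which mirrors the two knobs described above: concentration and injection rate.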

Management. The model this thesis wishes to build has to provide the abstractions and metaphors useful to every possible system designed upon it with the aim of helping information specialists manage their knowledge. In particular, such a system could be a self-organising system able to autonomously evolve knowledge according to users' needs, desires and behaviour. For instance, it could relate atoms to one another to shape molecules of knowledge, hence higher-level knowledge items, evolving according to both space and time patterns such as decay and diffusion.

The main tool thanks to which the system evolves (or better, its atoms and molecules, although seeds evolve too) is the chemical-like law, namely a one-shot stochastic transition rule consuming a set of reagents to generate a set of products. These rules necessarily have to be stochastic to give the system as a whole the self-* properties that are highly desirable in open and distributed knowledge-intensive environments [4]. Stochasticity here means that each law has an associated, suitably computed probability according to which it is scheduled for execution, so that even less probable laws can be executed instead of more probable ones.



Such laws could be designed to combine atoms that are somehow related, thereby increasing the knowledge stored within the system by emergence [5]. Their reagents could be the atoms of knowledge, and their products the aforementioned molecules: this way the system could self-produce molecules i) about the same people, ii) covering the same topic, iii) relating chronologically coherent atoms, iv) following some kind of spatial criteria, and so on.

The concentration of the atoms taken as reagents influences execution probability: the higher the concentration of the atoms involved in a certain law compared with that of the atoms satisfying another law's pre-conditions, the higher the probability that the former law will be chosen for execution over the latter (although still stochastically, hence the latter could be executed despite its lower probability).
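A small Python sketch may help to picture this stochastic scheduling. It assumes, purely for illustration, that a law's weight is the product of its reagents' concentrations; the text above only requires that higher concentrations yield higher probabilities, so the weighting formula, the function names and the atom labels are all hypothetical.

```python
import random


def law_weight(law, solution):
    """Assumed weight: product of the reagents' concentrations (0 if one is missing)."""
    weight = 1
    for reagent in law["reagents"]:
        weight *= solution.get(reagent, 0)
    return weight


def choose_law(laws, solution, rng):
    """Pick one enabled law at random, with probability proportional to its weight."""
    weights = [law_weight(law, solution) for law in laws]
    if sum(weights) == 0:
        return None                      # no law has all its reagents available
    return rng.choices(laws, weights=weights, k=1)[0]


rng = random.Random(42)
solution = {"person:a": 9, "person:b": 1, "topic:t": 1}
laws = [
    {"name": "join_a", "reagents": ["person:a", "topic:t"]},   # weight 9
    {"name": "join_b", "reagents": ["person:b", "topic:t"]},   # weight 1
]
chosen = choose_law(laws, solution, rng)
```

With these numbers the first law is expected to be chosen about nine times out of ten, yet the second one still fires occasionally, which is the behaviour the paragraph above asks for.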

Consumption. The creation of new knowledge from existing knowledge by emergence is useless if such knowledge is not made available to potential consumers. To this end, the system should provide users with some mechanism to perceive such knowledge, hence both the single atoms and their aggregations, namely the molecules. This way system prosumers may not only acquire the single pieces of information they were looking for, but also navigate the associations between them, reified as molecules.

A crucial principle to understand when talking about self-organising

systems is that perception actions carried out by users have practical

and observable consequences on the system state and behaviour: as

soon as the system is observed it suddenly changes its shape according

to such observation.

When the prosumer creates, modifies or aggregates existing information, it is easy to detect system changes; but if such information is only retrieved, browsed and/or navigated without any modification, what are these observable consequences and how can they be recognized? What all the aforementioned operations have in common, whether they modify knowledge or not, is that through them users become aware that the considered information exists and implicitly evaluate it as useful/relevant to them. The system is then allowed to interpret all these different kinds of access made to atoms and molecules as positive feedback that increases their concentration: pieces of news handled more times and more often than others are implicitly considered more relevant/useful by the system itself, hence they will gain an increased capability to influence its behaviour.

According to this view, prosumers can be seen as catalysts for the chemical reactions installed in the self-organising system, able to influence its autonomous and stochastic behaviour not only through the nature of their actions but also through the rate at which they are executed.

Note another fundamental principle, dual to the previous one: even the absence of any observation can be interpreted as an action over the system, which as such has to change its state. This is usually called negative feedback: an atom or molecule of knowledge that isn't accessed for a long time does not receive any reinforcement, hence it should slowly fade away following some kind of implicit negative feedback enacted by the system itself to avoid divergence (all the atoms and molecules endlessly increasing towards system saturation).
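The interplay of positive and negative feedback can be sketched in a few lines of Python. This is only one possible reading: the additive boost, the proportional decay and the removal threshold are assumptions chosen for illustration, as are the function and item names.

```python
def reinforce(solution, item, boost=1.0):
    """Positive feedback: every access raises the item's concentration."""
    solution[item] = solution.get(item, 0.0) + boost


def decay(solution, rate=0.1, threshold=0.01):
    """Negative feedback: each step every concentration shrinks by a fixed
    fraction, and items that fall below the threshold fade away entirely."""
    for item in list(solution):
        solution[item] *= 1.0 - rate
        if solution[item] < threshold:
            del solution[item]


solution = {"accessed": 1.0, "ignored": 1.0}
for _ in range(50):
    reinforce(solution, "accessed")   # a prosumer reads this item at every step
    decay(solution)                   # the system applies decay to everything
```

Under these assumptions the item that is never accessed disappears after a few dozen steps, while the accessed one settles just below a finite level (boost * (1 - rate) / rate, here 9) instead of growing without bound, so decay also prevents the saturation mentioned above.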

Now that the reader knows what I had in mind while writing this thesis, it is time to introduce the biochemical metaphor I will rely on.

2 The biochemical metaphor

No matter whether one thinks of natural systems from a specific viewpoint, e.g., as physical systems, chemical systems, biological systems, or social systems, in all of these perspectives one can always recognise the following characteristics: i) a common environmental substrate, which defines ii) the basic laws of nature and the ground on which individuals can live, and above which iii) individuals of different kinds (or species) interact, compete, and combine with each other (respecting the basic laws of nature), so as to serve their own individual needs as well as the sustainability and the evolvability of the overall system.

This is the sort of approach that one should take towards the realisation of long-lasting (ideally eternal) adaptive service ecosystems: conceiving services and data components as individuals in an open ecosystem, in which they interact according to a limited set of eco-laws to serve their own individual purposes while respecting such laws [6] [7].

Within the ecosystem, the level of species is the one in which all system entities - persistent and temporary knowledge/data, contextual information, events and information requests, and of course software service components - are interpreted under the uniform abstract view of being the living things that populate the system. After a bootstrap phase in which the ecosystem is expected to be filled with a non-empty set of individuals, the ecosystem starts living on its own, with the population of individuals evolving in different ways: i) the initial set of individuals is subject to changes (as a reaction to users' actions upon it); ii) service developers and producers inject new individuals into the system (developers insert new services and virtual devices, producers insert data and knowledge); and iii) consumers keep observing the environment for certain individuals (they inject information requests and look for certain data, knowledge, and events).

The environmental level determines the set of fundamental eco-laws re-

sponsible for the way in which individuals interact, compose with others,

aggregate so as to form or spawn new individuals, and decay (ultimately win-

ning or losing the natural selection process intrinsic in the ecosystem). Start-

ing from the unified description of living entities - the information/service

they provide - and from proper matching criteria, such laws basically spec-

ify the likelihood of certain spontaneous evolutions of individuals or groups

of individuals.

Typical patterns that can be driven by such laws may include: temporary data and services decay as long as they are not exploited, until they disappear, and dually, they get reinforced when exploited; data, data requests, and data retrieving services might altogether match, hence spawning data-found events; new services can be created by aggregating existing services whose descriptions strongly match.

The dynamics of the resulting ecosystem is overall determined by having in-

dividuals in the ecosystem act based on their own internal goals, yet being

subject to the eco-laws for their actions, interactions, and survival. The way

eco-laws apply may be affected by the presence and state of other individuals,

hence providing for closing the feedback loop that is a necessary charac-

teristic to enable self-organisation, self-adaptation, and self-management

features.

For instance, a service component that starts consuming too many resources can affect the behaviour of resource provider components, diminishing their availability, and thus preventing the overall system from crashing. Or, in a different case, a service component subject to a very high number of requests can either aggregate new service components of the same class at a different site or simply replicate itself to increase service availability without affecting the quality of service provided.

In any case, the openness of the architecture does not exclude the possibility

of enforcing forms of decentralised human management (the existence of a

self-managing system must not preclude the possibility for humans to pre-

serve the capability of controlling the system). In particular, the injection of

new individuals can be used to modify the way eco-laws affect other individ-

uals and, thus, to somehow control the evolution of the ecosystem dynamics.

Chemical metaphors consider the species of the ecosystem to be sorts of computational atoms/molecules, living in localised solutions, with properties described by some sort of semantic descriptions, intended as the computational counterpart of the description of the bonding properties of physical atoms and molecules. The laws that drive the overall behaviour of the ecosystem are sorts of chemical laws that dictate how chemical reactions and bonding between components take place to realise self-organising patterns and aggregations of components. Moreover, chemical metaphors support forms of external control through sorts of catalyst or reagent components affecting the behaviour of a chemical ecosystem.

But the chemical metaphor alone is not enough, because it does not consider any spatiality-related aspect; a metaphor inspired by biochemistry (combining basic aspects of chemistry with some features of biology) can therefore suitably enhance it to address the development of distributed service ecosystems.

On the one hand, chemistry appears a simple yet powerful framework for

self-organisation since it is based on a very foundational setting of chemical

substances and reactions, and it allows for a well-known fully-computational

description as a continuous-time stochastic system [8]. On the other hand,

when moving from chemistry to biology (hence considering biochemistry)

the notion of space structure enters the picture, and allows us to tackle in a

self-organised way key aspects related to how individuals can spread in the

network topology - a crucial issue for service ecosystems.

Now that I have framed the metaphor within the biochemical world, let us describe in depth the mapping from the three general concepts of species, environmental substrate and laws of nature to their biochemical counterparts, namely reactants, compartments and biochemical laws.

Species as reactants. A chemical system is composed of chemical substances (or reactants): a chemical substance s can be considered as made of a certain molecule m with concentration c floating in a given portion of space, possibly in solution with many other substances s1, ..., sn. Concentration is directly responsible for the rate at which s reacts with other substances, and ultimately determines whether and how it affects the chemical dynamics at all. Substances may be produced, decay, combine with others, act as catalysts, inhibitors, signals, data storage, and so on.

The concept of chemical substance can hence be associated with that of an individual: the molecule kind m is the individual kind, its structure provides all the interface information used to characterise the individual's observable behaviour, while the concentration c is a numerical value representing the activity level of the individual - the higher it is, the more likely this substance will interact with others, and dually, it will become inert as its activity level fades. Accordingly, individuals can be injected into the system and start interacting with others, by changing shape, diffusing, being continuously generated/sustained or decaying.

The environmental substrate as a set of compartments. A chemical system is typically made of a single solution where different substances float around and interact. To make this scenario better fit the shape of distributed computing, the biological concept of compartment is needed. A compartment is a portion of space delimited by a membrane that filters and regulates whether and how chemical substances can cross it. Many compartments can exist in a system, in principle hosting totally different substances and chemical reactions, thus possibly playing different roles in achieving the overall system objective. Compartments can even touch each other so that substances can move from one compartment directly to the other, as in the cells of a tissue.

The concept of compartment can be associated with that of world

location, i.e., an execution context for ecosystem services. A main

example of location is a network host, with touching compartments

modelling direct connection between nodes.

Laws of nature as biochemical laws. In biochemistry there are two basic kinds of events that affect a system's evolution: purely chemical reactions, responsible for changing the concentration of chemical substances, and biomechanical actions, responsible for configuration changes - namely, topological changes or chemical substances moving across membranes.

The first kind of event is well understood and studied even in the context of Computational Systems Biology (CMSB) [9] - starting from the work of Gillespie [10] and followed in languages like the stochastic π-calculus [11]. Such events are ruled by reactions of the kind $X + Y \xrightarrow{r} Z$, meaning that when one molecule X collides with one molecule Y they can interact by creating a new molecule Z (replacing the two original ones), with a likelihood value expressed by the reaction rate r - the actual rate at which that reaction occurs being proportional to r and to the concentrations of X and Y.

The second kind of event, biomechanical ones, is inspired by the work in [12] on extending the mechanism of chemical reactions. The idea is to allow standard chemical reactions to produce - besides chemical substances - also biomechanical actions, which are triggers that can make some substance cross a membrane (hence diffuse to another network node).
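The two kinds of events can be sketched together in Python. This is a simplified illustration, not the formalism of [10] or [12]: it assumes a single reaction of the kind X + Y -> Z whose propensity is r times the number of X and Y molecules, an exponentially distributed waiting time, and a diffusion step that moves one molecule to a randomly chosen touching compartment. All names (react, diffuse, host_a, host_b) are hypothetical.

```python
import random


def react(solution, r, rng):
    """Fire X + Y -> Z once; return the waiting time, or None if disabled."""
    propensity = r * solution.get("X", 0) * solution.get("Y", 0)
    if propensity == 0:
        return None
    waiting_time = rng.expovariate(propensity)   # exponential, mean 1/propensity
    solution["X"] -= 1
    solution["Y"] -= 1
    solution["Z"] = solution.get("Z", 0) + 1
    return waiting_time


def diffuse(compartments, neighbours, source, substance, rng):
    """Biomechanical action: move one unit of a substance across a membrane."""
    if compartments[source].get(substance, 0) == 0 or not neighbours[source]:
        return None
    target = rng.choice(neighbours[source])
    compartments[source][substance] -= 1
    compartments[target][substance] = compartments[target].get(substance, 0) + 1
    return target


rng = random.Random(1)
compartments = {"host_a": {"X": 5, "Y": 3}, "host_b": {}}
neighbours = {"host_a": ["host_b"], "host_b": ["host_a"]}   # touching compartments

elapsed = 0.0
while (dt := react(compartments["host_a"], r=0.5, rng=rng)) is not None:
    elapsed += dt                                           # reactions fire until Y runs out
    diffuse(compartments, neighbours, "host_a", "Z", rng)   # each new Z then crosses the membrane
```

In this run the reaction fires three times (Y is the limiting reagent) and every Z produced ends up in the neighbouring compartment, which shows a chemical reaction and a biomechanical action chained together.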

The reader may have recognized some features of the biochemical metaphor as already mentioned in the previous section, where I described my vision of the model/system. Such correspondences are a first hint of the complete mapping from the general biochemical framework above to my molecules of knowledge model, which will be formalised in Chapter 2.

The next section presents a possible approach to grounding the biochemical metaphor in the journalism application domain, and in particular in its standards and methodologies.

3 IPTC's news standards

The IPTC (International Press Telecommunications Council) [13] is a consortium of the world's major news agencies, news publishers and news industry vendors. It develops and maintains technical standards for improved news exchange that are used by virtually every major news organization in the world (among which the Italian ANSA, the American Thomson Reuters and the British BBC - see [14] for the full list).

One of the objectives for which the IPTC was established is to study techniques, research and developments in telecommunications and to consider how they can best be used to improve the flow of news. The following sections describe two of its main standards, designed to represent, organize and exchange news with the aim of achieving this objective.

3.1 NewsML

NewsML [15] [16] is a media-type agnostic news exchange format standard

to convey not only the core news content, but also data that describes the

content in an abstract way (i.e. metadata), information about how to handle

news in an appropriate way (i.e. news management metadata), information

about the packaging of news themselves, and finally information about the

technical transfer itself.

It provides a set of useful abstractions:

the News Item - it conveys the news content, hence information reporting on what has just happened, providing a preview of what one can expect to happen next, and the corresponding background information. Although this information can be presented in different journalistic styles - article, blog post, report, comment, ... - and through different media types - such as text (articles), photo, graphics, audio or video - this single abstraction is conceived to cover all these cases;

the Concept Item - news is about events, persons, locations, themes and the like, and such information is worth remembering - and referring to - along with the news content in order to better identify, recognize and categorize - namely, manage - it; hence a data structure to collect all this information worth remembering is needed;

the Package Item - it is made to convey a structured set of items. It is not merely a simple wrapper for news or concepts, but has a feature to structure information like a table of contents: a package can have groups of items and the groups themselves can have sub-groups; each group can have references to multiple items, and references can be named, for example "Top 10 news of the week";

the Knowledge Item - it is a container for many concepts, acting like an encyclopaedia. This way a small, medium-sized or even large set of concepts can be distributed to receivers of news items to provide basic knowledge about all the terms the news items refer to.
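To make the relationships between the four abstractions more tangible, here is a deliberately simplified Python sketch. It does not reproduce the NewsML schema: every class, field and method name below is a hypothetical stand-in chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptItem:
    concept_id: str          # hypothetical identifier, not a real NewsML attribute
    name: str
    kind: str                # e.g. "person", "location", "event"


@dataclass
class NewsItem:
    headline: str
    content: str
    media_type: str = "text"
    concept_ids: list = field(default_factory=list)   # concepts the news refers to


@dataclass
class Group:
    name: str                                         # e.g. "Top 10 news of the week"
    items: list = field(default_factory=list)         # referenced items
    subgroups: list = field(default_factory=list)     # nested Group objects


@dataclass
class PackageItem:
    root: Group

    def all_items(self, group=None):
        """Walk the group tree depth-first and collect every referenced item."""
        group = group or self.root
        found = list(group.items)
        for sub in group.subgroups:
            found.extend(self.all_items(sub))
        return found


@dataclass
class KnowledgeItem:
    concepts: dict = field(default_factory=dict)      # concept_id -> ConceptItem

    def explain(self, news):
        """Return the known concepts that a news item refers to."""
        return [self.concepts[c] for c in news.concept_ids if c in self.concepts]


rome = ConceptItem("c1", "Rome", "location")
news = NewsItem("Summit opens in Rome", "Body text of the article.", concept_ids=["c1", "c2"])
package = PackageItem(Group("Top 10 news of the week",
                            subgroups=[Group("Politics", items=[news])]))
knowledge = KnowledgeItem({"c1": rome})
```

In this toy version a package is a tree of named groups, and a knowledge item works as a lookup table that resolves the concepts a news item mentions; concepts it does not know (such as "c2" here) are simply skipped.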

Briefly, the News Item is meant to be as comprehensive a container as possible for a single news article, conveying both metadata tags and inline tags along with the news content. Metadata tags carry all the information regarding the news item as a whole, such as the URI of its online version, the author(s), the publication date and the covered topic(s); inline tags are instead spread throughout the content of the news, both to give it a well-defined structure and to carry all the additional information that may be useful to better understand it and to characterise even a single term inside it.

Having the capability to express and pack together all this information is pretty much useless if there is no agreement upon its meaning. Moreover, it should have a machine-readable representation so that it can be successfully processed and exchanged by automatic tools. This second issue is readily addressed thanks to the eXtensible Markup Language [17], chosen by the IPTC as the first implementation language for its standards (although they could be implemented in any other language). The issue of shared semantics is addressed by the IPTC with a couple of tricks: the aforementioned Concept Item abstraction and the NewsCodes, as described next.

Values for metadata can be controlled or uncontrolled, and it is often desir-

able for metadata values to be controlled, that is restricted to a value or range

of values. One obvious reason for doing so is to convey clear and unambigu-

ous information about content. If a provider needs to inform a customer that

the content is a photograph, what term should be used: photograph, photo,

picture, pic? They might be understood by a human reader, but ad hoc terms

may not be processed reliably by software.
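To make the point concrete, the following is a minimal Python sketch of how software might reduce ad hoc terms to a single controlled value. The scheme alias mediatype and its entries are hypothetical names chosen only for illustration; they are not actual IPTC NewsCodes.

```python
# Hypothetical controlled vocabulary: the "mediatype" alias and its entries
# are illustrative assumptions, not real IPTC NewsCodes.
CONTROLLED_VOCABULARY = {"mediatype:photo", "mediatype:text", "mediatype:video"}

# Ad hoc terms a provider might use, each mapped to one controlled value.
SYNONYMS = {
    "photograph": "mediatype:photo",
    "photo": "mediatype:photo",
    "picture": "mediatype:photo",
    "pic": "mediatype:photo",
}


def to_controlled(term: str) -> str:
    """Return the controlled value for a free-text term, or raise if unknown."""
    key = term.strip().lower()
    if key in CONTROLLED_VOCABULARY:
        return key
    try:
        return SYNONYMS[key]
    except KeyError:
        raise ValueError(f"uncontrolled term: {term!r}") from None
```

With such a mapping, photograph, photo, picture and pic all collapse to the same machine-processable value, while an unknown term is rejected rather than silently accepted.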

To this end the IPTC maintains sets of Controlled Vocabularies (CVs) that

are collectively branded NewsCodes [18]. These represent concepts that de-

scribe and categorise news objects in a consistent manner. By standardising

on NewsCodes, providers can ensure a common understanding of news con-


tent and a greater degree of inter-operability between content from different

providers.

Concept is the generic term used by the IPTC to denote real-world enti-

ties, such as people, organisations and places, and also abstract notions such

as subject categories. Then Concept Items are a model for managing this

information and making it available via CVs, enabling a single piece of news

content to be linked to a network of information resources. Using Concept

Items, both the news and the entities found in them can be easily identified

to make the content more accessible and relevant to people's particular infor-

mation needs. NewsML Concepts are powerful because they bring meaning

to news content in a way that can be understood by humans and processed

by machines. This model aligns with work being done at the W3C and else-

where to realize the Semantic Web [19] vision.

Concept Items, being usable as metadata values, may be either uncontrolled

or controlled. Controlled concepts are managed by an authority (an organ-

isation or company) and are maintained in Controlled Vocabularies. They

are identified by a Concept URI, and their scope is global. Uncontrolled con-

cepts are identified by a literal string; their scope is local to the containing

document. Every concept, whether controlled or uncontrolled, must be iden-

tified, and the identifier used must be unique in its scope. NewsML specifies

that the Concept URI must be a URL and that it should resolve to human-

readable and machine-readable information about the concept.

Just as somehow related News Items can be packed together in a single Package

Item in order to organize them, all the Concept Items useful

to a certain common scope or describing the same entity can be collected

in a single Knowledge Item acting as an ontology that is both human- and machine-

readable.

Describing in detail how each of the four Items above works, along with its full

list of tags, is beyond the scope of this brief introduction and would in any case

be of no use for the remainder of the thesis. Hence I will go a step fur-

ther in the explanation only for the News Item, which in the end is

the actual news, and for the Concept Item, because it is responsible for giving

machine-processable semantics to a news item, a feature upon which I will rely in

my molecules of knowledge model.

The macro structure of a NewsItem is composed of four tags:

is the root element. It wraps anything else, including the other

three tags here listed, and carries some crucial information such as a

unique ID for the document, the XML namespace(s) and the News-

Codes catalog reference(s), used by NewsML interpreters to resolve

Concept Item URIs;

carries the so-called management metadata, hence additional

information about news management such as its area of interest (a kind

of broad-topic), the provider of the news and its publication status

(whether it is usable, suspended or cancelled);

wraps both administrative and descriptive metadata. Both

regard the news content, but while the former is about the source

of the news, its urgency, and the like, the latter is strictly

connected to the content, storing for instance its covered topic(s).

is meant to wrap any media type, although it is better to

physically store only text, leaving other media types, such as audio and

video streams, as external references (NewsML has dedicated wrappers

for photos, audio and video, similar to the NewsItem).

One interesting thing about the content of a NewsItem is that text could

be further tagged using other standards, for instance the NITF described in

the next section.

The ConceptItem is quite similar to the NewsItem because it has the same

and sub-sections. What's new is the

element which is a wrapper for the properties that express it in detail. The

following further tags are used to define a concept:

is the unique identifier of the concept, stored in the form of a

QCode. QCodes consist of two parts, separated by a colon: the first is an


alias (scheme) that can be used to identify the IPTC NewsCode vocab-

ulary involved (for instance ninat stands for newsItem nature, hence

concepts about the nature of a news item); the second part of the QCode is

a reference into the vocabulary, hence one of its entries. Scheme aliases

are resolved by looking in an online Catalog. The reference(s) to cat-

alog(s) are carried at the root level of a NewsML document in the

corresponding tag ;

is the name of the concept in natural language;

and describe the nature of a concept. Both properties

demonstrate the use of the subject, predicate, object triple derived from

RDF [20] to express a named relationship with another concept. The

difference between the two properties in application is that can

only express one kind of relationship: "is a". The current types agreed

by the IPTC and contained in the concept nature CV are:

cpnat:abstract for an abstract concept;

cpnat:person for a person;

cpnat:organisation for any kind of company;

cpnat:geoArea for a geopolitical area of any size;

cpnat:poi for a somehow defined point of interest;

cpnat:object for any object (similar to the NITF purpose, see later on);

cpnat:event for a newsworthy event.

A uses either a @qcode or @literal to additionally describe

other inherent characteristics of a concept in terms of a named rela-

tionship with another concept. Such relationship may be identified in

the @rel attribute by a QCode; in this case a controlled vocabulary of

relationships, either maintained by an organisation such as the IPTC

or custom-defined, would also be required.

allows entering more extensive natural language information,

even with some mark-up if required.
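The QCode mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalog entries point to hypothetical example.org addresses, and building the concept URI by appending the code to the vocabulary's base URI is assumed here rather than taken from the text.

```python
from typing import Dict, NamedTuple


class QCode(NamedTuple):
    scheme: str  # alias of the vocabulary, e.g. "cpnat"
    code: str    # entry inside that vocabulary, e.g. "person"


# Hypothetical catalog mapping scheme aliases to vocabulary base URIs.
# The example.org addresses are placeholders, not real IPTC locations.
CATALOG: Dict[str, str] = {
    "cpnat": "http://example.org/cv/cpnat/",
    "ninat": "http://example.org/cv/ninat/",
}


def parse_qcode(text: str) -> QCode:
    """Split 'alias:code' at the first colon into its two parts."""
    scheme, separator, code = text.partition(":")
    if not separator or not scheme or not code:
        raise ValueError(f"not a QCode: {text!r}")
    return QCode(scheme, code)


def resolve(qcode: QCode, catalog: Dict[str, str] = CATALOG) -> str:
    """Resolve the alias through the catalog (assumed rule: base URI + code)."""
    try:
        base = catalog[qcode.scheme]
    except KeyError:
        raise LookupError(f"unknown scheme alias: {qcode.scheme!r}") from None
    return base + qcode.code
```

For instance, parsing cpnat:person yields the alias cpnat and the entry person, and resolving it against the placeholder catalog gives a single address at which the concept could be described.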

The opportunity that NewsML gives users to shape the concepts they need,

collect them in a KnowledgeItem and use them in their markup, both

for news metadata and for news content, is a great step toward interoperabil-

ity and automatic semantic processing of knowledge. Particularly important


are the and tags along with the @rel attribute: their combi-

nation actually allows a whole ontology to be shaped as related ConceptItems!

Before going on to the NITF standard, I wish to highlight one thing. In the

Introduction I described five areas of opportunity for which computer science

could help journalism and I stated that my work in this thesis would focus

on Document exploration and redundancy by helping journalists to manage

news and find stories. Please notice that other issues such as Combining

information from varied digital sources and Audio and video indexing can

be addressed simply by a widespread adoption of the NewsML standard: it

in fact allows any kind of news source to be structured according to the same

set of tags, hence promoting interoperability among different news sources, and has

dedicated newsItem-like objects to convey any kind of media, be it pictures,

video streams or audio files, thus making indexing less necessary

because the relevant information is carried as metadata.

3.2 NITF

The NITF (News Industry Text Format) [21] uses the eXtensible Markup

Language (XML) to define the content and structure of news articles. It sup-

ports the identification and description of a number of news characteristics,

among which the most notable are:

Who owns the copyright to the item, who may republish it, and who it is

about;

What subjects, organisations, and events it covers;

When it was reported, issued, and revised;

Where it was written, where the action took place, and where it may be

released;

Why it is newsworthy, based on the editor's analysis of the metadata.

From the few examples given for each of the news facets listed above, it is

clear that the NITF is able to express both additional information about the

content of the news and also metadata regarding the news lifecycle. More-

over, it supports most of the usual plain HTML tags for text structuring.


A NITF document is organized according to its main tags:

is the root element of the document, hence carries attributes to

identify the document, its time and date metadata and its category. It

must contain a head and a body;

holds the metadata about the document as a whole, such as its

, the subject covered thanks to tag,

and , its potential area of interest through the tag and a list of items;

is the content of the document and is divided into the three follow-

ing sub-sections;

could contain either metadata useful to be displayed, such as

the author and contributors to the news article, or an abstract/summary

of the paper;

is the actual content of the news, hence it typically contains

text, references to pictures/videos, quotes and every inline tag and

HTML tag supported by the NITF.

is similar to in that they both could contain ad-

ditional information to be displayed. This usually carries a tagline or a

bibliography.

Since NewsML too has the capability to properly manage news-related meta-

data, the NITF somewhat overlaps with it. The best thing to do is to exploit the

NewsML standard to wrap a single news article's content and its metadata

into a properly structured container, that is the along with its

afore-mentioned metadata sub-tags (hence and ).

Then the NITF should be used to enrich the content of the news through its

inline tags, which is something NewsML can't do.

NewsML in fact provides no support for HTML tags to structure a doc-

ument, nor any form of inline tagging to add information to the plain

text, for instance with the purpose to ease the work of any text mining algo-

rithm usable to automatically process the document. In this sense the NITF

and NewsML are complementary standards, hence they perfectly combine to

shape a very comprehensive and coherent framework to manage the whole


news lifecycle: comprehensive because while one cares about news overall

structure, including metadata, the other focusses on their internal meaning

making it unambiguous; coherent because they both exploit the same IPTC

abstractions; for instance, the NITF too makes use of the NewsCodes tax-

onomies.

Here is a list of some of the most used NITF inline tags, which the IPTC calls

semantic units:

wraps personal names of both living and fictitious people. It could

contain the tag if the tagged person appears along with his or her

public role throughout the text. Pay attention when a person's name

is used as a company name or as an object definition, as in

Thomson Reuters and a Picasso painting: in such cases use the proper

tags and ;

typically marks full official titles, such as the correct denotation

of political, commercial, clerical, military, civil appointments but is also

usable for their synonyms and journalistic variants. This tag may

even be used to identify members of a profession (job titles),

family relations like father or wife, as well as other kinds of roles

such as consultant, employer and the like. The tag may

further be used to identify important (named) or indicative (unnamed)

players in recurring news-relevant scenarios, such as elections (the first

candidate), trials (the special prosecutor), accidents (the driver) and

natural catastrophes, business, cultural or sport events;

serves to identify organisational names. An inner tag ()

allows special widely agreed-upon codes to be added, such as codes from the

Standard Industry Classification (SIC) [22] list or even NewsCodes. It

also covers personification of organisations, as in phrases such as the

Government said. Pay attention when a person's name or even a

location is used as an organisation, for instance in phrases like The Nobel

committee decided... or The White House stated that.... Watch out

also for product names such as The new BMW Z4 sports car... which

calls for the proper tag;


identifies geographic locations and significant places. It either

contains mere text or structured information thanks to its possible

inclusions , , , and .

It may also comprise significant man-made structures, such as famous

buildings and constructions, bridges, walls, highways and the

like. As already said, watch out for possible confusion with the

tag and keep in mind to use the proper tag for special cases

such as the Chernobyl catastrophe;

should be limited to newsworthy events or events that carry news

value in the sense of journalism. Factors of news value are for in-

stance significance, proximity, prominence of the involved persons, con-

sequence, unusualness, human interest and timeliness. The possible am-

biguity with the tag has been already described above.

should include named news-relevant world objects as publica-

tions and media types (books, newspapers, CDs, TV series), mass me-

dia channels (TV channels, radio stations), titles of awards and prizes,

names of products and product lines, art objects, animals, ships, build-

ings and so on. It could virtually tag anything that is newsworthy and

that no other tag could wrap. It may seem a bit under-constrained,

but it gives the journalist the opportunity to tag specific-interest terms

even according to a controlled vocabulary. For instance, if the news

talks about cancer, then the journalist (or even a software agent) could

exploit either an ad-hoc or a well agreed upon medical ontology and

tag every interesting term recognized from it, so as to allow semantic rea-

soning over the news content!

tags concrete dates and days of the week, religious and bank hol-

idays, and relative time expressions that can be resolved to a

concrete date such as Christmas Eve and the like.

Thanks to these pre-defined tags and to the opportunity to constrain their

values to some kind of controlled vocabulary, be it from the NewsCodes

or an ad-hoc ontology, the user of the NITF standard has great expressive

power in enriching news content.


Like the NewsML standard, the NITF too can address at least one of

the open issues listed in the Introduction: Information extraction. If a doc-

ument is properly NITF-tagged, then its worth-remembering entities are all

machine-processable items, since every NITF tag has a well-defined mean-

ing and its values too can be formally defined through taxonomies such as

the NewsCodes. Widespread adoption of NewsML and NITF could alone solve

many problems regarding news management and sharing.


Chapter 2

Molecules of knowledge model

I have not failed, I have found a thousand ways

not to build a light bulb

- Thomas Edison -

Now that all the necessary knowledge to deal with the molecules model has

been acquired, I would like to give the reader a brief and informal description

of such a model, highlighting the main entities and their counterparts drawn

from the biochemical metaphor and from the NewsML and NITF standards.

Then, for each of these entities, possible requirements are identified and a

first specification that fulfils them is given. Finally, the formal molecules of

knowledge model is detailed.

1 Informal introduction to the model

At the beginning of the previous Chapter I gave the reader my vision both of

the model to conceive and of a possible self-* system designed upon it. Such a

vision was outlined according to three different phases of a news lifecycle,

namely production, management and consumption. Here I would like to

recall such phases to introduce the main entities of the model, which are

inspired by the biochemical metaphor and grounded in the IPTC's NewsML and

NITF standards.

Production. Assuming that every news source exploited by the system's pro-

sumers is properly structured according to the NewsML and NITF stan-

dards, I will also assume that such sources are reified within the system,

hence in the model too, as seeds, whether they are external or internal.

According to the biochemical metaphor, such seeds can be seen both as

catalysts and as atoms: catalysts because their presence affects the sys-

tem behaviour through their continuous injection of knowledge atoms;

atoms because nothing forbids the system to manipulate them as if they

were pieces of knowledge themselves, rather than news sources. The ex-

istence of seeds is extremely important because atoms may fade, hence

information would be lost forever in their absence. Moreover, reifying news

sources as seeds keeps all the relevant knowledge inside the

model/system, while any kind of interface agent doing the seeds' job would

make such knowledge external, hence dependent on the agents' availability

(over which the system could have no control).

A first fundamental entity of the model is hence the seed. Its counter-

part in the IPTC standards could be the News Item as a whole, since

it represents a single source of knowledge. Moreover some of its poten-

tially worth-to-remember properties could be described by NewsML

tags such as to identify the provider (for instance ANSA), for the date, to describe where it is lo-

cated, for its author and .

Created and injected by the seed, another one of the main model entities

is the atom (of knowledge). Its biochemical counterpart is clear: it is

one of the reagents living in the solution represented by the set of all

the atoms that co-exist in a given chemical compartment. As such, it

will have an associated concentration value, as the chemical metaphor

requires.

Atoms do actually have a clear counterpart in NewsML and NITF stan-

dards: the tag. Tags can in fact be seen as the atoms that altogether

compose the news-substance. Hence it is possible to see living within

the system atoms, atoms, atoms,

atoms and almost every other NewsML/NITF tag.

Management. Now that the system is full of wandering atoms, each gener-

ated by its parent seed at a certain rate, they will end up colliding, ei-

ther randomly or driven by some well-defined mechanism. The outcome

of these inter-atom interactions is the third fundamental entity of

the model: the molecule of knowledge. According to the chemical

metaphor, molecules can be seen as composite substances in which

there aren't many instances of the same atom, that is a single

species of atom with as many individuals as its concentration value,

but many instances of different atoms.

Molecules are spontaneous, stochastic, environment-driven aggre-

gations of atoms, possibly reifying some meaningful similarity between

them, hence adding new knowledge to the system. They are sponta-

neous in that they simply happen as a natural evolution both of the

internal system behaviour and of the prosumers' interactions; stochas-

tic as required by the chemical metaphor grounded in the work of

Gillespie [10], which allows for the emergence of a plethora of self-

something properties, above all self-adaptation; driven by the environ-

ment because, although stochastic, their likelihood of actually taking

place is modulated both by other molecules/atoms living in the com-

partment and by catalysts that could intervene.

The role of driving such aggregations is taken by another fundamen-

tal abstraction of the model: the chemical reaction. The name is

quite self-explanatory about their biochemical inspiration: they are the

transition rules, namely the chemical-like laws, that the chemical en-

gine reified by the system enacts to evolve itself, that is the atoms and

molecules (and even seeds) it stores. Since they are meant to cre-

ate molecules, they must necessarily be spontaneous, stochastic and

environment-driven, exactly as described above (and in the chemical

metaphor section of the previous Chapter).

Both entities could be grounded to the NewsML and NITF standards:

since molecules are bags of atoms they are actually bags of tags,

hopefully somehow related tags; since molecules should hopefully be

meaningful, chemical reactions that generate them should not be com-

pletely blind to the nature of their reagents. In other words they



should not be purely random transitions. The application of such chemical laws

may be influenced by structural relationships among their reagent-

tags, relationships that actually exist in NewsML and NITF: for in-

stance a tag is always inside a tag and

describes metadata regarding a tag.

Moreover, semantic relationships between tag values may be taken

into account too, since both NewsML and NITF give the user the

ability to draw such values from either controlled vocabularies or even

full ontologies.

Consumption. As already said, users of the model/system are prosumers,

hence they also want to consume knowledge rather than solely produce

it. Prosumers should be able to retrieve all the pieces of knowledge

stored within the system, access them to inspect their content and

navigate their relationships in the case they are molecules, combine

them to create their own new knowledge and so on.

Notice that every time a prosumer uses an atom/molecule, such a us-

age action has other effects beyond the actual consequences of the

computation. As already said, such actions can be interpreted by the system's

chemical engine as positive feedback on the relevance/usefulness of an

atom/molecule, hence they should influence the corresponding concen-

tration. Lack of actions is feedback too, this time negative feedback

that should make atoms and molecules decay as time passes.

Due to all these possible side effects both on the system's state and be-

haviour (recall that seeds too can be accessed and manipulated, for

instance their injection rate and concentration), prosumers interacting

with the knowledge can be seen as catalysts/inhibitors, the last main

entity of the model directly drawn from the chemical metaphor. They

won't have any NewsML/NITF counterpart, since they are the journal-

ists using such standards, or even automatic processors (agents) able

to interact with the knowledge stored in the system.
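The positive and negative feedback on concentration described above can be illustrated with a very small Python sketch. Both the additive boost and the exponential decay law are assumptions made here purely for illustration, as are the function and parameter names; the text only requires that usage raises concentration and that lack of usage lets it fade over time.

```python
import math


def reinforce(concentration: float, boost: float = 1.0) -> float:
    """Positive feedback: a usage action raises the concentration.

    The additive boost is an illustrative assumption.
    """
    return concentration + boost


def decay(concentration: float, idle_time: float, decay_rate: float = 0.1) -> float:
    """Negative feedback: concentration fades while nobody uses the item.

    Exponential decay is an illustrative assumption, not the model's law.
    """
    return concentration * math.exp(-decay_rate * idle_time)
```

Under these assumptions an atom that is repeatedly used keeps a high concentration, while one that is ignored tends toward zero without ever becoming negative.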

Summing up, the molecules of knowledge model is designed around the fol-

lowing abstractions:


seeds the news sources;

atoms the NewsML/NITF tags;

molecules possibly meaningful bags of tags;

chemical reactions the reifications of the (possibly useful) rela-

tionships among the tags in a bag of tags;

catalysts/inhibitors the journalists, prosumers of knowledge.
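To give an idea of what a stochastic, environment-driven chemical reaction may look like in practice, here is a minimal Python sketch of one step in the style of Gillespie's direct method. It is an illustration under stated assumptions, not the engine of the model: species names are hypothetical, each reaction lists distinct reagents, and the propensity is assumed to be the rate constant multiplied by the count of every reagent.

```python
import random
from typing import Dict, List, Optional, Tuple

# (reagents, products, rate constant); all species names are hypothetical.
Reaction = Tuple[List[str], List[str], float]


def propensity(reaction: Reaction, state: Dict[str, int]) -> float:
    """Rate constant times the count of each (distinct) reagent."""
    reagents, _products, rate = reaction
    value = rate
    for species in reagents:
        value *= state.get(species, 0)
    return value


def gillespie_step(
    state: Dict[str, int], reactions: List[Reaction], rng: random.Random
) -> Optional[float]:
    """Fire one reaction in place; return the waiting time, or None if none can fire."""
    weights = [propensity(reaction, state) for reaction in reactions]
    total = sum(weights)
    if total <= 0:
        return None
    # The waiting time is exponentially distributed with the total propensity.
    waiting_time = rng.expovariate(total)
    # The reaction is chosen with probability proportional to its propensity.
    reagents, products, _rate = rng.choices(reactions, weights=weights, k=1)[0]
    for species in reagents:
        state[species] -= 1
    for species in products:
        state[species] = state.get(species, 0) + 1
    return waiting_time
```

Because propensities depend on the current counts, the more concentrated the reagents are in a compartment, the more likely the corresponding aggregation is to take place, which is the sense in which reactions are driven by the environment.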

1.1 About topology

Before the next section, in which each of these abstractions is detailed, I wish to

further describe one aspect of the molecules of knowledge model/system that

has only been mentioned so far: distribution.

As the reader may remember, in the first Chapter I stated that the chemical

metaphor alone won't be enough for my model, because it doesn't account

for any kind of spatial aspect to be considered and thus managed. Such a

metaphor was then completed with the concept of chemical compartment

drawn from biology, leading to the biochemical metaphor, able to model and

properly deal with issues related to network topology.

I would like to remark here that such enhancement has not been done merely

to give more expressive power to the model, but that it is strongly encouraged

by the nature of the problem it tries to face, that is knowledge management

in general. In fact, nowadays it is quite utopian to design a knowledge man-

agement system that is not distributed among different computational nodes,

possibly crossing administrative domains and located at different places.

Moreover my elected application domain is journalism, where distribution

plays an essential role too. A possible use case for the molecules of knowl-

edge model could be to help journalists working in a news outlet's news-

room: they will probably have their own personal devices (be they laptops,

tablets or whatever) in which they store their news sources, annotations, self-

produced articles and the like. Then the model with all its abstractions could

be installed on every one of these devices, transforming each of them into a

single chemical compartment, hence with its own seeds, atoms, molecules and

chemical reactions, situated somewhere within the whole network of all the

other chemical compartments, that is all the other journalists (notice that this

will actually be a mobile network).

For these reasons, from now on I will always assume a distributed network

topology to which the molecules of knowledge model is applied, in which every

node is the chemical compartment belonging to a precise prosumer (hence

influenced by a well defined catalyst), in which he/she stores his/her own

seeds, atoms, molecules and chemical reactions.

In Section 3 I will talk about spatial interactions and I will describe how to

exploit distribution thanks to neighborhood relationships between com-

partments and atoms/molecules diffusion mechanism (in truth I will only

mention such relationships, because I will rely on a cited paper).
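A network of compartments with neighbourhood relationships can be pictured with the following Python sketch. It is only a hedged illustration: the class and function names are hypothetical, and the diffusion rule (moving one unit of an atom to a uniformly chosen neighbour) is an assumption made here, not the mechanism of the cited paper.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)
class Compartment:
    """One prosumer's chemical compartment (hypothetical structure)."""
    owner: str
    atoms: Dict[str, int] = field(default_factory=dict)
    # Excluded from repr because neighbours reference each other.
    neighbours: List["Compartment"] = field(default_factory=list, repr=False)


def connect(a: Compartment, b: Compartment) -> None:
    """Make two compartments neighbours of each other."""
    a.neighbours.append(b)
    b.neighbours.append(a)


def diffuse(source: Compartment, atom: str, rng: random.Random) -> bool:
    """Move one unit of an atom to a random neighbour (assumed rule)."""
    if source.atoms.get(atom, 0) <= 0 or not source.neighbours:
        return False
    target = rng.choice(source.neighbours)
    source.atoms[atom] -= 1
    target.atoms[atom] = target.atoms.get(atom, 0) + 1
    return True
```

In this picture each journalist's device is a node, and knowledge spreads only along neighbourhood links, so the overall quantity of an atom in the network is preserved by diffusion.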

2 Model abstractions

In the following sections, each of the model abstractions just highlighted will

be given a set of requirements to satisfy according to the main goal of this

thesis. Along with such needs, possible solutions are described and a first

pseudo-formal specification is given.

2.1 Seeds

Seed requirements can be derived directly from the brief introduction given

at the beginning of the Chapter. Since they are the reification of any news

source that a journalist would like to consider in his/her knowledge port-

folio, they should carry some information about it. Moreover, they are re-

sponsible for the injection of atoms of knowledge, hence they should store

meta-information about this process too.

Focussing on news source identification and description, NewsML and the

NITF standards provide a number of tags that are potentially useful: ,


, etc. are just a few of the many previously mentioned. Some kind

of unique identifier for the news source is undoubtedly necessary too: since

I wish to reuse as many features as possible from the NewsML standard, I will

rely on URIs, which have the advantage of being strongly encouraged by the W3C

for the Semantic Web vision, for instance in its OWL language. Then, this

collection of tags, along with their content, could be the first information to

store in a seed, fulfilling the first requirement.

Regarding the injection mechanism, three essential pieces of information should be re-

membered: i) first of all, the atoms to be spawned (whose internal structure is

detailed in the next section); ii) then, the concentration of every atom to create,

so as to generate the exact quantity of each at every injection step; iii) finally,

the injection rate, to generate each atom at the right frequency/probability.

Putting these observations all together, the following could be a first pseudo-

formal specification of a seed element (I will use a Prolog[23]-like syntax for

its readability):

seed(srcID, srcMeta, [atoms], [concentrations], [rates])

where:

srcID is the URI (or equivalent identifier) of the news source;

srcMeta is the collection of the NewsML tags afore-mentioned;

[atoms] is the list of every single atom to spawn;

[concentrations] is the list of each atom's initial concentration (possibly

different for each of them);

[rates] is the list of the atoms' injection rates (again, possibly different for each

of them).
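The seed specification above can be mirrored by a small Python sketch. The field names follow the pseudo-formal term, while everything else is an assumption made for illustration: atoms are represented as plain strings, a compartment as a dictionary from atom to concentration, and each rate is read as the probability of injecting the atom at a given step.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Seed:
    """Illustrative counterpart of seed(srcID, srcMeta, [atoms], [concentrations], [rates])."""
    src_id: str                  # URI (or equivalent identifier) of the news source
    src_meta: Dict[str, str]     # NewsML tags describing the source
    atoms: List[str]             # atoms to spawn (plain strings here, by assumption)
    concentrations: List[float]  # quantity injected for each atom
    rates: List[float]           # assumed: per-step injection probability in [0, 1]

    def __post_init__(self) -> None:
        if not (len(self.atoms) == len(self.concentrations) == len(self.rates)):
            raise ValueError("atoms, concentrations and rates must have the same length")

    def inject(self, compartment: Dict[str, float], rng: random.Random) -> None:
        """One injection step: each atom is spawned with its own probability."""
        for atom, concentration, rate in zip(self.atoms, self.concentrations, self.rates):
            if rng.random() < rate:
                compartment[atom] = compartment.get(atom, 0.0) + concentration
```

Keeping the three lists aligned by position reflects the original term; a list of (atom, concentration, rate) triples would be an equally valid design choice.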

2.2 Atoms

To shape a single atom of knowledge as well as possible, the main

goal is to balance two competing needs: on one hand it should em-

bed enough knowledge to be useful from both the system's and the prosumers'

point of view; on the other hand the atom is the most primitive piece of

knowledge within the model, hence it should be kept as simple as pos-

sible.

I will try to reach the needed equilibrium taking into account the following

complementary facets:

Granularity of knowledge. While grounding the chemical metaphor into

NewsML and NITF standards, I stated that any of their tags could be

mapped in a single atom, hence following their structure and semantics,

a six-level scale for the granularity of a piece of knowledge could be

identified:

1. the single NITF tag (finest granularity);

2. a descriptive or administrative wrapper;

3. the , or wrappers;

4. the whole ;

5. a single tag within the of a ;

6. the whole container (coarsest granularity).

Notice that having a single abstraction able to cover all these

different quantities of information may seem to overlap with the

molecule abstraction, making it useless. This is actually not the case, be-

cause molecules are a completely different concept: an atom may be as

comprehensive as needed but will always be a single indivisible unit

of information; a molecule instead is the reification of a number of rela-

tionships between different atoms, possibly coming from different seeds.

Context of knowledge. Any piece of knowledge could be misleading if taken

out of its context, because the context is the set of the environmental

conditions needed to correctly interpret it. In other words, context

gives or at least enriches semantics of a piece of knowledge, allowing in

the end for a better/correct understanding of it.

Thus it will undoubtedly be useful to embed a certain degree of se-

mantic description in an atom, rather than its content alone. Here

the NewsML and NITF standards come in handy with a few features:

i) being standards their tags have a well-defined meaning, ii) since they


are implemented in XML they are highly interoperable and easily ex-

changeable, iii) tag values too may have a formal semantics thanks

to NewsCodes or external ontologies (coded as Knowledge Items).

For these reasons a first enrichment to an atom's content could be to

also store the related NewsML/NITF tag that wraps it, but this alone

isn't enough.

It has already been explained how NITF tags can suffer from some

kind of ambiguity about their usage, but even more problems could be

faced. Let's think about the following phrases: "Mr. Marchionne is CEO

of FIAT" and "FIAT has provided a thousand new job opportunities". In

both cases FIAT should be tagged with the tag, but while

in the first case it covers the role of the object, namely answering the

question: Mr. Marchionne is CEO of What?, in the second it is the

subject, hence the Who.

Hence it could be useful to explicitly say which one of the famous five

Ws of journalism the current tag is describing, that is whether it's about the

Who, What, Where, When or Why. That's another useful piece of information

to store in an atom.

That is not all. Since NewsML and NITF tag values can be

drawn from controlled vocabularies or even ontologies, their meaning is

asserted unambiguously once and for all by these taxonomies. Hence,

I could inject in an atom some information to identify them, namely

the QCode and catalogue: both are logical names that together address

a web page (or even a local file if their scope is local within the user

company) in which the schema is formally defined as in machine- as in

human- readable form.

Relevance/Usefulness of knowledge. A defining property of a news item is
its relevance, that is, how interesting it is perceived to be both by
the professionals who manage it and by the target audience to whom it
is directed. Moreover, every news item has some kind of usefulness,
measured according to some criteria: for instance, the amount of new
knowledge acquired by a reader, or even the economic revenue it could
generate. These are somehow two sides of the same coin: just as more
relevant news is expected to be more useful to readers and journalists,
useful news may spread through readers and publishers and so gain
relevance.

Since atoms carry some piece of information extracted from a news item,
it is quite natural to distribute the relevance/usefulness of the
original source of knowledge as a whole among the (possibly) many atoms
extracted from it.

Another defining property of a news item is, as the word itself
suggests, its novelty: both how new the knowledge it provides is with
respect to the current environment, and how new it is with respect to
the passing of time. Obviously, as news items grow older they lose
relevance and public interest, following a graceful degradation process.
As done before for relevance/usefulness, this time-dependency property
can easily be transferred to the atoms of knowledge: the less they are
shared and used by cooperating journalists, the more they lose their
cultural/economic value.

Since these three facets of a news item, namely relevance, usefulness
and novelty, so deeply influence one another, they can all be modeled
with a single abstraction: the concentration.

In fact, from the biochemical metaphor it is known that the
concentration of an atom or molecule is a measure of its activity level,
namely how much it could and should influence the overall chemical
behaviour of the solution (the system). Since such concentration is
subject to a time-dependent fading mechanism, namely the decay of atoms
and molecules, relevance/usefulness maps naturally onto concentration.
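As a purely illustrative sketch of this fading idea: the text does not fix any decay law, so the exponential form, the `half_life` and `boost` parameters, and the function names `decay` and `reinforce` are all assumptions introduced here, not part of the model.

```python
import math

def decay(concentration: float, elapsed: float, half_life: float = 24.0) -> float:
    """Fade a concentration over `elapsed` time units (assumed exponential law)."""
    return concentration * math.pow(0.5, elapsed / half_life)

def reinforce(concentration: float, uses: int, boost: float = 1.0) -> float:
    """Raise a concentration for each share/use of the item (hypothetical rule)."""
    return concentration + uses * boost

fresh = 8.0
ignored = decay(fresh, elapsed=48.0)             # two half-lives, never used
popular = reinforce(decay(fresh, 48.0), uses=5)  # same age, but used five times
print(ignored, popular)
```

Under these assumptions an item that nobody uses drifts toward zero, while one that keeps being shared holds its value, which is the qualitative behaviour the concentration abstraction is meant to capture.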

Summing up, an atom of knowledge should not carry only the content of a
(piece of) news, that is, the tag along with the tagged term or phrase,
because its semantics might then remain unclear. I have identified two
other pieces of knowledge that are worth recording and help convey
semantics: i) one of the 5 Ws and ii) the QCode and catalogue
information. Moreover, concentration too should be made explicit, so as
to model the atom's relevance/usefulness (and novelty too). As a last
bit of information, since atoms are automatically injected by their own
parent seed, it could be useful to carry some data from that seed over
to the atom.

Here is a possible syntax for an atom:

atom(srcID, info(tag, content), meta(w, qcode, catalogue), concentration)

where:

srcID is taken from the source seed;

info(tag, content) is the actual piece of news the atom conveys, hence some
content (from the whole paper down to a single term in it) along
with its tag;

meta(w, qcode, catalogue) is the additional information that helps clarify the
atom's semantics, thus one of the 5 Ws plus the QCode and catalogue
information grounded in the NewsML/NITF standards;

concentration is the actual activity level of the atom. Notice that this value
necessarily coincides with the one specified in the source seed only at
injection time: later on it evolves according to the system behaviour.
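The atom syntax above could be mirrored by a small data structure. The following is only a sketch: the class names, the Python field names, and every concrete value (tag name, QCode, catalogue, source identifiers) are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Info:
    tag: str       # NewsML/NITF tag wrapping the content
    content: str   # tagged term, phrase, or whole document

@dataclass(frozen=True)
class Meta:
    w: str          # one of "who", "what", "where", "when", "why"
    qcode: str      # logical name of the value within its vocabulary
    catalogue: str  # logical name of the vocabulary itself

@dataclass
class Atom:
    src_id: str           # taken from the source seed
    info: Info
    meta: Meta
    concentration: float  # equals the seed's value at injection, then evolves

# Placeholder values only: "org", "ex:fiat" and "ex-catalogue" are made up.
fiat_as_object = Atom("news-001", Info("org", "FIAT"),
                      Meta("what", "ex:fiat", "ex-catalogue"), 1.0)
fiat_as_subject = Atom("news-002", Info("org", "FIAT"),
                       Meta("who", "ex:fiat", "ex-catalogue"), 1.0)

# Same tag and content, yet the 5W field tells the two roles apart.
print(fiat_as_object.info == fiat_as_subject.info,
      fiat_as_object.meta.w, fiat_as_subject.meta.w)
```

The two example atoms replay the FIAT case discussed earlier: `info` alone cannot distinguish them, while `meta.w` can.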

2.3 Molecules

Molecules of knowledge may seem the most complex abstraction to deal
with, because in the end all the others are built around them. In fact,
chemical reactions consume seed-generated atoms to forge molecules,
creating new knowledge within the system, while catalysts inspect them
to acquire knowledge.

In truth, a very simple interpretation of what a molecule is can be
given, provided that chemical reactions, to which molecules are closely
related and on which they depend, are properly shaped. How? Here follows
my explanation.

Since molecules of knowledge are reifications of interactions among
different pieces of news, they are full of implicit semantics about such
interactions. Moreover, molecules are ideally composed in pursuit of some
goal and according to some criteria: for instance, the chemical engine
could try to aggregate atoms that are similar by topic, for geographical
reasons, or because they are chronologically ordered. The implicit
meaning that a certain molecule carries is then given by the particular
chain of chemical reactions that shaped it over time.

Thanks to negative feedback, there is no need to teach the system how to
build only useful aggregations and how to detect and discard meaningless
ones: the latter will simply fade away through an emergent process of
natural selection, driven both by the system's internal behaviour and by
external prosumers' interactions. There is therefore no reason to state
explicitly either why a certain molecule has been generated or how its
atoms are related to one another. In other words, the semantics of the
aforementioned aggregations can remain implicit: if relationships are
relevant/useful, they will survive because a number of prosumers see
some meaning in them; otherwise, if nobody finds them interesting, such
molecules will simply decay until they disappear.

For these reasons, the simple interpretation I am talking about is that
a molecule of knowledge can be viewed as a bag of atoms, hence a single
unordered set of somehow related atoms. According to this
interpretation, a molecule can simply be shaped as follows:

molecule([atoms], concentration)

where:

[atoms] is the list of all the atoms currently bonded together by the
molecule, hence the pool of related pieces of knowledge that a certain
chain of reactions has aggregated during the natural evolution of the
system;

concentration is the actual concentration of the molecule.

Please notice that an atom inside the [atoms] list does not have exactly
the same internal structure as a standalone atom. Since it is now part
of a greater aggregation, its own concentration is no longer meaningful,
because the molecule has its own; hence it is removed from the atom's
syntax.

Thus, the complete structure of a molecule (omitting the full list of
atoms for brevity) should be as follows:

molecule([atom(srcID, info(tag, content), meta(w, qcode, catalogue)), ...],

concentration)
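A minimal sketch of this structure, assuming the "unordered set" reading of a molecule (so a `frozenset` is used and duplicate atoms collapse). The names `BoundAtom` and `bind`, as well as all tag, QCode and catalogue values, are hypothetical.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    src_id: str
    info: tuple            # (tag, content)
    meta: tuple            # (w, qcode, catalogue)
    concentration: float

class BoundAtom(NamedTuple):
    """An atom inside a molecule: same fields, minus concentration."""
    src_id: str
    info: tuple
    meta: tuple

class Molecule(NamedTuple):
    atoms: frozenset       # unordered collection of BoundAtom
    concentration: float   # the only concentration that still matters

def bind(atom: Atom) -> BoundAtom:
    """Drop the standalone concentration when an atom joins a molecule."""
    return BoundAtom(atom.src_id, atom.info, atom.meta)

a = Atom("news-001", ("tag-a", "Termini Imerese"), ("where", "ex:q1", "ex-cat"), 3.0)
b = Atom("news-002", ("tag-b", "FIAT"), ("who", "ex:q2", "ex-cat"), 5.0)
m = Molecule(frozenset({bind(a), bind(b)}), concentration=1.0)
print(len(m.atoms), m.concentration)
```

Keeping a separate type for bound atoms makes the "concentration is removed" rule explicit instead of leaving a stale value inside the molecule.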

2.4 Chemical reactions

In the previous section, which gave an informal introduction to the
model's abstractions, I stated a couple of interesting things about
chemical reactions. First of all, they are responsible for the
consumption of atoms and the production of molecules, but this is quite
obvious. What is not so obvious is how molecules are produced and atoms
are consumed, that is, which criteria are used to bind atoms together in
a molecule and which mechanisms actually do so. I am now going to recall
these points.

First of all, since most NewsML and NITF tags have well-defined
dependency relationships, a chemical law could exploit them to pack some
kind of NewsML/NITF-compliant molecule. For instance, the self-* system
built upon this model could decide to pack together all the tags (along
with their content) nested in a given enclosing tag. This could happen
because they are frequently accessed together, so the system tries to
reduce search latency: without the molecule, every single atom has to be
retrieved separately; with the molecule, this is done in one shot by
looking directly for it.

Moreover, virtually every NewsML/NITF tag can have its admissible values
collected, stored and formally defined by a controlled vocabulary or an
ontology, hence semantic relationships too can be exploited by chemical
reactions. When semantics enters the field of computation and
interaction, a plethora of interesting and meaningful behaviours opens
up for exploration. For instance, the chemical engine may browse the
source taxonomies of tag values to: i) discover whether two different
terms are synonyms, hypernyms and the like, and then decide to aggregate
the corresponding atoms in a thesaurus molecule; or ii) navigate
relationships among different concepts of the same ontology and reify
such links, for example understanding that the Minister of Defense is a
member of the Government, and thus on the staff of the Prime Minister,
and reifying this reasoning by putting them both in a taxonomy molecule.




Finally, the most obvious relationship between atoms must not be
omitted: if they carry the same content they are undoubtedly related
(such a relationship may be trivial, hence useless, but it exists
anyway). By content I mean here the true content, hence only the tagged
term or phrase, without considering the tag. This makes it possible to
relate different atoms (thus possibly different news sources) in which
the same thing is tagged differently, for instance when news A says
"Termini Imerese is in trouble" and news B says "employees are occupying
Termini Imerese factory": the first occurrence of Termini Imerese would
probably get an organization-like tag, because the term is used in place
of FIAT's Termini Imerese factory, while the second would get a
location-like tag, because Termini Imerese is really a city.

Summing up, a first collection of patterns to join atoms into molecules
could be based upon:

the tag field inside the info(tag, content) term of an atom, in the case
of a structural relationship between different NewsML/NITF tags;

the whole meta(w, qcode, catalogue) term, if the relationship is
semantic;

the content alone inside the info(tag, content) term of an atom, whenever
a subject-based link has to be reified into a molecule.
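The three patterns can be read as three predicates over pairs of atoms. The sketch below is only one possible rendering: the lookup tables `TAG_PARENT` and `RELATED_CODES` are hypothetical stand-ins for the NewsML/NITF schema and for a thesaurus or ontology, and all tag names and QCodes are invented.

```python
# Atoms are plain tuples here: (src_id, (tag, content), (w, qcode, catalogue)).

# Hypothetical tables standing in for the tag schema and for a thesaurus.
TAG_PARENT = {"child-tag": "parent-tag"}
RELATED_CODES = {frozenset({("ex-cat", "q:minister"), ("ex-cat", "q:government")})}

def structurally_related(a, b) -> bool:
    """Pattern 1: the two tags depend on each other in the (assumed) schema."""
    tag_a, tag_b = a[1][0], b[1][0]
    return TAG_PARENT.get(tag_a) == tag_b or TAG_PARENT.get(tag_b) == tag_a

def semantically_related(a, b) -> bool:
    """Pattern 2: the (catalogue, qcode) pairs are linked in the (assumed) thesaurus."""
    key_a, key_b = (a[2][2], a[2][1]), (b[2][2], b[2][1])
    return frozenset({key_a, key_b}) in RELATED_CODES

def same_subject(a, b) -> bool:
    """Pattern 3: identical content, whatever the tag."""
    return a[1][1] == b[1][1]

news_a = ("A", ("tag-x", "Termini Imerese"), ("who", "q:plant", "ex-cat"))
news_b = ("B", ("tag-y", "Termini Imerese"), ("where", "q:city", "ex-cat"))
print(same_subject(news_a, news_b), structurally_related(news_a, news_b))
```

In the Termini Imerese case only the content-based predicate fires, which is exactly why that pattern is listed separately from the tag-based and meta-based ones.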

I have now answered the first question posed at the beginning, about the
possible criteria upon which molecules are composed. What is left is the
second question: which mechanisms should be used to aggregate atoms into
molecules?

The answer is provided directly by the biochemical metaphor: chemical
reactions are the tool. I am not going to list all the concrete chemical
reactions that could be injected into the system to obtain every
possible instantiation of the patterns described above; I am just going
to define the structure and semantics of a general-purpose chemical law
for each pattern, that is, how many reagents it may have, of which kind,
how they should be similar to one another, what the produced substance
is, and the like. First of all, let us see the common shape that every
chemical law will have.



Following literally the interpretation of molecules as bags of atoms, a
chemical reaction simply takes a list of atoms as input reagents and
produces a single molecule as output product. Both concentrations
involved, of reagents and of products, are a single unit: a single
instance of each input atom is consumed and a single instance of the
output molecule is generated.

But this way molecules cannot take part in a chemical reaction as
reagents, hence they cannot be consumed except by prosumers. This is
undesirable, because molecules are living and evolving entities much
like atoms, so nothing should forbid them from joining one another or
absorbing additional atoms.

Adding this feature, a generic chemical reaction could look like this
(omitting internal fields for the sake of clarity):

( atom | molecule )  --r_join-->  molecule([atoms], concentration++)

where the reagents can be any combination of any number of atoms and
molecules, while the product is exactly one molecule aggregating all the
atoms on the left-hand side. This suggests that reagent molecules are
somehow unpacked to extract their atoms and inject them into the new
molecule. Please remember what was said about the [atoms] list in the
previous section to avoid confusion regarding notation.
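The generic join can be sketched as an operation on a solution that maps each atom or molecule to its concentration. This is an illustration under stated assumptions, not a definitive engine: concentrations are whole units, atoms are arbitrary hashable values, molecules are `frozenset`s of atoms, and the reaction rate is ignored altogether.

```python
from collections import Counter

def join(solution: Counter, reagents: list) -> frozenset:
    """Consume one unit of each reagent and produce one unit of the joined molecule.

    `solution` maps each atom/molecule to its concentration (whole units).
    A molecule is a frozenset of atoms; anything else counts as an atom.
    """
    needed = Counter(reagents)
    if any(solution[r] < n for r, n in needed.items()):
        raise ValueError("every reagent needs enough units available")
    atoms = set()
    for r in reagents:
        solution[r] -= 1                                 # one unit consumed
        atoms |= r if isinstance(r, frozenset) else {r}  # molecules are unpacked
    product = frozenset(atoms)
    solution[product] += 1                               # concentration++
    return product

a, b, c = "atom-a", "atom-b", "atom-c"
ab = frozenset({a, b})
solution = Counter({ab: 2, c: 1})
abc = join(solution, [ab, c])
print(sorted(abc), solution[ab], solution[c], solution[abc])
```

Here the existing molecule `ab` absorbs the atom `c`: one unit of each reagent disappears and one unit of the flattened three-atom molecule appears.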

Now that the most general-purpose chemical-like law has been presented,
it is time to describe its concrete applications to obtain the
aforementioned patterns. As already said, the following are still
general-purpose laws, because they only state which reagents must be
similar to which for the reaction to apply, and similar information.

The first chemical reaction is meant to produce molecules that aggregate
structurally related atoms, based upon the well-defined relationships
among NewsML and NITF tags. Using a prime (') to denote some structural
dependency among tags, and an underscore (_) for each omitted field, such
a chemical reaction could be as follows:

( atom(srcID, info(tag', _), _, 1) | molecule([atom(srcID, info(tag', _), _), ...], 1) )
    --r_structural_join-->
molecule([atoms], concentration++)

This law states that: i) only atoms/molecules all coming from the same
news source can be bound together; ii) the tag fields of such reagents
must have some dependency according to the structural constraints of the
NewsML and NITF standards. Other aspects of the law are inherited from
the general-purpose one already described: for instance, one unit of
concentration is involved, reagents can be any in number, and input
molecules are unpacked.
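The two guards of the structural law can be sketched as a check performed before joining. "Some dependency" is read here, as one possible interpretation only, as "nested in the same enclosing tag"; the `ENCLOSING_TAG` table and all tag names are hypothetical and do not come from the NewsML or NITF schemas.

```python
# Atoms are (src_id, tag, content) triples; the other fields are omitted.

# Hypothetical schema fragment: maps a tag to the tag that encloses it.
ENCLOSING_TAG = {"child-1": "parent", "child-2": "parent", "other": "elsewhere"}

def can_structural_join(atoms: list) -> bool:
    """Check both guards: same source, and tags nested in the same enclosing tag."""
    same_source = len({src for src, _, _ in atoms}) == 1
    parents = {ENCLOSING_TAG.get(tag) for _, tag, _ in atoms}
    same_parent = len(parents) == 1 and None not in parents
    return same_source and same_parent

def structural_join(atoms: list) -> frozenset:
    """Bind the atoms into one molecule, or refuse if a guard fails."""
    if not can_structural_join(atoms):
        raise ValueError("reagents violate the structural-join guards")
    return frozenset(atoms)

x = ("news-1", "child-1", "first value")
y = ("news-1", "child-2", "second value")
z = ("news-2", "child-2", "from another source")
print(can_structural_join([x, y]), can_structural_join([x, z]))
```

The pair `x, y` passes both guards, while `x, z` fails the same-source guard even though the tags are compatible.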

Going on to the second aggregation pattern, I assume that the symbols ()
and () denote some kind of semantic relationship between terms, for
instance according to a thesaurus or ontology involving such terms. This
kind of NewsCodes-based chemical reaction could be shaped as follows:

( atom(_, info(_, content), meta(_, qcode, catalogue), 1)
  | molecule([atom(_, info(_, content), meta(_, qcode, catalogue)), ...], 1) )
    --r_semantical_join-->
molecule([atoms], concentration++)

Such transition rule states that: i) no m
