Molecules of Knowledge - A New Approach to Knowledge Production Management and Consumption


  • 8/2/2019 Molecules of Knowledge - A New Approach to Knowledge Production Management and Consumption


    Alma Mater Studiorum

    Università di Bologna

    Second Faculty of Engineering

    Degree Programme in Computer Engineering

    Master's Degree (Laurea Magistrale) in Distributed Systems

    Molecules of knowledge: a new approach to knowledge production, management and consumption

    Candidate: Stefano Mariani

    Supervisor: Prof. Andrea Omicini

    Academic Year 2010/2011 - Session II


    To Alice, because without her I would be alone,

    to my parents, who gave me this opportunity,

    to my brother, who would have been better off playing WoW,

    to my grandparents, whom I wish were here,

    to all my friends, whose providential irony

    always reminds me not to take myself too seriously.


    Contents

    Introduction

    1 Background

        1 My vision

        2 The biochemical metaphor

        3 IPTC's news standards

            3.1 NewsML

            3.2 NITF

    2 Molecules of knowledge model

        1 Informal introduction to the model

            1.1 About topology

        2 Model abstractions

            2.1 Seeds

            2.2 Atoms

            2.3 Molecules

            2.4 Chemical reactions

            2.5 Catalysts/Inhibitors

        3 The spatial-temporal fabric toward self-adaptation

            3.1 Time

            3.2 Space

            3.3 Self-adaptation

        4 The formal model

    3 Model behaviour examples

        1 Seeds generating atoms

        2 Diffusion, decay and positive feedback

        3 Molecules from atoms

    Conclusion and further developments

    Appendix - Summary in Italian

    Bibliography

    Acknowledgments


    Introduction

    Information specialists, journalists in particular, are facing new and critical challenges in their knowledge production process: the increasing amount of information to mine, the pace at which it is made available, and the many different formats and paradigms used to represent and think about it, to mention just a few.

    A new field is emerging to support this process: computational journalism. By developing techniques, methods, and user interfaces for exploring the new landscape of information, computer scientists can help discover, verify, and even publish new public-interest stories at lower cost. For computationalists and journalists to work together to create a new generation of reporting methods, each needs an understanding of how the other views data. Journalists are in fact a special kind of information seeker, because they look for the unusual handful of individual items that might point toward a news story or an emerging narrative thread.

    Over the past two years, Sarah Cohen, James T. Hamilton, and Fred Turner have conducted scores of interviews with reporters, editors, computer scientists, information experts, and other domain researchers to identify collaborations and projects that could help reduce the cost and difficulty of news production and knowledge management [1]. Their conversations identified five areas of opportunity:

    Combining information from varied digital sources. The capability to put into one repository material that is not easily recovered or searched through existing search engines is still almost entirely missing: all journalists can currently do is manually mine interesting sites and take notes. This is due to the heterogeneity of the form and format in which each source of information publishes and organizes its content.

    Information extraction. Beat reporters might cover one or more counties, a subject, an industry, or a group of agencies, hence most of the documents they obtain would benefit from entity extraction. Effective use of these tools, however, requires either computational knowledge beyond that of most reporters, documents that are already organized, recognized, and formatted, or an investment in commercial tools typically beyond the reach of news outlets for non-mission-critical functions.

    Document exploration and redundancy. Reporters need to notice information that is not commonly known but that could lead to news in interviews, documents, and other published sources. However, the recent explosion in blogs, aggregated news sites, and special-interest group compilations of information makes distinguishing new stories time-consuming and difficult, hence the ability to group documents in interesting ways would immediately reduce the time and effort of reporting.

    Audio and video indexing. Unless a third party has already transcribed, closed-captioned, or applied speech-recognition techniques to a recording, most reporters have no way to jump to the portion of it that contains what may be of interest. Existing technology is probably adequate for reporters' immediate needs but, as these interviews suggest, there are no simple user interfaces that would allow unsophisticated users to test the technology on their own recordings.

    Extracting data from forms and reports. Much of the information collected by reporters arrives in two genres: original forms submitted to or created by news agencies, often handwritten, and reports generated from larger systems, sometimes electronically and sometimes on paper. Journalists have few choices today: retype key documents into a database, attempt to search recognized images, or simply read them and take notes. Extracting meaningful information from forms is among the most expensive and time-consuming jobs in large news investigations: its cost sometimes results in abandoning promising stories.


    This thesis focuses mainly on the third issue, document exploration and redundancy. Its main objective is to provide knowledge prosumers (both producers and consumers of knowledge, as journalists typically are) with a new model, both to think about the knowledge lifecycle from a new perspective and to shape knowledge and the knowledge production process itself accordingly.

    Although the work done in this thesis is tailored to the application domain of journalism, so that most of the time knowledge actually means journalistic news to me, most of its ingredients and ideas are easily reproducible in other areas, namely wherever a self-organising knowledge management system is needed [2]. Moreover, the model conceived here can easily be extended to deal with each of the issues highlighted above: some of them are taken as assumptions, and others can be covered as will be mentioned throughout the thesis.

    The remainder of the thesis is organized as follows:

    Chapter 1 introduces the background information necessary to better understand the model, namely the biochemical metaphor for distributed coordination systems and the IPTC NewsML and NITF journalistic standards for representing news content, structure and semantics in a machine-readable format;

    Chapter 2 defines the molecules of knowledge model and how it could be used to design a self-organising news management system;

    Chapter 3 shows some brief experiments I have carried out to observe how the model behaves;

    finally, I draw conclusions about the work done and give guidelines for further investigation.


    Chapter 1

    Background

    The inhumanity of the computer lies in the fact that,

    once it is programmed and set running,

    it behaves in a perfectly honest way.

    - Isaac Asimov -

    First of all, I would like to describe to the reader my view of the news lifecycle and how it can be rethought under the new perspective of the biochemical metaphor recently exploited in distributed coordination systems. It is then necessary to describe that metaphor, which is what the second section does. Finally, the IPTC NewsML and NITF standards are briefly introduced, since they are the foundations of my molecules of knowledge model.

    1 My vision

    I wish to depict the overall scenario that the reader is encouraged to imagine in order to understand what the following sections and chapters are about and what their purpose is. To this end, I think it is best to distinguish three phases in the news lifecycle: production, management and consumption.

    Production. Journalists will gain the knowledge they need to create news from different sources of information. Such sources could be either i) external to the self-organising system, such as RSS feed aggregators, news agency broadcasts (for instance from the Italian ANSA), digital articles from online papers, or even the set of posts belonging to the same thread in a blog; or ii) internal, such as the system prosumers' own articles, news and comments/annotations on existing knowledge.

    Whatever the nature of a source of knowledge, I will assume that either i) it is already structured or ii) there exists a proper entity, inside or outside the system, able to structure it (for instance an interface agent at the border between the self-organising system and the external sources). By structured information I mean information that has been built, organized and distributed according to some standard, either a general-purpose knowledge representation language such as OWL2 [3] from the W3C or a more domain-specific one such as the IPTC's NewsML and NITF standards. I will consider the second approach (the two standards mentioned will be described properly later on).

    These structured information sources will be either i) reified within the self-organising system as seeds or ii) managed, again, by a proper entity (namely another interface agent). In both cases I assume that these sources continuously and autonomously inject into the system some atoms of knowledge, which for the moment can be interpreted as autonomous and independent living pieces of knowledge (they are actually single NewsML/NITF tags, as will be described).

    The fundamental point is that this injection is not a one-shot operation: it is continuous in time, and its rate can be changed according to the system's state and its desired behaviour. For instance, recently published news could be injected at a higher rate, hence more often in a given interval of time, than older news; if the system is overloaded, every source could be slowed down to give the system time to dispose of the excess; and if the system is experiencing a scarcity of new atoms, existing sources could be excited to increase their injection rate.

    Moreover, every single injection does not add a single atom to the self-organising system, but a variable number of identical copies of an atom, namely its concentration.

    This quantitative information models the atom's relevance and usefulness: the higher it is, the higher the importance implicitly attributed to that atom within the system by the system itself, hence the more capable the atom is of influencing the system's behaviour. Concentration may be assigned either by i) the prosumers, if they extract the atoms themselves (this manual mode is allowed too, for instance when prosumers inject their own articles into the system); or ii) the injector component, according to some well-defined criteria. For instance, higher concentration could be given i) to atoms extracted from the title or the summary of a news item rather than from its body, ii) to atoms appearing more often inside the same news source, or even iii) to the most recent news, as done for the injection rate.

    Note that when concentration is assigned manually, the injection rate should be given too (possibly to be self-adjusted later by the system autonomously).
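The injection mechanism described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the `Seed` class, the `concentration_for` weights (3 for title, 2 for summary, 1 for body) and the atom strings are hypothetical names and values chosen only for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Seed:
    """Hypothetical source of knowledge that periodically injects atoms."""
    atoms: dict                  # atom -> concentration added at each injection
    period: float                # time between injections (1 / injection rate)
    next_injection: float = 0.0  # simulated time of the next injection


def concentration_for(location: str, occurrences: int) -> int:
    """Illustrative criterion (an assumption): atoms from the title or summary
    weigh more than atoms from the body, and repeated atoms weigh more."""
    base = {"title": 3, "summary": 2, "body": 1}[location]
    return base * occurrences


def run(seeds, until: float, step: float = 1.0) -> Counter:
    """Advance simulated time and accumulate the atoms injected by each seed."""
    space = Counter()
    t = 0.0
    while t <= until:
        for seed in seeds:
            # A seed may be due for more than one injection in a single step.
            while seed.next_injection <= t:
                for atom, concentration in seed.atoms.items():
                    space[atom] += concentration
                seed.next_injection += seed.period
        t += step
    return space
```

A recent source would be given a short `period` and an older one a long `period`; slowing down or exciting a source, as described above, amounts to changing its `period` while the system runs.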

    Management. The model this thesis aims to build has to provide the abstractions and metaphors useful to any system designed upon it with the aim of helping information specialists manage their knowledge. In particular, such a system could be a self-organising system able to autonomously evolve knowledge according to users' needs, desires and behaviour. For instance, it could relate atoms to each other to shape molecules of knowledge, that is higher-level knowledge items, evolving according to both spatial and temporal patterns such as decay and diffusion.

    The main tool through which the system evolves (or rather its atoms and molecules, although seeds evolve too) is the chemical-like law, namely a one-shot stochastic transition rule that consumes a set of reagents to generate a set of products. These rules necessarily have to be stochastic to give the system as a whole the self-* properties that are highly desirable in open and distributed knowledge-intensive environments [4]. Stochasticity here means that each law has an associated probability, computed in some way, according to which it is scheduled for execution; in other words, even less probable laws may be executed instead of more probable ones.


    Such laws could be designed to combine atoms that are somehow related, thereby increasing the knowledge stored within the system by emergence [5]. Their reagents could be the atoms of knowledge and their products the aforementioned molecules: this way the system could self-produce molecules i) about the same people, ii) covering the same topic, iii) relating chronologically coherent atoms, iv) following some kind of spatial criterion, and so on.

    The concentration of the atoms taken as reagents influences execution probability: the higher the concentration of the atoms involved in a certain law compared with that of the atoms satisfying another law's preconditions, the higher the probability that the former law will be chosen for execution over the latter (although still stochastically, hence the latter may be executed despite its lower probability).
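As a rough illustration of how concentration could drive the stochastic choice among laws, consider the sketch below. It rests on an assumption the text does not make: that the weight of a law is the product of the concentrations of its reagents. The text only requires that higher concentrations yield a higher probability. All function and law names are hypothetical.

```python
import random


def law_weight(reagents, concentrations):
    """Weight of a law, assumed here to be the product of the concentrations
    of its reagents; a law with a missing reagent gets weight 0."""
    weight = 1
    for atom in reagents:
        weight *= concentrations.get(atom, 0)
    return weight


def selection_probabilities(laws, concentrations):
    """Normalise the law weights into probabilities of being scheduled."""
    weights = {name: law_weight(reagents, concentrations)
               for name, reagents in laws.items()}
    total = sum(weights.values())
    if total == 0:
        return {name: 0.0 for name in laws}
    return {name: weight / total for name, weight in weights.items()}


def choose_law(laws, concentrations, rng):
    """Pick one law at random, biased by probability but never deterministic:
    a less probable law can still be selected."""
    probabilities = selection_probabilities(laws, concentrations)
    if not any(probabilities.values()):
        return None  # no law is applicable
    names = list(probabilities)
    return rng.choices(names, weights=[probabilities[n] for n in names], k=1)[0]
```

With concentrations `{"a": 6, "b": 3, "c": 1, "d": 2}` and two laws consuming `("a", "b")` and `("c", "d")`, the first law has weight 18 and the second weight 2, so the first is chosen roughly nine times out of ten while the second still fires occasionally.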

    Consumption. The creation of new knowledge from existing knowledge by emergence is useless if such knowledge is not made available to potential consumers. To this end, the system should provide users with some mechanism to perceive such knowledge, meaning both the single atoms and their aggregations, namely the molecules. This way system prosumers may not only acquire the single pieces of information they were looking for, but also navigate the associations between them, reified as molecules.

    A crucial principle to understand when talking about self-organising systems is that perception actions carried out by users have practical and observable consequences on the system's state and behaviour: as soon as the system is observed, it changes its shape according to that observation.

    When the prosumer creates, modifies or aggregates existing information, it is easy to detect system changes; but if such information is only retrieved, browsed and/or navigated without any modification, what are these observable consequences and how can they be recognized? What all the aforementioned operations have in common, whether or not they modify knowledge, is that through them users become aware that the information in question exists and implicitly evaluate it as useful/relevant to them. The system is then allowed to interpret all these different kinds of access to atoms and molecules as positive feedback that increases their concentration: pieces of news handled more times and more often than others are implicitly considered more relevant/useful by the system itself, hence they gain an increased capability to influence its behaviour.

    According to this view, prosumers can be seen as catalysts for the chemical reactions installed in the self-organising system, able to influence its autonomous and stochastic behaviour not only through the nature of their actions but also through the rate at which they are executed.

    Note another fundamental principle, dual to the previous one: even the absence of any observation can be interpreted as an action on the system, which as such has to change its state. This is usually called negative feedback: an atom or molecule of knowledge that is not accessed for a long time does not receive any reinforcement, hence it should slowly fade away following some kind of implicit negative feedback enacted by the system itself to avoid divergence (all the atoms and molecules endlessly increasing towards system saturation).
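The interplay between positive and negative feedback can be illustrated with a small sketch. The linear boost and the linear decay used here are assumptions made for the sake of the example, as are the function names; the text does not prescribe any specific feedback function.

```python
def access(concentrations, item, boost=1):
    """Positive feedback: every access raises the concentration of the item."""
    concentrations[item] = concentrations.get(item, 0) + boost


def decay(concentrations, amount=1):
    """Negative feedback: at every tick each item loses concentration;
    items whose concentration reaches zero fade away from the system."""
    for item in list(concentrations):
        concentrations[item] -= amount
        if concentrations[item] <= 0:
            del concentrations[item]
```

Starting from two atoms with concentration 3, if only the first is accessed at every tick (with `boost=2`) while `decay` is applied to both, after three ticks the first atom has concentration 6 and the second has disappeared.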

    Now that the reader knows what I had in mind while writing this thesis, it is time to introduce the biochemical metaphor I will rely on.

    2 The biochemical metaphor

    Natural systems can be viewed from specific viewpoints, e.g. as physical, chemical, biological, or social systems. Whatever the perspective, one can always recognise the following characteristics: i) a common environmental substrate, which defines ii) the basic laws of nature and the ground on which individuals can live, and above which iii) individuals of different kinds (or species) interact, compete, and combine with each other (respecting the basic laws of nature), so as to serve their own individual needs as well as the sustainability and evolvability of the overall system.

    This is the attitude one should adopt towards the realisation of long-lasting (ideally eternal) adaptive service ecosystems: conceiving services and data components as individuals in an open ecosystem, in which they interact according to a limited set of eco-laws to serve their own individual purposes while respecting such laws [6] [7].

    Within the ecosystem, the level of species is the one at which all system entities (persistent and temporary knowledge/data, contextual information, events and information requests, and of course software service components) are interpreted under the uniform abstract view of being the living things that populate the system. After a bootstrap phase in which the ecosystem is expected to be filled with a non-empty set of individuals, the ecosystem starts living on its own, with the population of individuals evolving in different ways: i) the initial set of individuals is subject to changes (as a reaction to users' actions upon it); ii) service developers and producers inject new individuals into the system (developers insert new services and virtual devices, producers insert data and knowledge); and iii) consumers keep observing the environment for certain individuals (they inject information requests and look for certain data, knowledge, and events).

    The environmental level determines the set of fundamental eco-laws responsible for the way in which individuals interact, compose with others, aggregate so as to form or spawn new individuals, and decay (ultimately winning or losing the natural selection process intrinsic to the ecosystem). Starting from the unified description of living entities (the information/service they provide) and from proper matching criteria, such laws basically specify the likelihood of certain spontaneous evolutions of individuals or groups of individuals.

    Typical patterns that can be driven by such laws include: temporary data and services decay for as long as they are not exploited, until they disappear, and dually get reinforced when exploited; data, data requests, and data-retrieving services might match one another, hence spawning data-found events; new services can be created by aggregating existing services whose descriptions strongly match.

    The dynamics of the resulting ecosystem are determined overall by having individuals in the ecosystem act on the basis of their own internal goals, while being subject to the eco-laws for their actions, interactions, and survival. The way eco-laws apply may be affected by the presence and state of other individuals, thus closing the feedback loop that is a necessary characteristic for enabling self-organisation, self-adaptation, and self-management features.

    For instance, a service component that starts consuming too many resources can affect the behaviour of resource provider components, diminishing their availability and thus preventing the overall system from crashing. Or, in a different case, a service component subject to a very high number of requests can either aggregate new service components of the same class at a different site or simply replicate itself to increase service availability without affecting the quality of service provided.

    In any case, the openness of the architecture does not exclude the possibility of enforcing forms of decentralised human management (the existence of a self-managing system must not prevent humans from retaining the capability to control the system). In particular, the injection of new individuals can be used to modify the way eco-laws affect other individuals and, thus, to somehow control the evolution of the ecosystem dynamics.

    Chemical metaphors consider the species of the ecosystem to be sorts of computational atoms/molecules, living in localised solutions, with properties described by some sort of semantic description, intended as the computational counterpart of the description of the bonding properties of physical atoms and molecules. The laws that drive the overall behaviour of the ecosystem are a sort of chemical laws that dictate how chemical reactions and bonding between components take place to realise self-organising patterns and aggregations of components. Moreover, chemical metaphors support forms of external control using a sort of catalyst or reagent components that affect the behaviour of a chemical ecosystem.

    But the chemical metaphor alone is not enough, because it does not consider any spatial aspect; a metaphor inspired by biochemistry (combining basic aspects of chemistry with some features of biology) can therefore suitably enhance it to address the development of distributed service ecosystems.

    On the one hand, chemistry appears to be a simple yet powerful framework for self-organisation, since it is based on a very foundational setting of chemical substances and reactions, and it allows for a well-known, fully computational description as a continuous-time stochastic system [8]. On the other hand, when moving from chemistry to biology (hence considering biochemistry) the notion of spatial structure enters the picture, and allows us to tackle in a self-organised way key aspects related to how individuals can spread across the network topology, a crucial issue for service ecosystems.

    Now that I have framed the metaphor within the biochemical world, let us describe in depth the mapping from the three general concepts of species, environmental substrate and laws of nature to their biochemical counterparts, namely reactants, compartments and biochemical laws.

    Species as reactants. A chemical system is composed of chemical substances (or reactants): a chemical substance s can be considered as made of a certain molecule m with concentration c floating in a given portion of space, possibly in solution with many other substances s1, ..., sn. Concentration is directly responsible for the rate at which s reacts with other substances and, ultimately, for whether and how it affects the chemical dynamics at all. Substances may be produced, decay, combine with others, and act as catalysts, inhibitors, signals, data storage, and so on.

    The concept of chemical substance can hence be associated with that of an individual: the molecule kind m is the individual kind, its structure provides all the interface information used to characterise the individual's observable behaviour, while the concentration c is a numerical value representing the activity level of the individual: the higher it is, the more likely this substance is to interact with others and, dually, it becomes inert as its activity level fades. Accordingly, individuals can be injected into the system and start interacting with others, by changing shape, diffusing, being continuously generated/sustained or decaying.

    The environmental substrate as a set of compartments. A chemical system is typically made of a single solution where different substances float around and interact. To make this scenario a better fit for distributed computing, the biological concept of compartment is needed. A compartment is a portion of space delimited by a membrane that filters and regulates whether and how chemical substances can cross it. Many compartments can exist in a system, in principle hosting totally different substances and chemical reactions, thus possibly playing different roles in achieving the overall system objective. Compartments can even touch each other so that substances can move from one compartment directly to the other, as in the cells of a tissue.

    The concept of compartment can be associated with that of a world location, i.e., an execution context for ecosystem services. A prime example of a location is a network host, with touching compartments modelling a direct connection between nodes.
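To make the mapping between compartments and network hosts more concrete, the following sketch models compartments as nodes holding a solution and a set of touching neighbours. The class and function names are hypothetical, and the membrane is simplified to a single rule (an assumption): a substance can cross only between compartments that touch each other.

```python
from collections import Counter


class Compartment:
    """Hypothetical execution context, e.g. a network host."""

    def __init__(self, name):
        self.name = name
        self.solution = Counter()  # substance -> concentration
        self.neighbours = set()    # names of touching compartments


def connect(a, b):
    """Make two compartments touch each other (a direct link between hosts)."""
    a.neighbours.add(b.name)
    b.neighbours.add(a.name)


def diffuse(src, dst, substance, amount=1):
    """Move up to `amount` units of a substance across the membrane.
    Only touching compartments can exchange substances."""
    if dst.name not in src.neighbours:
        raise ValueError(f"{src.name} and {dst.name} do not touch")
    moved = min(amount, src.solution[substance])
    src.solution[substance] -= moved
    dst.solution[substance] += moved
    return moved
```

A richer membrane could filter by substance kind or apply a rate; here it only checks adjacency, which is enough to show how topology constrains diffusion.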

    Laws of nature as biochemical laws. In biochemistry there are two basic kinds of events that affect a system's evolution: purely chemical reactions, responsible for changing the concentration of chemical substances, and biomechanical actions, responsible for configuration changes, namely topological changes or chemical substances moving across membranes.

    The first kind of event is well understood and studied, even in the context of Computational Systems Biology (CMSB) [9], starting from the work of Gillespie [10] and followed by languages like the stochastic π-calculus [11]. Such events are ruled by reactions of the kind $X + Y \xrightarrow{r} Z$, meaning that when one molecule X collides with one molecule Y they can interact by creating a new molecule Z (replacing the two original ones), with a likelihood expressed by the reaction rate r; the actual rate at which the reaction occurs is proportional to r and to the concentrations of X and Y.

    The second kind of event, the biomechanical one, is inspired by the work in [12], which extends the mechanism of chemical reactions. The idea is to allow standard chemical reactions to produce, besides chemical substances, also biomechanical actions, which are triggers that can make some substance cross a membrane (hence diffuse to another network node).
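The rate law for chemical reactions described above can be sketched with one step of Gillespie's direct method. This is a simplified illustration, not the formal model of this thesis: it assumes that the reagents of a reaction are distinct species, so that the effective rate (propensity) is r multiplied by the concentration of each reagent, and all names are hypothetical.

```python
import math
import random


def propensity(rate, state, reagents):
    """Effective rate of a reaction: proportional to r and to the
    concentration of each (distinct) reagent."""
    a = rate
    for species in reagents:
        a *= state.get(species, 0)
    return a


def gillespie_step(state, reactions, rng):
    """Fire one reaction, chosen with probability proportional to its
    propensity, and return the elapsed time (None if nothing can fire).
    Each reaction is a tuple (rate, reagents, products)."""
    propensities = [propensity(rate, state, reagents)
                    for rate, reagents, _ in reactions]
    total = sum(propensities)
    if total == 0:
        return None
    # Exponentially distributed waiting time with parameter `total`.
    dt = -math.log(1.0 - rng.random()) / total
    index = rng.choices(range(len(reactions)), weights=propensities, k=1)[0]
    _, reagents, products = reactions[index]
    for species in reagents:
        state[species] -= 1
    for species in products:
        state[species] = state.get(species, 0) + 1
    return dt
```

For the single reaction X + Y → Z with r = 0.5 and a state of 5 X and 3 Y, the initial propensity is 0.5 · 5 · 3 = 7.5; repeated steps consume one X and one Y each time until Y runs out, leaving 2 X, 0 Y and 3 Z.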

    The reader may have recognized some features of the biochemical metaphor already mentioned in the previous section, where I described my vision of the model/system. Such correspondences are a first hint at the complete mapping from the general biochemical framework above to my molecules of knowledge model, which will be formalised in Chapter 2.

    The next section gives a possible approach to grounding the biochemical metaphor in the journalism application domain, and in particular in its standards and methodologies.

    3 IPTC's news standards

    The IPTC (International Press Telecommunications Council) [13] is a consortium of the world's major news agencies, news publishers and news industry vendors. It develops and maintains technical standards for improved news exchange that are used by virtually every major news organization in the world (among which the Italian ANSA, the American Thomson Reuters and the British BBC; see [14] for the full list).

    One of the objectives for which the IPTC was established is to study tech-

    niques, research and developments in telecommunications and to consider

    how they can best be used to improve the flow of news. The following sections

    describe two of its main standards, designed to represent, organize and

    exchange news with the aim of achieving this objective.

    3.1 NewsML

    NewsML [15] [16] is a media-type agnostic news exchange format standard

    to convey not only the core news content, but also data that describes the

    content in an abstract way (i.e. metadata), information about how to handle

    news in an appropriate way (i.e. news management metadata), information

    about the packaging of news themselves, and finally information about the

    technical transfer itself.

    It provides a set of useful abstractions:

    the News Item - it conveys the news content, that is, information reporting

    on what has just happened, a preview of what one

    can expect to happen next, and the corresponding background information.

    Although this information can be presented in different journalistic

    styles (article, blog post, report, comment, ...) and through different

    media types (text, photo, graphics, audio or video), this single

    abstraction is conceived to cover all these cases;

    the Concept Item - news is about events, persons, locations,

    themes and the like, and such information is worth remembering

    and referring to along with the news content, so as to better identify,

    recognize and categorize (namely, manage) it; hence a data structure that collects all

    this worth-remembering information is needed;

    the Package Item - it is made to convey a structured set of items. It is

    not merely a simple wrapper for news or concepts: it can

    structure information much like a table of contents. A package can have

    groups of items, and the groups themselves can have sub-groups; each group

    can have references to multiple items, and references can be named, for example

    "Top 10 news of the week";

    the Knowledge Item - it is a container for many concepts, acting like

    an encyclopaedia. This way a small, medium size or even large set of

    concepts can be distributed to receivers of news items to provide basic

    knowledge about all the terms the news item refers to.

    Briefly, the News Item is meant to be as comprehensive a

    container as possible for a single news article, conveying both

    metadata tags and inline tags along with the news content. Metadata tags

    carry all the information regarding the news item as a whole, such as its

    online version URI, the author(s), the publication date and the covered topic(s);

    inline tags are instead spread throughout the content of the news, both to

    give it a well-defined structure and to carry all the additional information

    that may be useful to better understand it and to characterise even a single

    term inside it.

    Having the capability to express and pack together all this information is

    pretty much useless if there is no agreement upon its meaning. Moreover, it

    should have a machine-readable representation to be successfully processed

    and exchanged by means of some automatic tool. This second issue is readily

    addressed thanks to the eXtensible Markup Language [17], chosen by the

    IPTC as the first implementation language for its standards (although they

    could be implemented in any other language). The issue of shared semantics

    is addressed by the IPTC with two devices: the aforementioned

    Concept Item abstraction and the NewsCodes. Here follows how.

    Values for metadata can be controlled or uncontrolled, and it is often desirable

    for metadata values to be controlled, that is, restricted to a value or range

    of values. One obvious reason for doing so is to convey clear and unambiguous

    information about content. If a provider needs to inform a customer that

    the content is a photograph, what term should be used: "photograph", "photo",

    "picture", "pic"? They might be understood by a human reader, but ad hoc terms

    may not be processed reliably by software.

    To this end the IPTC maintains sets of Controlled Vocabularies (CVs) that

    are collectively branded NewsCodes [18]. These represent concepts that de-

    scribe and categorise news objects in a consistent manner. By standardising

    on NewsCodes, providers can ensure a common understanding of news con-

    tent and a greater degree of inter-operability between content from different

    providers.
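The benefit of a controlled vocabulary can be sketched in a few lines of Python. The mapping below is hypothetical: it only illustrates how the ad hoc terms of the photograph example could be normalised to a single controlled value. The value "ninat:picture" is assumed here to be the relevant NewsCodes entry, and the function name is an invention of this example.

```python
# Hypothetical mapping from ad hoc terms to one controlled value.
# "ninat:picture" is assumed to be the controlled term for photographs.
AD_HOC_TO_CONTROLLED = {
    "photograph": "ninat:picture",
    "photo": "ninat:picture",
    "picture": "ninat:picture",
    "pic": "ninat:picture",
}


def normalise(term):
    """Return the controlled value for an ad hoc term, or None if it is unknown."""
    return AD_HOC_TO_CONTROLLED.get(term.strip().lower())
```

With such a table, software on the receiving side only needs to understand the controlled value, while unknown terms are reported as None instead of being silently misread.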

    Concept is the generic term used by the IPTC to denote real-world entities,

    such as people, organisations and places, and also abstract notions such

    as subject categories. Concept Items are then a model for managing this

    information and making it available via CVs, enabling a single piece of news

    content to be linked to a network of information resources. Using Concept

    Items, both the news and the entities found in them can be easily identified

    to make the content more accessible and relevant to people's particular information

    needs. NewsML Concepts are powerful because they bring meaning

    to news content in a way that can be understood by humans and processed

    by machines. This model aligns with work being done at the W3C and else-

    where to realize the Semantic Web [19] vision.

    Concept Items, being usable as metadata values, may be either uncontrolled

    or controlled. Controlled concepts are managed by an authority (an organ-

    isation or company) and are maintained in Controlled Vocabularies. They

    are identified by a Concept URI, and their scope is global. Uncontrolled con-

    cepts are identified by a literal string; their scope is local to the containing

    document. Every concept, whether controlled or uncontrolled, must be identified,

    and the identifier used must be unique in its scope. NewsML specifies

    that the Concept URI must be a URL and that it should resolve to human-

    readable and machine-readable information about the concept.

    Just as somehow related News Items can be packed together in a single Package

    Item in order to organize them, all the Concept Items useful

    for a certain common purpose or describing the same entity can be collected

    in a single Knowledge Item, acting as an ontology that is both human- and

    machine-readable.

    Describing in detail how each of the four Items above works, along with its full

    list of tags, is beyond the scope of this brief introduction and is not

    needed for the remainder of the thesis. Hence I will go a step further

    in the explanation only for the News Item, which in the end is

    the actual news, and for the Concept Item, because it is responsible for giving

    machine-processable semantics to a news item, a feature upon which I will rely in

    my molecules of knowledge model.

    The macro structure of a NewsItem is composed of four tags:

    is the root element. It wraps everything else, including the other

    three tags listed here, and carries some crucial information such as a

    unique ID for the document, the XML namespace(s) and the NewsCodes

    catalog reference(s), used by NewsML interpreters to resolve

    Concept Item URIs;

    carries the so-called management metadata, that is, additional

    information about news management such as its area of interest (a kind

    of broad topic), the provider of the news and its publication status

    (whether it is usable, suspended or cancelled);

    wraps both administrative and descriptive metadata. Both

    regard the news content, but while the former is about the source

    of the news, its urgency, and the like, descriptive metadata is strictly

    connected to the content, storing for instance its covered topic(s);

    is meant to wrap any media type, although it is better to

    physically store only text, leaving other media types, such as audio and

    video streams, as external references (NewsML has dedicated wrappers

    for photos, audio and video, similar to the NewsItem).
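The four-part structure described above can be sketched with Python's standard xml.etree.ElementTree module. This is only a simplified illustration: the element names newsItem, itemMeta, contentMeta, contentSet, provider, headline and inlineXML are assumed from NewsML-G2, since the tag names are not legible in the text above, and namespaces, catalog references and most mandatory properties are deliberately left out, so the result is not a schema-valid document. The function name and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_news_item(guid, provider, headline, text):
    """Build a skeletal, non-validated NewsItem-like tree (assumed element names)."""
    # Root element: carries the unique identifier of the document.
    item = ET.Element("newsItem", {"guid": guid})
    # Management metadata, e.g. the provider of the news.
    item_meta = ET.SubElement(item, "itemMeta")
    ET.SubElement(item_meta, "provider", {"literal": provider})
    # Administrative and descriptive metadata about the content.
    content_meta = ET.SubElement(item, "contentMeta")
    ET.SubElement(content_meta, "headline").text = headline
    # The content itself; here only plain text is stored.
    content_set = ET.SubElement(item, "contentSet")
    ET.SubElement(content_set, "inlineXML").text = text
    return item
```

Calling ET.tostring(build_news_item(...), encoding="unicode") serialises the tree, which makes the nesting of the root element and its three children easy to inspect.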

    One interesting thing about the content of a NewsItem is that text can

    be further tagged using other standards, for instance the NITF described in

    the next section.

    The ConceptItem is quite similar to the NewsItem because it has the same

    and sub-sections. What's new is the

    element which is a wrapper for the properties that express it in detail. The

    following further tags are used to define a concept:

    is the unique identifier of the concept, stored in the form of a

    QCode. QCodes consist of two parts, separated by a colon: the first is an

    alias (scheme) that can be used to identify the IPTC NewsCode vocabulary

    involved (for instance "ninat" stands for newsItem nature, hence

    concepts about the nature of a news item); the second part of the QCode is

    a reference into the vocabulary, hence one of its entries. Scheme aliases

    are resolved by looking in an online Catalog. The reference(s) to catalog(s)

    are carried at the root level of a NewsML document in the

    corresponding tag ;

    is the name of the concept in natural language;

    and describe the nature of a concept. Both properties

    demonstrate the use of the subject, predicate, object triple derived from

    RDF [20] to express a named relationship with another concept. The

    difference between the two properties in application is that can

    only express one kind of relationship: is a. The current types agreed

    by the IPTC and contained in the concept nature CV are:

    cpnat:abstract for an abstract concept;

    cpnat:person for a person;

    cpnat:organisation for any kind of company;

    cpnat:geoArea for a geopolitical area of any size;

    cpnat:poi for a somehow defined point of interest;

    cpnat:object for any object (similar to the NITF purpose, see later on);

    cpnat:event for a newsworthy event.

    A uses either a @qcode or @literal to additionally describe

    other inherent characteristics of a concept in terms of a named relationship

    with another concept. Such relationship may be identified in

    the @rel attribute by a QCode; in this case a controlled vocabulary of

    relationships, either maintained by an organisation such as the IPTC

    or custom-defined, would also be required.

    allows entering more extensive natural language information,

    even with some mark-up if required.
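The QCode mechanism described in the list above (a scheme alias, a colon, and a code resolved through a catalog) can be sketched as follows. The catalog is a plain in-memory dictionary used for illustration only, and the two scheme URIs are assumed values rather than the result of an actual catalog lookup; the function name is hypothetical.

```python
# Illustrative catalog: scheme alias -> scheme URI (assumed values).
CATALOG = {
    "ninat": "http://cv.iptc.org/newscodes/ninat/",
    "cpnat": "http://cv.iptc.org/newscodes/cpnat/",
}


def resolve_qcode(qcode, catalog=CATALOG):
    """Split a QCode into alias and code, then expand it into a full concept URI."""
    alias, separator, code = qcode.partition(":")
    if not separator or not alias or not code:
        raise ValueError(f"not a QCode: {qcode!r}")
    if alias not in catalog:
        raise KeyError(f"unknown scheme alias: {alias!r}")
    # The concept URI is the scheme URI followed by the code.
    return catalog[alias] + code
```

For instance, resolve_qcode("cpnat:person") expands the alias "cpnat" through the catalog and appends the code "person". A real interpreter would load the catalog referenced at the root of the NewsML document instead of a hard-coded dictionary.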

    The opportunity given by NewsML to the user to shape their needed con-

    cepts, collect them in a KnowledgeItem and use them in their markup, both

    for news metadata and for news content, is a great step toward interoperabil-

    ity and automatic semantic processing of knowledge. Particularly important

    are the and tags along with the @rel attribute: their combination

    actually makes it possible to shape a whole ontology as related ConceptItems!

    Before going on to the NITF standard, I wish to highlight one thing. In the

    Introduction I described five areas of opportunity in which computer science

    could help journalism, and I stated that my work in this thesis would focus

    on "Document exploration and redundancy" by helping journalists to manage

    news and find stories. Please notice that other issues, such as "Combining

    information from varied digital sources" and "Audio and video indexing", can

    be addressed simply by a widespread adoption of the NewsML standard: it

    makes it possible to structure any kind of news source according to the same

    set of tags, hence promoting interoperability between different news sources, and has

    dedicated newsItem-like objects to convey any kind of media, be it pictures,

    video streams or audio files, thus making indexing less necessary

    because relevant information is carried as metadata.

    3.2 NITF

    The NITF (News Industry Text Format) [21] uses the eXtensible Markup

    Language (XML) to define the content and structure of news articles. It supports

    the identification and description of a number of news characteristics,

    among which the most notable are:

    Who owns the copyright to the item, who may republish it, and who it is

    about;

    What subjects, organisations, and events it covers;

    When it was reported, issued, and revised;

    Where it was written, where the action took place, and where it may be

    released;

    Why it is newsworthy, based on the editor's analysis of the metadata.

    From the few examples given for each of the news facets listed above, it is

    clear that the NITF is able to express both additional information about the

    content of the news and also metadata regarding the news lifecycle. More-

    over, it supports most of the usual plain HTML tags for text structuring.

    A NITF document is organized according to its main tags:

    is the root element of the document, hence carries attributes to

    identify the document, its time and date metadata and its category. It

    must contain a head and a body;

    holds the metadata about the document as a whole, such as its

    , the subject covered thanks to tag,

    and , its potential area of interest through the tag and a list of items;

    is the content of the document and is divided into the three follow-

    ing sub-sections;

    could contain either metadata useful to be displayed, such as

    the author and contributors to the news article, or an abstract/summary

    of the paper;

    is the actual content of the news, hence it typically contains

    text, references to pictures/videos, quotes and every inline tag and

    HTML tag supported by the NITF.

    is similar to in that they both could contain ad-

    ditional information to be displayed. This usually carries a tagline or a

    bibliography.
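The document organisation listed above can be sketched with Python's standard xml.etree.ElementTree module. The element names nitf, head, title, body, body.head, byline, body.content, p, body.end and tagline are assumed from the NITF DTD, since the tag names are not legible in the list; mandatory attributes and most metadata are omitted, so this is a skeleton rather than a valid NITF document. The function name and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_nitf(title, byline, paragraphs, tagline):
    """Build a skeletal, non-validated NITF-like tree (assumed element names)."""
    nitf = ET.Element("nitf")
    # Metadata about the document as a whole.
    head = ET.SubElement(nitf, "head")
    ET.SubElement(head, "title").text = title
    # The body is split into three sub-sections.
    body = ET.SubElement(nitf, "body")
    body_head = ET.SubElement(body, "body.head")
    ET.SubElement(body_head, "byline").text = byline
    body_content = ET.SubElement(body, "body.content")
    for paragraph in paragraphs:
        ET.SubElement(body_content, "p").text = paragraph
    body_end = ET.SubElement(body, "body.end")
    ET.SubElement(body_end, "tagline").text = tagline
    return nitf
```

The sketch keeps displayable metadata (the byline) in the first body sub-section, the article text in the second, and closing information (the tagline) in the third, mirroring the roles described in the list.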

    Since NewsML too has the capability to properly manage news-related metadata,

    the NITF somewhat overlaps with it. The best thing to do is to exploit the

    NewsML standard to wrap a single news article's content and its metadata

    into a properly-structured container, that is the along with its

    aforementioned metadata sub-tags (hence and ).

    Then the NITF should be used to enrich the content of the news through its

    inline tags, which is something NewsML can't do.

    NewsML in fact provides no support for HTML tags to structure a document,

    nor any form of inline tagging to add information to the plain

    text, for instance with the purpose of easing the work of any text mining algorithm

    used to automatically process the document. In this sense the NITF

    and NewsML are complementary standards, hence they perfectly combine to

    shape a very comprehensive and coherent framework to manage the whole

    news lifecycle: comprehensive because while one takes care of the overall

    structure of news, including metadata, the other focuses on its internal meaning,

    making it unambiguous; coherent because they both exploit the same IPTC

    abstractions, for instance the NITF too makes use of the NewsCodes taxonomies.

    Here is a list of some of the most used NITF inline tags, called semantic units

    by the IPTC:

    wraps personal names, of both real and fictitious people. It could

    contain the tag if the tagged person is mentioned along with their

    public role throughout the text. Pay attention when a person's name

    is used as a company name or as an object definition, such as

    "Thomson Reuters" and "a Picasso painting": in such cases use the proper

    tags and ;

    typically marks full official titles, such as the correct denotation

    of political, commercial, clerical, military, civil appointments but is also

    usable for their synonyms and journalistic variants. Such tag may be

    even used to identify members of a profession (job titles) and with

    family relations like father, wife as well as for other kinds of roles

    such as consultant, employer and the like. The tag may

    further be used to identify important (named) or indicative (unnamed)

    players in recurring news-relevant scenarios, such as elections (the first

    candidate), trials (the special prosecutor), accidents (the driver) and

    natural catastrophes, business, cultural or sport events;

    serves to identify organisational names. An inner tag ()

    makes it possible to add special widely agreed-upon codes, such as codes from the

    Standard Industrial Classification (SIC) [22] list or even NewsCodes. It

    also covers personification of organisations, as in phrases such as "the

    Government said". Pay attention when a person's name or even a

    location is used as an organisation, for instance in phrases like "The Nobel

    committee decided..." or "The White House stated that...". Watch out

    also for product names such as "The new BMW Z4 sport car...", which

    calls for the proper tag;

    identifies geographic locations and significant places. It either

    contains mere text or structured information thanks to its possible

    inclusions , , , and .

    It may also comprise significant man-made structures, such as famous

    buildings and constructions, bridges, walls, highways and the

    like. As already said, watch out for possible confusion with the

    tag and keep in mind to use the proper tag for special cases

    such as "the Chernobyl catastrophe";

    should be limited to newsworthy events or events that carry news

    value in the sense of journalism. Factors of news value are for instance

    significance, proximity, prominence of the involved persons, consequence,

    unusualness, human interest and timeliness. The possible ambiguity

    with the tag has been already described above.

    should include named news-relevant world objects such as publications

    and media types (books, newspapers, CDs, TV series), mass media

    channels (TV channels, radio stations), titles of awards and prizes,

    names of products and product lines, art objects, animals, ships, buildings

    and so on. It could virtually tag anything that is newsworthy and

    that no other tag could wrap. It may seem a bit under-constrained,

    but it gives the journalist the opportunity to tag specific-interest terms

    even according to a controlled vocabulary. For instance, if the news

    talks about cancer, then the journalist (or even a software agent) could

    exploit either an ad-hoc or a widely agreed-upon medical ontology and

    tag every interesting term recognized from it, so as to allow semantic reasoning

    over the news content!

    tags concrete dates and days of the week, religious and bank holidays,

    and relative time expressions that may be attributed with a

    concrete date such as "Christmas Eve" and the like.

    Thanks to these pre-defined tags and to the opportunity to constrain their values

    to some kind of controlled vocabulary, be it from the NewsCodes

    or an ad-hoc ontology, the user of the NITF standard has great expressive

    power for enriching news content.
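To give an idea of how inline tags make entities machine-processable, the following Python sketch collects the semantic units found in an NITF-tagged fragment. The tag names person, function, org, location, event, object.title and chron are assumed from the NITF DTD, since they are not legible in the list above; the sample sentence is made up, and the function and constant names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed names of the NITF inline tags ("semantic units") described above.
SEMANTIC_UNITS = {"person", "function", "org", "location", "event", "object.title", "chron"}

# Made-up fragment, used only to exercise the function.
SAMPLE = (
    '<p><person>Mario Rossi</person> spoke in <location>Bologna</location> '
    'on <chron norm="20110607">Tuesday</chron> at the <org>ANSA</org> offices.</p>'
)


def extract_entities(nitf_fragment):
    """Return (tag, text) pairs for every semantic unit, in document order."""
    root = ET.fromstring(nitf_fragment)
    found = []
    for element in root.iter():
        if element.tag in SEMANTIC_UNITS:
            # itertext() also gathers text of nested tags, e.g. a role inside a person.
            found.append((element.tag, "".join(element.itertext()).strip()))
    return found
```

Applied to SAMPLE, the function returns the person, the location, the date expression and the organisation as separate pairs, which is the kind of input a text mining or reasoning component could then work on.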

    Like the NewsML standard, the NITF too can address at least one of

    the open issues listed in the Introduction: "Information extraction". If a document

    is properly NITF-tagged, then its worth-remembering entities are all

    machine-processable items, since every NITF tag has a well-defined meaning

    and their values too can be formally defined through taxonomies such as

    the NewsCodes. A widespread adoption of NewsML and NITF could alone address

    many problems regarding news management and sharing.

    Chapter 2

    Molecules of knowledge model

    I have not failed, I have found a thousand ways

    not to build a light bulb

    - Thomas Edison -

    Now that all the knowledge necessary to deal with the molecules model has

    been acquired, I wish to give the reader a brief and informal description

    of such model, highlighting the main entities and their counterparts drawn

    from the biochemical metaphor and from the NewsML and NITF standards.

    Then, for each of these entities, possible requirements are devised and a

    first specification that fulfils them is given. Finally, the formal molecules of

    knowledge model is detailed.

    1 Informal introduction to the model

    At the beginning of the previous Chapter I gave the reader my vision both of

    the model to conceive and of a possible self-* system designed upon it. Such

    vision was outlined according to three different phases of a news lifecycle,

    namely production, management and consumption. Here I would like to

    recall such phases to introduce the main entities of the model, which are

    inspired by the biochemical metaphor and grounded in the IPTC's NewsML and

    NITF standards.

    Production. Assuming that every news source exploited by the system's prosumers

    is properly structured according to the NewsML and NITF standards,

    I will also assume that such sources are reified within the system,

    hence in the model too, as seeds, whether they are external or internal.

    According to the biochemical metaphor, such seeds can be seen both as

    catalysts and as atoms: catalysts because their presence affects the system

    behaviour through their continuous injection of knowledge atoms;

    atoms because nothing forbids the system to manipulate them as if they

    were pieces of knowledge themselves, rather than news sources. The existence

    of seeds is extremely important because atoms may fade, hence

    information would be lost forever in their absence. Moreover, reifying news

    sources as seeds makes it possible to keep all the relevant knowledge inside the

    model/system, while any kind of interface agent doing the seeds' job would

    make such knowledge external, hence dependent on the agents' availability

    (over which the system could have no control).

    A first fundamental entity of the model is hence the seed. Its counter-

    part in the IPTC standards could be the News Item as a whole, since

    it represents a single source of knowledge. Moreover some of its poten-

    tially worth-to-remember properties could be described by NewsML

    tags such as to identify the provider (for instance ANSA), for the date, to describe where it is lo-

    cated, for its author and .

    Created and injected by the seed, another one of the main model entities

    is the atom (of knowledge). Its biochemical counterpart is clear: it is

    one of the reagents living in the solution represented by the set of all

    the atoms that co-exist in a given chemical compartment. As such, it

    will have an associated concentration value, as the chemical metaphor

    requires.

    Atoms do actually have a clear counterpart in NewsML and NITF stan-

    dards: the tag. Tags can in fact be seen as the atoms that altogether

    compose the news-substance. Hence it is possible to see living within

    the system atoms, atoms, atoms,

    atoms and almost every other NewsML/NITF tag.

    Management. Now that the system is full of wandering atoms, each generated

    by its parent seed at a certain rate, they will end up colliding, either

    randomly or driven by some well-defined mechanism. The outcome

    of these inter-atom interactions is the third fundamental entity of

    the model: the molecule of knowledge. According to the chemical

    metaphor, molecules can be seen as composite substances in which

    there aren't many instances of the same atom, that is, a single

    species of atom with as many individuals as its concentration value,

    but many instances of different atoms.

    Molecules are spontaneous, stochastic, environment-driven aggregations

    of atoms, possibly reifying some meaningful similarity between

    them, hence adding new knowledge to the system. They are spontaneous

    in that they simply happen as a natural evolution both of the

    internal system behaviour and of the prosumers' interactions; stochastic

    as required by the chemical metaphor grounded in the work of

    Gillespie [10], which allows for the emergence of a plethora of self-*

    properties, above all self-adaptation; driven by the environment

    because, although stochastic, their likelihood of actually taking

    place is modulated both by other molecules/atoms living in the compartment

    and by catalysts that could intervene.

    The role of driving such aggregations is taken by another fundamental

    abstraction of the model: the chemical reaction. The name is

    quite self-explanatory about their biochemical inspiration: they are the

    transition rules, namely the chemical-like laws, that the chemical engine

    reified by the system enacts to evolve itself, that is the atoms and

    molecules (and even seeds) it stores. Since they are meant to create

    molecules, they must necessarily be spontaneous, stochastic and

    environment-driven, exactly as described above (and in the chemical

    metaphor section of the previous Chapter).

    Both entities could be grounded to the NewsML and NITF standards:

    since molecules are bags of atoms they are actually bags of tags,

    hopefully somehow related tags; since molecules should hopefully be

    meaningful, chemical reactions that generate them should not be com-

    pletely blind to the nature of their reagents. In other words they

    should not be purely random transitions. The application of such chemical laws

    may be influenced by structural relationships between their reagent-tags,

    relationships that actually exist in NewsML and NITF: for instance

    a tag is always inside a tag and

    describes metadata regarding a tag.

    Moreover, semantic relationships between tags' values may be taken

    into account too, since both NewsML and NITF give the user the

    ability to draw such values from either controlled vocabularies or even

    full ontologies.

    Consumption. As already said, users of the model/system are prosumers,

    hence they want also to consume knowledge rather than solely produce

    it. Prosumers should be able to retrieve all the pieces of knowledge

    stored within the system, access them to inspect their content and

    navigate their relationships in the case they are molecules, combine

    them to create their own new knowledge and so on.

    Notice that every time a prosumer uses an atom/molecule, such usage

    action has other effects beyond the actual consequences of the

    computation. As already said, they can be interpreted by the system's

    chemical engine as positive feedback on the relevance/usefulness of an

    atom/molecule, hence they should influence the corresponding concentration.

    Lack of actions is feedback too, this time negative feedback

    that should make atoms and molecules decay as time passes.

    Due to all these possible side effects both on the system's state and behaviour

    (recall that seeds too can be accessed and manipulated, for

    instance their injection rate & concentration), prosumers interacting

    with the knowledge can be seen as catalysts/inhibitors, the last main

    entity of the model directly drawn from the chemical metaphor. They

    won't have any NewsML/NITF counterpart, since they are the journalists

    using such standards, or even automatic processors (agents) able

    to interact with the knowledge stored in the system.
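The feedback loop described in the Consumption phase can be sketched as two small Python functions acting on a table of concentrations. This is a deliberately deterministic simplification, while the model itself is meant to be stochastic; the boost, the decay rate and the fading threshold are arbitrary assumptions, and the function names are hypothetical.

```python
def use(concentrations, name, boost=1.0):
    """Positive feedback: a prosumer used `name`, so its concentration grows."""
    concentrations[name] = concentrations.get(name, 0.0) + boost


def decay(concentrations, rate=0.1, threshold=0.01):
    """Negative feedback: every concentration shrinks; faded entries disappear."""
    for name in list(concentrations):
        concentrations[name] *= 1.0 - rate
        if concentrations[name] < threshold:
            # The atom or molecule has faded away from the compartment.
            del concentrations[name]
```

Calling decay at regular intervals and use at every prosumer access makes frequently used atoms and molecules persist, while unused ones eventually vanish unless a seed injects them again.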

    Summing up, the molecules of knowledge model is designed around the fol-

    lowing abstractions:

    seeds - the news sources;

    atoms - the NewsML/NITF tags;

    molecules - possibly meaningful bags of tags;

    chemical reactions - the reifications of the (possibly useful) relationships

    among the tags in a bag of tags;

    catalysts/inhibitors - the journalists, prosumers of knowledge.
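The abstractions just summarised can be given a first, purely illustrative shape as Python data classes. This is a sketch under my own assumptions, not the formal model: an atom is reduced to a tag name plus its value, a molecule to a set of atoms, a reaction to a method that consumes one unit of two atoms and produces one unit of their molecule, and all class, field and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Atom:
    tag: str    # name of the NewsML/NITF tag
    value: str  # content of the tag


@dataclass(frozen=True)
class Molecule:
    atoms: frozenset  # a "bag of tags", simplified here to a set of atoms


@dataclass
class Seed:
    source: str                  # the news source reified by the seed
    atoms: list                  # atoms the seed keeps injecting
    injection_rate: float = 1.0  # units injected per injection step


@dataclass
class Compartment:
    concentrations: dict = field(default_factory=dict)

    def inject(self, seed):
        """Let a seed add its atoms to the compartment."""
        for atom in seed.atoms:
            self.concentrations[atom] = self.concentrations.get(atom, 0.0) + seed.injection_rate

    def bond(self, first, second):
        """Reaction: consume one unit of each atom, produce one unit of their molecule."""
        for atom in (first, second):
            if self.concentrations.get(atom, 0.0) < 1.0:
                raise ValueError(f"not enough {atom}")
        for atom in (first, second):
            self.concentrations[atom] -= 1.0
        molecule = Molecule(frozenset({first, second}))
        self.concentrations[molecule] = self.concentrations.get(molecule, 0.0) + 1.0
        return molecule
```

Catalysts/inhibitors are not modelled as a class here: they would be whoever calls inject and bond, or changes the injection rate of a seed.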

    1.1 About topology

    Before the next section, in which each of these abstractions is detailed, I wish to

    further describe one aspect of the molecules of knowledge model/system that

    has only been mentioned until now: distribution.

    As the reader may remember, in the first Chapter I stated that the chemical

    metaphor alone won't be enough for my model, because it doesn't account

    for any kind of spatial aspect to be considered and thus managed. Such

    metaphor was then completed with the concept of chemical compartment

    drawn from biology, leading to the biochemical metaphor, able to model and

    properly deal with network topology related issues.

    I would like to remark here that such enhancement has not been made merely

    to give more expressive power to the model, but that it is strongly encouraged

    by the nature of the problem it tries to face, that is knowledge management

    in general. In fact, nowadays it is rather utopian to design a knowledge management

    system that is not distributed among different computational nodes,

    possibly crossing administrative domains and located at different places.

    Moreover, my chosen application domain is journalism, where distribution

    plays an essential role too. A possible use case for the molecules of knowledge

    model could be to help journalists working in a news outlet's newsroom:

    they will probably have their own personal devices (be they laptops,

    tablets or whatever) in which they store their news sources, annotations, self-produced

    articles and the like. Then the model with all its abstractions could

    be installed in every one of these devices, transforming each of them into a

    single chemical compartment, hence with its own seeds, atoms, molecules and

    chemical reactions, situated somewhere within the whole network of all the

    other chemical compartments, that is all other journalists (notice that this

    will actually be a mobile network).

    For these reasons, from now on I will always assume that the molecules of knowledge model is applied to a distributed network topology, in which every node is the chemical compartment belonging to a specific prosumer (hence influenced by a well-defined catalyst), where he/she stores his/her own seeds, atoms, molecules and chemical reactions.
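
    To make this topology concrete, here is a minimal Python sketch, not part of the model itself. It assumes, purely for illustration, that a compartment is a plain container owned by one prosumer and that neighbourhood is a symmetric relation; the names Compartment, Network, add and connect are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Compartment:
    """One node of the network: the chemical compartment of a single prosumer."""
    owner: str                                   # the prosumer (catalyst) influencing it
    seeds: list = field(default_factory=list)
    atoms: list = field(default_factory=list)
    molecules: list = field(default_factory=list)
    reactions: list = field(default_factory=list)


class Network:
    """Distributed topology: compartments plus neighbourhood links."""

    def __init__(self):
        self.compartments = {}   # owner -> Compartment
        self.neighbours = {}     # owner -> set of neighbouring owners

    def add(self, owner):
        self.compartments[owner] = Compartment(owner)
        self.neighbours[owner] = set()
        return self.compartments[owner]

    def connect(self, a, b):
        # Assumption: neighbourhood is symmetric (if a reaches b, b reaches a).
        self.neighbours[a].add(b)
        self.neighbours[b].add(a)


net = Network()
alice = net.add("alice")
bob = net.add("bob")
net.connect("alice", "bob")
alice.seeds.append("http://example.org/feed")   # stored only in Alice's compartment
print(sorted(net.neighbours["alice"]))          # ['bob']
print(bob.seeds)                                # []
```

    Each compartment keeps its own state, so whatever a journalist stores stays local until some diffusion mechanism moves it to a neighbour.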

    In Section 3 I will talk about spatial interactions and describe how to exploit distribution thanks to neighborhood relationships between compartments and to the diffusion mechanism of atoms/molecules (in truth I will only mention such relationships, because I will rely on a cited paper).

    2 Model abstractions

    In the following sections, each of the model abstractions just highlighted is given a set of requirements to satisfy according to the main goal of this thesis. Along with such requirements, possible solutions are described and a first pseudo-formal specification is given.

    2.1 Seeds

    The requirements for seeds can be derived directly from the brief introduction given at the beginning of the Chapter. Since they are the reification of any news source that a journalist would like to consider in his/her knowledge portfolio, they should carry some information about it. Moreover, they are responsible for the injection of atoms of knowledge, hence they should store meta-information about this process too.

    Focusing on news source identification and description, the NewsML and NITF standards provide a number of potentially useful tags, many of which were mentioned previously. Some kind of unique identifier for the news source is undoubtedly necessary too: since I wish to reuse as many features as possible from the NewsML standard, I will rely on URIs, which have the advantage of being highly encouraged by the W3C for the Semantic Web vision, for instance in its OWL language. This collection of tags, along with their content, could then be the first information to store in a seed, fulfilling the first requirement.

    Regarding the injection mechanism, three essential pieces of information should be remembered: i) first of all, the atoms to be spawned (whose internal structure is detailed in the next section); ii) then, the concentration of every atom to create, so as to generate the exact quantity of each at every injection step; iii) finally, the injection rate, to generate each atom at the right frequency/probability.

    Putting these observations together, the following could be a first pseudo-formal specification of a seed element (I will use a Prolog[23]-like syntax for its readability):

    seed(srcID, srcMeta, [atoms], [concentrations], [rates])

    where:

    srcID is the URI (or equivalent identifier) of the news source;

    srcMeta is the collection of the aforementioned NewsML tags;

    [atoms] is the list of every single atom to spawn;

    [concentrations] is the list of each atom's initial concentration (possibly different for each of them);

    [rates] is the list of the atoms' injection rates (again, possibly different for each of them).
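
    As an illustration only, the seed term can be mirrored by a small Python structure. The sketch assumes that the three lists are parallel (one concentration and one rate per atom) and that a rate is read as a per-step injection probability; the text leaves both the frequency and the probability reading open. The names Seed and inject, and the placeholder metadata, are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Seed:
    """Mirror of seed(srcID, srcMeta, [atoms], [concentrations], [rates])."""
    src_id: str            # URI of the news source
    src_meta: dict         # descriptive tags of the source and their content
    atoms: list            # atoms to spawn (kept opaque here)
    concentrations: list   # initial concentration of each atom
    rates: list            # injection rate of each atom

    def __post_init__(self):
        # Assumption: the three lists are parallel, one entry per atom.
        if not (len(self.atoms) == len(self.concentrations) == len(self.rates)):
            raise ValueError("atoms, concentrations and rates must have equal length")

    def inject(self, rng):
        """One injection step; assumption: a rate is a per-step probability."""
        return [(atom, conc)
                for atom, conc, rate in zip(self.atoms, self.concentrations, self.rates)
                if rng.random() < rate]


seed = Seed("http://example.org/news/1", {"someTag": "some content"},
            atoms=["a1", "a2"], concentrations=[10, 3], rates=[1.0, 0.0])
print(seed.inject(random.Random(0)))   # [('a1', 10)]
```

    With rates 1.0 and 0.0 the outcome does not depend on the random generator: the first atom is always injected with its concentration, the second never.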

    2.2 Atoms

    To shape a single atom of knowledge as fruitfully as possible, the main goal is to balance two competing needs: on the one hand it should embed enough knowledge to be useful from both the system's and the prosumers' point of view; on the other hand the atom is the most primitive piece of knowledge within the model, hence it should be kept as simple as possible.

    I will try to reach the needed equilibrium taking into account the following

    complementary facets:

    Granularity of knowledge. While grounding the chemical metaphor in the NewsML and NITF standards, I stated that any of their tags could be mapped to a single atom. Hence, following their structure and semantics, a six-level scale for the granularity of a piece of knowledge can be identified:

    1. the single NITF tag (finest granularity);

    2. a descriptive or administrative wrapper;

    3. the , or wrappers;

    4. the whole ;

    5. a single tag within the of a ;

    6. the whole container (coarsest granularity).

    Note that having a single abstraction able to cover all these different quantities of information may seem to overlap with the molecule abstraction, making the latter useless. This is actually wrong, because molecules are a completely different concept: an atom may be as comprehensive as needed but will always be a single indivisible unit of information; a molecule instead is the reification of a number of relationships between different atoms, possibly coming from different seeds.

    Context of knowledge. Any piece of knowledge could be misleading if taken out of its context, because the context is the set of environmental conditions needed to interpret it correctly. In other words, context gives, or at least enriches, the semantics of a piece of knowledge, ultimately allowing for a better (or simply correct) understanding of it.

    Thus it will undoubtedly be useful to embed a certain degree of semantic description in an atom, rather than its content alone. Here the NewsML and NITF standards come in handy with a few features: i) being standards, their tags have a well-defined meaning; ii) since they are implemented in XML, they are highly interoperable and easily exchangeable; iii) tag values too may have a formal semantics thanks to NewsCodes or external ontologies (coded as Knowledge Items).

    For these reasons, a first enrichment of an atom's content could be to also store the related NewsML/NITF tag that wraps it, but this alone isn't enough.

    It has already been explained how NITF tags can suffer from some ambiguity about their usage, but even more problems could be faced. Let's think about the following phrases: "Mr. Marchionne is CEO of FIAT" and "FIAT has provided a thousand new job opportunities". In both cases FIAT should be tagged with the same tag, but while in the first case it plays the role of the object, namely answering the question "Mr. Marchionne is CEO of What?", in the second it is the subject, hence the Who.

    Hence it could be useful to explicitly say which one of the famous 5 Ws of journalism the current tag is describing, that is, whether it is about the Who, What, Where, When or Why. That's another useful piece of information to store in an atom.

    That is not all. Since the values of NewsML and NITF tags could be drawn from controlled vocabularies or even ontologies, their meaning is asserted unambiguously once and for all by these taxonomies. Hence, I could inject into an atom some information to identify them, namely the QCode and catalogue: both are logical names that together address a web page (or even a local file, if their scope is local to the user's company) in which the schema is formally defined in both machine-readable and human-readable form.

    Relevance/Usefulness of knowledge. A defining property of a news item is its relevance, that is, how interesting it is perceived to be both by the professionals who manage it and by the target audience to whom it is directed. Moreover, every news item has some kind of usefulness, measured according to some criteria: for instance, the level of new knowledge acquired by a reader or even the economic revenues it could generate. These are somehow two sides of the same coin: just as more relevant news is expected to be more useful to readers/journalists, useful news may spread through readers and publishers, gaining relevance.

    Since atoms carry some piece of information extracted from a news item, it is quite natural to distribute the relevance/usefulness of the original source of knowledge as a whole among the (possibly) many atoms extracted from it.

    Another defining property of a news item is, as the word itself suggests, its novelty, that is, both how new the knowledge it provides is with respect to the current environment and how new it is with respect to the passing of time: obviously, as news items grow older they lose relevance and public interest, following a graceful degradation process. As done before for relevance/usefulness, this time-dependency property could easily be transferred to the atoms of knowledge: the less they are shared and used by cooperating journalists, the more they are going to lose their cultural/economic value.

    Since these three facets of a news item, namely relevance, usefulness and novelty, influence each other so deeply, they could all be modeled with a single abstraction: the concentration.

    From the biochemical metaphor, in fact, it is known that the concentration of an atom/molecule is a measure of its activity level, namely how much it could and should influence the overall chemical behaviour of the solution (system). Since such concentration is subject to a time-dependent fading mechanism, namely the decay of atoms/molecules, the mapping between relevance/usefulness and concentration is perfect!
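
    The text does not fix a specific decay law, so the following Python sketch is only one possible reading: it assumes exponential decay over time and a fixed additive boost each time a prosumer uses an atom or molecule. The functions decay and reinforce, the rate and the boost values are all hypothetical.

```python
import math


def decay(concentration, rate, elapsed):
    """Concentration left after `elapsed` time units (assumed exponential law)."""
    return concentration * math.exp(-rate * elapsed)


def reinforce(concentration, boost=1.0):
    """Assumed feedback: each use by a prosumer adds `boost` to the concentration."""
    return concentration + boost


unused = decay(10.0, rate=0.1, elapsed=5)
used = decay(reinforce(10.0, boost=4.0), rate=0.1, elapsed=5)
print(unused < 10.0, used > unused)   # True True
```

    Under these assumptions an atom that nobody uses loses activity over time, while one that is used in the meantime stays comparatively more relevant.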

    Summing up, an atom of knowledge should not carry only the content of a (piece of) news, that is, the tag along with the tagged term/phrase, because this way its semantics might not be clear. I have identified two other pieces of knowledge that are worth remembering and useful to better convey semantics: i) one of the 5 Ws and ii) the QCode and catalogue information. Moreover, concentration too should be made explicit, so as to model the atom's relevance/usefulness (and novelty too). As a last bit of information, since atoms are automatically injected by their own parent seed, it could be useful to bring some data from such seed to the atom.

    Here is a possible syntax for atoms:

    atom(srcID, info(tag, content), meta(w, qcode, catalogue), concentration)

    where:

    srcID is taken from the source seed;

    info(tag, content) is the actual piece of news the atom conveys, hence some content (from the whole paper down to a single term in it) along with its tag;

    meta(w, qcode, catalogue) is the additional information that helps clarify the atom's semantics, thus one of the 5 Ws and the QCode and catalogue information grounded in the NewsML/NITF standards;

    concentration is the current activity level of the atom. Notice that this value necessarily coincides with the one specified in the source seed only at injection time: later on it will evolve according to the system behaviour.
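
    The atom term translates almost literally into typed records. The Python sketch below is illustrative only: the class names, the tag name "tagX", the QCode "ex:fiat" and the catalogue URL are placeholders, not values taken from NewsML or NITF.

```python
from dataclasses import dataclass
from enum import Enum


class W(Enum):
    WHO = "who"
    WHAT = "what"
    WHERE = "where"
    WHEN = "when"
    WHY = "why"


@dataclass(frozen=True)
class Info:
    tag: str        # NewsML/NITF tag wrapping the content
    content: str    # the tagged term, phrase or larger piece of news


@dataclass(frozen=True)
class Meta:
    w: W            # which of the 5 Ws the tag describes
    qcode: str      # QCode of the value in its controlled vocabulary
    catalogue: str  # catalogue resolving the QCode scheme


@dataclass(frozen=True)
class Atom:
    src_id: str
    info: Info
    meta: Meta
    concentration: float


# Same tag and content, different role: object in one phrase, subject in the other.
obj = Atom("urn:example:news-a", Info("tagX", "FIAT"),
           Meta(W.WHAT, "ex:fiat", "http://example.org/catalogue"), 1.0)
subj = Atom("urn:example:news-b", Info("tagX", "FIAT"),
            Meta(W.WHO, "ex:fiat", "http://example.org/catalogue"), 1.0)
print(obj.info == subj.info, obj.meta.w is subj.meta.w)   # True False
```

    The two atoms share the same info term and are told apart only by the W stored in meta, which is the disambiguation the FIAT example calls for.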

    2.3 Molecules

    Molecules of knowledge may seem the most complex abstraction to deal with, because in the end all the others are built around them. In fact, chemical reactions consume seed-generated atoms to forge molecules, creating new knowledge within the system, while catalysts inspect them to acquire knowledge.

    In truth, a very simple interpretation of what a molecule is can be given, assuming that chemical reactions, to which molecules are deeply related and on which they depend, are properly shaped. How? Here follows my explanation.

    Since molecules of knowledge are reifications of interactions among different pieces of news, they are full of implicit semantics about such interactions. Moreover, molecules are hopefully composed pursuing some goal and according to some criteria: for instance, the chemical engine could try to aggregate atoms that are similar on a topic basis, for geographical reasons or because they are chronologically ordered. The implicit meaning that a certain molecule carries is then actually given by the particular chain of chemical reactions that shaped it over time.

    Thanks to negative feedback, there is no need to teach the system how to build only useful aggregations and how to detect and discard meaningless ones: the latter will simply fade away through an emergent natural selection process, driven both by the system's internal behaviour and by the interactions of external prosumers. There is then no reason to explicitly state either why a certain molecule has been generated or how its atoms are related to one another. In other words, the aforementioned semantics of aggregations could remain implicit: if relationships are relevant/useful, they will survive because a number of prosumers see some meaning in them; otherwise, if nobody finds them interesting, such molecules will simply decay until death.

    For these reasons, the simple interpretation I am talking about is that a molecule of knowledge could be viewed as a bag of atoms, hence a single unordered collection of somehow related atoms. According to this interpretation, a molecule could simply be shaped as follows:

    molecule([atoms], concentration)

    where:

    [atoms] is the list of all the atoms currently bonded together by the molecule, hence the pool of related pieces of knowledge that a certain chain of reactions has aggregated during the natural evolution of the system;

    concentration is the current concentration of the molecule.

    Please notice that an atom inside the [atoms] list does not have exactly the same internal structure as a standalone atom. Since it is now part of a greater aggregation, its concentration is no longer meaningful because the molecule has its own, hence it is removed from the atom's syntax.

    Thus, the complete structure of a molecule (omitting a whole list of atoms for brevity) should be as follows:

    molecule([atom(srcID, info(tag, content), meta(w, qcode, catalogue)), ...], concentration)
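
    A possible Python rendering of this structure is sketched below, assuming that a bag can be represented by a Counter (unordered, duplicates allowed) and that atoms are plain tuples; BondedAtom, bond and make_molecule are hypothetical names, and the tag names are placeholders.

```python
from collections import Counter, namedtuple

# Standalone atom: atom(srcID, info(tag, content), meta(w, qcode, catalogue), concentration).
Atom = namedtuple("Atom", "src_id info meta concentration")
# Atom bonded inside a molecule: same fields minus the concentration.
BondedAtom = namedtuple("BondedAtom", "src_id info meta")
Molecule = namedtuple("Molecule", "atoms concentration")


def bond(atom):
    """Drop the atom's own concentration when it becomes part of a molecule."""
    return BondedAtom(atom.src_id, atom.info, atom.meta)


def make_molecule(atoms, concentration=1):
    # A Counter models the bag of atoms: no ordering, duplicates allowed.
    return Molecule(Counter(bond(a) for a in atoms), concentration)


a1 = Atom("src1", ("tagA", "Termini Imerese"), ("where", "q1", "cat"), 7)
a2 = Atom("src2", ("tagB", "FIAT"), ("who", "q2", "cat"), 3)
m1 = make_molecule([a1, a2])
m2 = make_molecule([a2, a1])
print(m1 == m2)   # True: the order of the atoms does not matter
```

    The individual concentrations 7 and 3 disappear once the atoms are bonded; only the concentration of the molecule remains.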

    2.4 Chemical reactions

    In the previous section, in which an informal introduction to the model's abstractions was given, I stated a couple of interesting things regarding chemical reactions. First of all, they are responsible for the consumption of atoms and the production of molecules, but this is quite obvious. What's not so obvious is how molecules are produced and atoms are consumed, that is, which criteria are used to bind atoms together in a molecule and which mechanisms actually do so. I am now going to recall these interesting things.

    First of all, since most of the NewsML and NITF tags have well-defined dependency relationships, a chemical law could exploit them to pack some kind of NewsML/NITF-compliant molecule. For instance, the self-* system built upon this ongoing model could decide to pack together all the tags (along with their content) nested in a tag. This could happen because they are frequently accessed together, thus the system tries to reduce search latency: prior to the molecule all the single atoms have to be retrieved; with the molecule this is done in one shot by looking directly for it.

    Moreover, virtually every NewsML/NITF tag could have its admissible values collected, stored and formally defined by a controlled vocabulary or an ontology, hence semantic relationships too could be exploited by chemical reactions! When semantics enters the field of computation and interaction, a plethora of interesting and meaningful behaviours arises to be explored. For instance, the chemical engine may browse the source taxonomies of tag values to: i) discover whether two different terms are synonyms, hypernyms and the like, then decide to aggregate the corresponding atoms in a thesaurus molecule; or ii) navigate relationships among different concepts from the same ontology and reify such links, for example by understanding that the Minister of Defense is a member of the Government, and is thus on the staff of the Prime Minister, and reifying such reasoning by putting them both in a taxonomy molecule.

    Finally, the most obvious relationship between atoms must not be omitted: if they carry the same content they are undoubtedly related (maybe such a relationship is trivial, hence useless, but it exists anyway)! By content here I mean the true content, hence only the tagged term or phrase without considering the tag. This allows relating different atoms (thus possibly different news sources) in which the same thing is tagged differently, for instance when news A says "Termini Imerese is in trouble" and news B says "employees are occupying Termini Imerese factory": the first Termini Imerese could probably get one tag, because the term is used in place of FIAT's Termini Imerese factory, while the second could get a different tag, because Termini Imerese is really a city.

    Summing up, a first collection of patterns to join atoms into molecules could be based upon:

    the tag field inside the info(tag, content) term of an atom, in the case of a structural relationship between different NewsML/NITF tags;

    the whole meta(w, qcode, catalogue) term, if the relationship is semantic;

    the content alone inside the info(tag, content) term of an atom, whenever a subject-based link has to be reified into a molecule.
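
    The third pattern is the easiest to illustrate, because it needs no external schema or taxonomy. The Python sketch below groups atoms by content alone, ignoring the tag; the tuple layout, the function name group_by_content and the tag names "tagA", "tagB" and "tagC" are placeholders chosen for the example.

```python
from collections import defaultdict


def group_by_content(atoms):
    """Candidate content-based molecules: atoms sharing the same content, whatever the tag."""
    groups = defaultdict(list)
    for atom in atoms:
        src_id, (tag, content), meta = atom
        groups[content].append(atom)
    # Only groups with at least two atoms reify an actual relationship.
    return {content: group for content, group in groups.items() if len(group) > 1}


# Atoms as (srcID, (tag, content), meta) tuples.
atoms = [
    ("newsA", ("tagA", "Termini Imerese"), None),
    ("newsB", ("tagB", "Termini Imerese"), None),
    ("newsB", ("tagC", "employees"), None),
]
candidates = group_by_content(atoms)
print(list(candidates))                      # ['Termini Imerese']
print(len(candidates["Termini Imerese"]))    # 2
```

    The two Termini Imerese atoms end up together even though they come from different news sources and carry different tags.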

    I have now answered the first question posed at the beginning, which was about the possible criteria upon which molecules are composed. What's left is question two: which mechanisms should be used to aggregate atoms into molecules?

    The answer is directly provided by the biochemical metaphor: chemical reactions are the tool. I am not going to list all the possible concrete chemical reactions to inject into the system to obtain every possible instantiation of the patterns described above; I am just going to define the structure and semantics of a general-purpose chemical law for each of the patterns, in the sense of how many reagents it may have, of which kind, how they should be similar to one another, what the produced substance is, and the like. First of all, let's see the common look that every chemical law will have.

    Following literally the interpretation of molecules as bags of atoms, a chemical reaction simply takes a list of atoms as input reagents and produces a single molecule as output product. Both concentrations involved, those of reagents and of products, are a single unit, thus a single instance of the input atoms is consumed (one each) and a single instance of the output molecule is generated.

    But this way molecules cannot take part in a chemical reaction as reagents, hence they cannot be consumed except by prosumers. This is undesirable, because molecules are living and evolving entities pretty much like atoms, thus nothing should forbid them to join one another or to absorb additional atoms.

    Adding such a feature, a generic chemical reaction could look like this (omitting internal fields for the sake of clarity):

    ( atom | molecule ) --r_join--> molecule([atoms], concentration++)

    where reagents could be any combination of any number of atoms and molecules, while the product is exactly one molecule aggregating all the atoms on the left-hand side. This suggests that reagent molecules are somehow unpacked to extract their atoms and inject them into the new molecule. Please remember what was said about the [atoms] list in the previous section to avoid confusion regarding notation.
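
    The behaviour of this generic law can be sketched in Python as follows. The sketch assumes that the solution is a multiset mapping each atom or molecule to its concentration, that atoms and molecules are tagged tuples, and that the reaction rate is ignored (the law is simply applied on demand); join and atoms_of are hypothetical names.

```python
from collections import Counter


def atoms_of(reagent):
    """Unpack a reagent: an atom yields itself, a molecule yields its atoms."""
    kind, payload = reagent
    return [payload] if kind == "atom" else list(payload)


def join(solution, reagents):
    """Apply one generic join: consume one unit of each reagent, produce one molecule.

    `solution` is a Counter mapping each atom/molecule to its concentration.
    """
    needed = Counter(reagents)
    if any(solution[r] < n for r, n in needed.items()):
        return None                                  # some reagent is not available
    solution.subtract(needed)                        # one unit of each reagent consumed
    bag = tuple(sorted(a for r in reagents for a in atoms_of(r)))
    product = ("molecule", bag)
    solution[product] += 1                           # concentration++
    return product


a, b, c = ("atom", "x"), ("atom", "y"), ("atom", "z")
solution = Counter({a: 2, b: 1, c: 1})
m = join(solution, [a, b])          # two atoms -> molecule
big = join(solution, [m, c])        # molecule + atom -> larger molecule
print(big)                                       # ('molecule', ('x', 'y', 'z'))
print(solution[a], solution[m], solution[big])   # 1 0 1
```

    The second application shows the unpacking step: the molecule used as reagent is consumed and its atoms reappear, together with the new one, in the product.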

    Now that the most general-purpose chemical-like law has been presented, it is time to describe its concrete applications to obtain the aforementioned patterns. As already said, the following are still general-purpose laws, because they only state what should be similar to what for the reaction to be applied, and similar information.

    The first chemical reaction is meant to produce molecules that aggregate structurally related atoms, based upon the well-defined relationships among NewsML and NITF tags. Assuming the use of apices (') to denote some structural dependency among tags, such a chemical reaction could be as follows (omitting unnecessary fields to enhance readability):

    ( atom(srcID, info(tag', _), _, 1) | molecule([atom(srcID, info(tag', _), _), ...], 1) ) --r_structural_join--> molecule([atoms], concentration++)

    This law states that: i) only atoms/molecules all coming from the same news source can be bound together; ii) the tag fields of such reagents should have some dependency according to the structural constraints of the NewsML and NITF standards. Other aspects of the law are inherited from the general-purpose one already described: for instance, one unit of concentration is involved, reagents can be in any number, and input molecules should be unpacked.
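
    The two applicability conditions of the structural law can be checked with a few lines of Python. The sketch is illustrative only: the NESTED_IN table is a made-up stand-in for the real NewsML/NITF nesting rules, the criterion for being structurally related (same parent, or parent and child) is an assumption, and molecules are assumed to be already unpacked into their atoms.

```python
# Hypothetical parent -> children table standing in for the NewsML/NITF schema.
NESTED_IN = {"wrapper": {"childA", "childB"}}


def structurally_related(tag1, tag2):
    """True if one tag nests the other, or both nest in the same parent (assumed criterion)."""
    for parent, children in NESTED_IN.items():
        family = children | {parent}
        if tag1 in family and tag2 in family:
            return True
    return False


def can_structural_join(atoms):
    """Atoms are (srcID, tag, content) triples; molecules are assumed already unpacked."""
    if len(atoms) < 2:
        return False
    same_source = len({src for src, _, _ in atoms}) == 1
    tags = [tag for _, tag, _ in atoms]
    related = all(structurally_related(tags[0], t) for t in tags[1:])
    return same_source and related


print(can_structural_join([("n1", "childA", "x"), ("n1", "childB", "y")]))   # True
print(can_structural_join([("n1", "childA", "x"), ("n2", "childB", "y")]))   # False
print(can_structural_join([("n1", "childA", "x"), ("n1", "other", "y")]))    # False
```

    The second call fails on condition i) (different news sources), the third on condition ii) (no structural dependency between the tags).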

    Going on to the second aggregation pattern, I assume that symbols () and () denote some kind of semantic relationship between terms, for instance according to a thesaurus or ontology involving such terms. This kind of NewsCodes-based chemical reaction could be shaped as follows:

    ( atom(_, info(_, content), meta(_, qcode, catalogue), 1) | molecule([atom(_, info(_, content), meta(_, qcode, catalogue)), ...], 1) ) --r_semantical_join--> molecule([atoms], concentration++)

    Such transition rule states that: i) no m