
Are Raw RSS Feeds Suitable for Broad Issue Scanning? A Science Concern Case Study1

Mike Thelwall, Rudy Prabowo, Ruth Fairclough
School of Computing and IT, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1SB, UK.
E-mail: [email protected] Tel: +44 1902 321470 Fax: +44 1902 321478
E-mail: [email protected] Tel: +44 1902 321000 Fax: +44 1902 321478
E-mail: [email protected] Tel: +44 1902 321000 Fax: +44 1902 321478

Broad issue scanning is the task of identifying important public debates arising within a given broad issue; Really Simple Syndication (RSS) feeds are a natural information source for investigating broad issues. RSS, as originally conceived, is a method for publishing timely and concise information on the Internet, for example about the main stories in a news site or the latest postings in a blog. RSS feeds are potentially a non-intrusive source of high quality data about public opinion: monitoring a large number may allow quantitative methods to extract information relevant to a given need. In this paper we describe an RSS feed-based co-word frequency method to identify bursts of discussion relevant to a given broad issue. A case study of public science concerns is used to demonstrate the method and assess the suitability of raw RSS feeds for broad issue scanning (i.e. without data cleansing). However, an attempt to identify genuine science concern debates from the corpus by investigating the top 1000 ‘burst’ words found only two genuine debates. The low success rate was mainly caused by a few pathological feeds that dominated the results and obscured any significant debates. The results point to the need to develop effective data cleansing procedures for RSS feeds, particularly if there is not a large quantity of discussion about the broad issue, and a range of potential techniques is suggested. Finally, the analysis confirmed that the time series information generated by real-time monitoring of RSS feeds could usefully illustrate the evolution of new debates relevant to a broad issue.

Introduction

For many types of social science research and market research (Pikas, 2005) there is a need for information about public opinion or public reaction to certain topics. The Internet may help to address this need because it is a natural new source of easily available information about the beliefs or activities of a wide section of society (Burnett & Marshall, 2002; Hine, 2000; Schaap, 2004). The widespread use of large information sources like those available on the Internet is a modern phenomenon, part of the ‘informational turn’ of science (Wouters, 2000). One example of a specific social science application is the study of public science debates (e.g., Corbett & Durfee, 2004), which is pursued in this paper. In recent years there have been a few influential examples of public science debates that have reached a level where public opinion has influenced science policy. In the cases of genetically modified food (Hagendijk, 2004; Klintman, 2002) and stem-cell research (Hellsten & Leydesdorff, 2004; Leydesdorff & Hellsten, 2005, to appear), highly technical scientific issues have not been left to the scientists but have been influenced by lay opinion. From a science policy perspective, extensive debates can damage public confidence, and this confidence is now an important part of modern knowledge economies (Leydesdorff & Etzkowitz, 2003). Hence there is a need for early warning systems, perhaps using Internet-based information sources, to give policy makers time to respond rapidly to new public science concerns.

We introduce the phrase broad issue scanning to describe the task of identifying and tracking important public debates arising within a given broad issue, such as public science concerns. This differs from issue analysis (described below) in that the issue is broad enough so that the individual debates are likely to be separate and to use completely different terminology. It differs from environmental scanning (Wei & Lee, 2004) in that the focus is on public debates rather than commercial competitive intelligence, although environmental scanning is sometimes drawn upon in non-commercial contexts, such as government forecasting and trend detection (Cairns, Wright, Bradfield, van der Heijden, & Burt, 2004; Vander Beken, 2004).

1 This is a preprint of an article to be published in the Journal of the American Society for Information Science and Technology © copyright 2005 John Wiley & Sons, Inc. http://www.interscience.wiley.com/

Researchers seeking to extract information from Internet technologies using either qualitative or quantitative techniques have studied many different types, including e-mail (Koku, Nazer, & Wellman, 2001), newsgroups (Bar-Ilan, 1997; Caldas, 2003), web sites (Schaap, 2004), and blogs (Nardi, Schiano, Gumbrecht, & Swartz, 2004). This research is often direct in the sense of seeking to understand how the new technology is used, but can also be indirect: using online information to illuminate research questions that are not intrinsically technology-centred. For qualitative research, blogs are perhaps the ideal data source because they are easy to create and update and cover a wide variety of types of use (Blood, 2004; Sunstein, 2004), even if their creators are not a fully representative cross-section of society (Gill, 2004; Matheson, 2004). Blog analysis for financial gain (Smith, 2005) and for competitive intelligence purposes (Pikas, 2005) has also been discussed. For large-scale quantitative analysis, however, blogs are not ideal because they are highly repetitive and complex in structure, making automated analyses difficult. The Really Simple Syndication (RSS) technology, in contrast, seems ideally suited to automated processing because it allows extensive metadata and concise site descriptions (see below). Many blogs, news sources and other web sites maintain RSS feeds that give brief descriptions of updated content on their sites. RSS feeds seem set to become ubiquitous on the Internet after a slow start (Notess, 2002), with all the major browsers offering support for them by the end of 2005 (BBC, 2005). Currently (July, 2005), most web users need RSS reader software to subscribe to the feeds of relevant sites, using this as a way of automatically checking for updates. A monitoring program may poll the feeds hourly, reporting the titles of each new feed item found. If the user is interested in a title, they can click on it to launch the full report in the original web site. Automatic monitoring of RSS feeds, for example by word frequency analysis, could give useful information about public opinion. Some researchers have tried similar tasks, with a prominent example being IBM’s Web Fountain project (Gruhl, Guha, Liben-Nowell, & Tomkins, 2004). Nevertheless, no previous research has assessed the suitability of RSS feeds for the identification and tracking over time of discussions related to a broad issue or concern. As discussed below, previous issue analysis research methods have been essentially retrospective, whereas RSS feeds offer the potential, perhaps for the first time, for a real-time issue analysis, i.e. one based upon scanning current information sources.

There are two relevant research traditions for web data analysis, which will be described here as purist and pragmatic. Either could potentially be suited to broad issue scanning. The purist approach is to analyse an Internet phenomenon as it is found, seeking to describe it as accurately as possible. The pragmatic approach is to analyse a phenomenon from the perspective of attempting to gain information about an underlying phenomenon, rather than the Internet data (i.e. for indirect research). The key difference between the two approaches is that the former typically does not use any data cleansing whereas the latter tends to use extensive data cleansing. To illustrate the difference, statistical physics researchers have taken samples of web page links in order to model the web and web growth, without significant data cleansing (Barabási, 2002; Cothey, 2004; Huberman, 2001), whereas the information science tradition of analysing samples of web page links in academic sites may use various types of data cleansing (alternative counting methods, excluding duplicate pages, excluding pages not authored by the web site owner) in order to better use links to infer underlying scholarly communication patterns (Björneborn, 2001; Thelwall, 2004). Another Internet example where data cleansing heuristics are necessary is web log file analysis, because of the need to filter out the activities of web crawlers in order to understand human visiting or browsing patterns (e.g., Spink, Wolfram, Jansen, & Saracevic, 2001; Wheeldon & Levene, 2003). However, data cleansing is to be avoided unless strictly necessary because it is a time-consuming, labour-intensive process (Kim, Choi, Hong, Kim, & Lee, 2003; Pyle, 1999). It must also be conducted carefully in academic research because it tends to involve human judgement and/or heuristics (e.g., Hernandez & Stolfo, 1998; Li, Zhang, & Zhang, 2003; Shahri & Barforush, 2004).

In this study, we seek to assess the suitability of RSS feeds as a data source for broad issue scanning, using public concern about science policy as a case study. We take a purist approach, avoiding significant data cleansing, despite the indirect nature of the task (as defined above). A purist approach is useful for exploratory research (e.g., Stokes, 1997) in order to gain insights into the phenomenon of RSS feeds, and also to check whether data cleansing is needed. The insights gained can be used in future pragmatic research in order to design effective data cleansing strategies, if these are proven necessary. The objective of this preliminary research is therefore primarily descriptive: to describe RSS feeds from the perspective of identifying emerging topics relative to a broad issue. A case study approach is taken, seeking to identify emerging debates relevant to the broad issue of public science concerns.

Background and related research

Issue tracking

Within information science there are traditions for analysing the structure of information through the analysis of documents, particularly for the scientific literature. In bibliometrics, for example, scientific fields may be illustrated through diagrams generated from citations between relevant journal articles (Small, 1999), co-authorship patterns (White & Griffith, 1982) or the co-occurrence of words in the titles of academic papers (Leydesdorff, 1989). An important trend in bibliometrics is a concern with causative factors to explain the patterns discovered, often investigated through qualitative analysis (Borgman & Furner, 2002; Cronin, 1984). Bibliometric results can fruitfully be compared over different time periods in order to study the evolution of scientific fields or broader aspects of science, such as patterns of international communication (Glänzel, 2001). In addition, there are also a number of studies that analyse the growth of scientific literature, either globally or within specific fields (Price, 1963).

Issue tracking is the task of monitoring a general issue over time. Early issue tracking showed that the growth in an issue could be retrospectively analysed through keyword tracking in publication databases. A case study of “acid rain” was used, and the tracking was conducted through academic publication databases (Lancaster & Lee, 1985). The method was successful in terms of producing interesting results, but revealed limitations, including the difficulty of identifying relevant papers before the topic terminology became universally accepted. Similar methods can now be used on a wider range of databases, so issue tracking is not restricted to academic topics and the innovative selection of databases may reveal the diffusion of ideas between different sectors of society (Wormell, 2000). A generic limitation of issue tracking based upon publication databases, however, is its retrospective nature.

The web can be conceptualised as a large database, but its disadvantage for issue tracking is that web pages can die out, so the web is not an accurate historical record. The web can be a good source for a snapshot impression of an issue (Thelwall, Vann, & Fairclough, 2006), but to be used for historical information, researchers must set up experiments to capture or monitor collections of web pages (Bar-Ilan & Peritz, 2004; Koehler, 2004; Rousseau, 1999), must accept the limitations of search engines’ memories (Leydesdorff & Curran, 2000; Hellsten, Leydesdorff, & Wouters, 2006; Wouters, Hellsten, & Leydesdorff, 2004), or must use a large scale repository, such as the Internet Archive’s (archive.org), or its Wayback Machine time series search facility. As yet, however, there is no easily accessible and reliable method of generating accurate time series data from the web without long term data collection exercises.

Note that issue tracking is different to the computer science tasks of topic identification and tracking, which are typically designed for the discovery of unknown topics and subsequently tracking them in a corpus, such as one based upon newswire feeds (Clifton, Cooley, & Rennie, 2004). The concept of a topic is much narrower than that of a broad issue. Topic identification has also been applied to search engine logs (Ozmutlu & Cavdur, 2005), giving a problem with some common characteristics. Similar techniques, time series analysis of word frequencies, are used in linguistics to identify changes in language use (e.g., Meibauer, Guttropf, & Scherer, 2004).

RSS feeds: A technical overview

The RSS format is an XML initiative designed to exchange summary information in a compact format. Its model is the syndication of news stories provided by companies like Reuters. Although many versions of RSS are commonly in use, there are two major types that forked from version 0.9: a simple type and a more sophisticated version (Hammersley, 2005; Hammond, Hannay, & Lund, 2004). The simple type, leading to the Atom format (www.atomenabled.org), was designed for publishing general content, such as web site and blog updates. The sophisticated variant, RSS 1.0, is extensible, using the Resource Description Framework (RDF) to allow standardised information to be published as metadata, including, but not limited to, the Dublin Core initiative (dublincore.org). This has potential semantic web applications (Karger & Quan, 2004). As an example application, the Publisher Requirements for Industry Standard Metadata (PRISM, www.prismstandard.org) initiative allows journals to use a common structured format to publish article metadata in RSS feeds (e.g., <prism:volume> for the volume number of an article). Feeds are typically automatically produced by a special purpose program or can be built into other content management programs such as blog software (e.g., blogger v5.0).

In essence, an RSS feed is a URL that returns an XML document in one of the accepted RSS formats, possibly with informal additions, containing at its heart a list of ‘items’ carrying its main content. Each item is typically a summary of a distinct piece of information, such as the following two, which have been taken (and simplified) from different feeds.

<item>
  <title>Toxic Waste Fire Evacuates 23,000 in Ark.</title>
  <link>http://www.allheadlinenews.com/cgi-bin/news/news.cgi?id=1104704187</link>
  <description>AllHeadlineNews.com Sun, 2 Jan 2005 22:15:07 GMT</description>
</item>

<item>
  <title>Weather Report</title>
  <link>http://bunsen.tv/2004/09/weather-report.html</link>
  <description>People In Florida should probably find a new place to live.</description>
  <dc:creator>Bunsen !</dc:creator>
  <dc:date>2004-09-11T17:34:06Z</dc:date>
</item>

The list of items in an RSS feed will be periodically updated. For instance, every hour new items may be added and the oldest ones in the list removed. Hence, to check for new content, RSS monitoring software can compare the current list of items with the previous list obtained from the same feed, reporting only the newer ones.
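The comparison of successive item lists can be sketched as follows. This is a minimal illustration rather than the monitoring software used in the study: the function names `item_keys` and `new_items` are hypothetical, items are identified by their (link, title) pair as an assumption, and only un-namespaced `<item>` elements (RSS 0.9x/2.0 style) are handled, so RDF-based RSS 1.0 feeds would need namespace-aware matching.

```python
import xml.etree.ElementTree as ET


def item_keys(feed_xml):
    """Return a (link, title) key for every <item> in the feed, in feed order."""
    root = ET.fromstring(feed_xml)
    keys = []
    for item in root.iter("item"):
        link = (item.findtext("link", default="") or "").strip()
        title = (item.findtext("title", default="") or "").strip()
        keys.append((link, title))
    return keys


def new_items(previous_xml, current_xml):
    """Report only the items of the current poll that were absent from the previous poll."""
    seen = set(item_keys(previous_xml))
    return [key for key in item_keys(current_xml) if key not in seen]


# Hypothetical example feeds (not taken from the study's corpus).
previous = ("<rss><channel>"
            "<item><title>A</title><link>http://example.org/a</link></item>"
            "</channel></rss>")
current = ("<rss><channel>"
           "<item><title>B</title><link>http://example.org/b</link></item>"
           "<item><title>A</title><link>http://example.org/a</link></item>"
           "</channel></rss>")
print(new_items(previous, current))
```

Note that a feed which reposts old content with small changes would defeat such a simple key, because every modified repost would count as new.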

A feature of XML is that documents can contain text that the parser is intended not to interpret as markup, flagged by the key word CDATA. HTML content can be placed inside RSS feeds via a CDATA section. This moves away from the original intention of RSS, which was to provide only summary information, but is permitted, for example as shown below. In principle the content could be long, including a whole web page.

<content:encoded><![CDATA[<p><b>Simple </b>example</p>]]></content:encoded>

To illustrate a use of the CDATA field, the blog alterslash.org has offered two RSS feeds: a normal one with just metadata (http://www.alterslash.org/rss.xml) and an “extended” feed that included almost the full content of the site home page in CDATA fields embedded in a description field (http://www.alterslash.org/rss_full.xml, e.g., 8 July 2005).
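To make the effect of CDATA content on word counts concrete, the sketch below extracts the text embedded in a `content:encoded` element and strips its HTML tags. It is only an illustration under stated assumptions: the helper names `html_to_text` and `encoded_text` are hypothetical, and the namespace URI is the one conventionally used for the RSS content module, which the item must declare.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespace conventionally bound to the "content" prefix (assumed here).
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"


class TextExtractor(HTMLParser):
    """Collect only the character data of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces and normalise runs of whitespace to single spaces.
    return " ".join("".join(parser.parts).split())


def encoded_text(item_xml):
    """Return the plain text held in an item's content:encoded CDATA section."""
    item = ET.fromstring(item_xml)
    # The XML parser hands back CDATA content as ordinary element text.
    html = item.findtext("{%s}encoded" % CONTENT_NS, default="") or ""
    return html_to_text(html)


example_item = (
    '<item xmlns:content="http://purl.org/rss/1.0/modules/content/">'
    "<content:encoded><![CDATA[<p><b>Simple </b>example</p>]]></content:encoded>"
    "</item>"
)
print(encoded_text(example_item))
```

Every word recovered this way becomes part of the item, so a feed that embeds whole pages contributes far more words per item than a summary-only feed.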

Quantitative RSS and blog analyses

For research purposes, some companies and academics have developed specialist RSS monitoring systems to automatically track large numbers of RSS feeds. There is no authoritative single source for RSS feed lists, but there are many web sites that host large databases designed to allow users to search for relevant feeds. A large-scale RSS monitoring system must therefore use heuristics to identify and select its feeds. For example, a system may monitor all RSS feeds, or restrict its list to news feeds or blog feeds.

As part of the development of blog and RSS monitoring systems, topic popularity time series have been produced and analysed, using word frequency counts (Glance, Hurst, & Tomokiyo, 2004; Gruhl et al., 2004; Kumar, Novak, Raghavan, & Tomkins, 2003). Kumar et al. (2003) analysed the link structure of blogs, crawling them directly. They observed that discussions often came in bursts of activity, with information propagating significantly between blogs. The way in which information diffuses in blogspace has attracted particular attention (Adar, Zhang, Adamic, & Lukose, 2004; Gruhl et al., 2004), particularly because blogspace is a rare example of an environment that can be monitored and can host reasonably self-contained discussions so that direct evidence of information diffusion can be collected. The research of Gruhl et al. (2004) used 11,804 RSS blog feeds (a total of 401k items) plus 14 RSS news channels. Blog topics were characterised into three general types: a reasonably steady volume of chatter; ‘spiky’ chatter with occasional externally-induced significant increases; and topics that are rarely discussed except when influenced by external events (Gruhl et al., 2004). Presumably there are also cases where there is a general trend for increasing or decreasing topic popularity, either sudden or gradual. From this research it is clear that it is possible to identify popular topics from large collections of feeds, and that these topics will have different dynamics. Of particular concern is the concept of resonance: whilst most postings attract no visible attention, some seem to strike a chord and create a recordable burst of discussion.

In addition to the private feed monitoring systems of researchers and companies, there are also some web sites that offer selected statistics from large corpuses of RSS feeds or blogs. For example www.daypop.com offered a list of the top 20 blog words (www.daypop.com/burst/, 11 July, 2005) and the top 20 news words in terms of “heightened usage” (www.daypop.com/newsburst/, 11 July, 2005), the top 40 links in blogs (www.daypop.com/top/, 11 July, 2005) and the top 100 blogs in terms of citations (links from other blogs). Another interesting list is the MIT Media lab’s link-based “most contagious information currently spreading in the weblog community” (www.blogdex.net, 11 July, 2005). Other web sites also give statistical information about blogs, typically in the form of top 100 lists (e.g., blogstreet.com). Whilst these web sites can give useful insights for researchers, they are not an optimal choice for RSS or blog research because the information they give is restricted to what the developers wish to make available and they typically do not publish the full details of their methodologies. In particular, the origins of the corpus and information about low frequency words or unpopular feeds would be difficult to get from statistical web sites, simply because it would probably not be in their commercial interests to provide information that would be of little value to most people.

Qualitative blog analyses

Although at the time of writing there did not seem to be any qualitative research into RSS feeds, there is a huge body of qualitative blog research (e.g., Huffaker & Calvert, 2005; Matheson, 2004), which explores issues such as politics, journalism and language use. The most relevant study is that of Kutz and Herring (2005), who used repeated downloading of news web sites, once per minute. They employed content analysis to analyse how individual news stories were updated. The monitoring of sources on a minute-by-minute basis allowed them to identify interesting short-term changes such as the addition of ideology to news stories after initial fact-driven versions had been posted.

Method

Data Collection: The RSS monitor and evaluator system

A new RSS feed monitoring and processing system, the Mozhdeh RSS monitor, was constructed to gather the data used in this paper. It was based upon existing automatic methods (Gruhl et al., 2004) but with additions and modifications for the new task. The raw data for this project was a collection of 19,587 RSS feeds (almost double the number of the Gruhl paper) culled from a wide range of sources including Google searches, RSS feed sites and major online news sources. A special effort was made to identify as many science and technology-related feeds as possible, as well as personal blog feeds. Preference was given to English language feeds but non-English feeds were not excluded. The feeds are therefore an ad-hoc collection tailored to the task. Each feed was polled on an hourly basis, recording the time of polling and all new feed items (i.e. items that were not present in the previous feed from the same source). A basic algorithm was implemented to decrease the polling frequency of infrequently updated feeds – an ethical consideration to avoid unnecessary usage of others’ computing resources.

A key difference with the Gruhl system is that the program can be primed with a Boolean expression (e.g., “dog AND (cat OR kitten)”) and will then perform all subsequent analysis on only the RSS items matching the set expression. This feature allows the identification of topics that are relevant to a given broad issue, as characterised by the chosen Boolean expression.
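One way such Boolean priming could work is sketched below. This is not the Mozhdeh implementation, which the paper does not describe in detail. It assumes upper-case AND and OR operators, parentheses for grouping, AND binding more tightly than OR, and matching on whole lower-cased words. All function names are hypothetical.

```python
import re


def tokenize(expr):
    """Split an expression into parentheses and bare tokens."""
    return re.findall(r"[()]|[^\s()]+", expr)


def parse(tokens):
    """Build a nested tuple tree from the tokens (OR binds loosest, then AND)."""
    pos = 0

    def parse_or():
        nonlocal pos
        node = parse_and()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            node = ("OR", node, parse_and())
        return node

    def parse_and():
        nonlocal pos
        node = parse_atom()
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            node = ("AND", node, parse_atom())
        return node

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if tok in (")", "AND", "OR"):
            raise ValueError("unexpected token: " + tok)
        return ("WORD", tok.lower())

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return tree


def item_words(text):
    """Lower-cased set of the words in an item's text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def matches(node, words):
    """Evaluate a parsed expression against the word set of one item."""
    if node[0] == "WORD":
        return node[1] in words
    left = matches(node[1], words)
    right = matches(node[2], words)
    return (left and right) if node[0] == "AND" else (left or right)


tree = parse(tokenize("dog AND (cat OR kitten)"))
print(matches(tree, item_words("My dog chased a kitten")))
```

Parsing the expression once and evaluating the resulting tree per item keeps the cost of filtering millions of items low.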

A second important difference is that word frequencies alone are used to generate time series for the identified postings (i.e. without natural language processing (NLP) techniques, ontologies, links or thesauri). These and statistical information content measures (e.g., Sebastiani, 2002; Yang & Pedersen, 1997) are avoided in order to obtain intuitive, fast data and to minimise the risk of missing new topics because they centre on new terms, or use language in original ways that could mislead linguistic techniques. This is a specific concern for science policy debates, which have previously been known to play with language as part of their rhetorical strategy (e.g., Hellsten, 2003).

Data was collected from November 2004, but analysed from February 5, 2005 to April 6, 2005, after the corpus had reached full size. This produced a total of 5,776,263 separate RSS feed items. Dynamic term-based indexes were implemented that catalogue for each feed item the list of words contained, the owning feed and the posting date. For each word, the identifier of each feed item containing it was indexed, allowing the rapid generation of word-based time series. The index was used to automatically select the apparent ‘science debate’ postings using the method above, producing a total of 19,175 items.
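A term-based index of this general kind might look like the following sketch. The class name `FeedItemIndex`, its method names and the tokenisation rule are assumptions for illustration only, and the sample items are invented.

```python
import re
from collections import defaultdict
from datetime import date


class FeedItemIndex:
    """Toy inverted index: word -> ids of the feed items that contain it."""

    def __init__(self):
        self.items = {}                   # item id -> (owning feed, posting date)
        self.postings = defaultdict(set)  # word -> set of item ids

    def add(self, item_id, feed_id, posted, text):
        """Record an item's metadata and index each distinct word once."""
        self.items[item_id] = (feed_id, posted)
        for word in set(re.findall(r"[a-z0-9']+", text.lower())):
            self.postings[word].add(item_id)

    def daily_counts(self, word):
        """Number of items containing the word, grouped by posting date."""
        counts = defaultdict(int)
        for item_id in self.postings.get(word, ()):
            counts[self.items[item_id][1]] += 1
        return dict(counts)


index = FeedItemIndex()
index.add(1, "feedA", date(2005, 3, 5), "Risk analysis of the shuttle")
index.add(2, "feedB", date(2005, 3, 5), "No risk, no risk")
index.add(3, "feedA", date(2005, 3, 6), "A risky plan and a risk")
print(index.daily_counts("risk"))
```

Because each word maps directly to the items that contain it, a time series for any word can be produced without rescanning the item texts.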

Analysis

A heuristic was used to identify postings (items) that were most likely to be related to the broad issue of public science concern. Items were first identified as science-relevant if they contained one of the words {science scientist scientists scientific research researcher researchers researching researched}. They were additionally identified as relating to public science concerns if they also contained one of the following set of concern words {argue argued fear afraid worry worried concern concerned frightened scare scared risk risked risky}. The word lists were constructed by introspection and scanning postings judged to be about science debates. The collection of items containing at least one word in each of the two sets is labelled the science concern corpus. The words in this corpus are co-words in the sense of co-occurring with one ‘science’ word and one ‘concern’ word.
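The two-set heuristic can be expressed compactly, as in the sketch below. The word lists are those given above; the function names and the tokenisation rule (lower-casing and splitting on non-letters) are assumptions, since the paper does not specify how words were extracted.

```python
import re

SCIENCE_WORDS = set(
    "science scientist scientists scientific research researcher "
    "researchers researching researched".split()
)
CONCERN_WORDS = set(
    "argue argued fear afraid worry worried concern concerned "
    "frightened scare scared risk risked risky".split()
)


def words_in(text):
    """Lower-cased set of the words in an item's text (assumed tokenisation)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def is_science_concern(text):
    """True if the item has at least one science word and one concern word."""
    words = words_in(text)
    return bool(words & SCIENCE_WORDS) and bool(words & CONCERN_WORDS)


print(is_science_concern("Scientists fear the new virus"))
```

Matching is on exact word forms, so variants outside the lists (for example "concerns" or "fears") would not select an item under this sketch.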

A time series was generated for each word occurring in the science concern corpus: for each day the proportion of science concern feed items containing the word was calculated (hereafter: relative word frequency). A debate may be characterised by an increase in the quantity of discussion around a topic and hence a logical method to identify individual science concern debates is to search for time series that significantly increase in value. Although debates may increase rapidly or slowly build up, as a practical consideration only rapidly increasing debates were sought. Four different methods for identifying words indicative of debates were assessed, as described below, and motivated by Gruhl et al. (2004). In all cases the first 80 days (whilst the RSS feed list was being built) were not tested for debates but were used in the calculation of average word frequencies.
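The relative word frequency series can be illustrated with a short sketch. It assumes each corpus item is reduced to a (day index, set of words) pair. The function name is hypothetical, and reporting 0.0 for days with no corpus items is an assumption, since the paper does not say how such days were treated.

```python
from collections import Counter


def relative_frequency_series(items, word, num_days):
    """For each day, the proportion of that day's corpus items containing the word."""
    totals = Counter()      # day -> number of corpus items
    containing = Counter()  # day -> number of corpus items containing the word
    for day, words in items:
        totals[day] += 1
        if word in words:
            containing[day] += 1
    # Days with no corpus items are reported as 0.0 (an assumption).
    return [containing[d] / totals[d] if totals[d] else 0.0 for d in range(num_days)]


# Invented example: four items spread over three of four days.
corpus_items = [
    (0, {"risk", "ago"}),
    (0, {"risk"}),
    (1, {"ago"}),
    (2, {"risk"}),
]
print(relative_frequency_series(corpus_items, "ago", 4))
```

Using proportions rather than raw counts keeps the series comparable across days with different numbers of science concern items.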

Spike: The relative word frequency r(d) on a given day d was at least 5 times higher than the average relative word frequency of all previous days.

Short burst: The minimum r(d,3) = min{r(d), r(d+1), r(d+2)} of the relative word frequencies on three consecutive days d, d+1, d+2 was at least 5 times higher than the average relative word frequency of all previous days.

Medium burst: The minimum r(d,5) of the relative word frequencies on five consecutive days d, d+1, d+2, d+3, d+4 was at least 5 times higher than the average relative word frequency of all previous days.


Long burst: The minimum r(d,9) of the relative word frequencies on nine consecutive days d, d+1,… d+8 was at least 5 times higher than the average relative word frequency of all previous days.

Four different methods were chosen because preliminary experiments found no single method to be clearly successful, so it was necessary to evaluate a set of sensible choices. The threshold of 5 times the average relative word frequency was selected heuristically: other values were tried but did not give better results.
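The four definitions above differ only in window length, so a single function can cover them, as sketched below. This is an illustration under assumptions rather than the code used in the study: the name `find_bursts` and its parameters are hypothetical, the series is a list of daily relative word frequencies indexed from day 0, and a word with a zero average over all previous days is accepted whenever its window minimum is positive.

```python
def find_bursts(series, length, ratio=5.0, lead_in=80):
    """Return (day, window minimum) pairs where the minimum over `length`
    consecutive days is at least `ratio` times the average of all previous days.

    length = 1, 3, 5 and 9 correspond to spike, short, medium and long burst.
    """
    hits = []
    start = max(lead_in, 1)  # at least one previous day is needed for an average
    for d in range(start, len(series) - length + 1):
        baseline = sum(series[:d]) / d
        window_min = min(series[d:d + length])
        # A zero baseline with a positive window is treated as a burst (assumption).
        if window_min > 0 and window_min >= ratio * baseline:
            hits.append((d, window_min))
    return hits


# Invented series: ten quiet days, a three-day burst, then quiet again.
example_series = [0.01] * 10 + [0.2, 0.2, 0.2] + [0.01]
print(find_bursts(example_series, length=3, lead_in=5))
print([day for day, _ in find_bursts(example_series, length=1, lead_in=5)])
```

In the example, the third burst day is not reported as a spike because the two preceding burst days have already raised the running average, which shows how the baseline adapts as a discussion continues.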

A list of burst/spike words was identified by each of the above co-word selection methods and ordered by the size of the largest word frequency difference in the time series (i.e. r(d), r(d,3), r(d,5), or r(d,9)). Ordering by absolute rather than relative word frequency difference was chosen because many words had an effectively infinite relative increase, having first been used after the 80-day lead-in period of the corpus. Two types of analysis were performed.

1. For each of the four lists, the top 20 words were selected and investigated to find out why they had a high frequency. The purpose of this was to assess the time series algorithm to see whether it was working in the intended way. The checking was performed by identifying the RSS items containing the word.

2. For the first choice method, short bursts, the top 1000 terms were checked to see whether they directly referred to a science debate. Only nouns and unknown words were checked. The checking was again performed by identifying the RSS items containing the word.

Results

Spikes (1 day)

The top results were dominated by a few blog threads mainly from the alterslash.org site. In alterslash.org, contributors are allowed to post stories and others can then post comments on them, starting a thread. Active threads result in the text of early posts being highly replicated and reposted. Blogs like alterslash.org with multiple contributors also have the advantage of being more active. Table 1 is notable for the inclusion of relatively few nouns or content-loaded terms, for example “ago” and “didn’t” would give little clue as to the topic. Related to this point, it is clear that one prolific thread could generate many high frequency words from the early posts. One single RSS feed item for the shuttle story (mentioned in Table 1) is summarised below, with […] indicating sections cut out of this very long item. Each later posting to this thread included all previous postings.

Posted by Zonk (29% noise) View
Somegeek writes “SpaceDaily.com is running a story that NASA never performed a formal risk analysis of a shuttle mission to rescue the Hubble Space Telescope […]
Little to do with safety - by CaptDeuce (Score: 4, Interesting) Thread
… previous NASA administrator Sean O’Keefe made the decision “based on what he perceived was the risk”. This perceived risk is in performing a manned shuttle mission that is out of range of using the International Space Station as an emergency refuge. …Loose consensus at sci.space.tech is that O’Keefe’s decision has virtually nothing to do with safety and everything to do with the extremely tight schedule necessary to complete ISS (International Space Station). […]

http://slashdot.org/comments.pl?sid=05/03/05/2020210&threshold=1&commentsort=0&mode=nested&cid=11855573

http://slashdot.org/~CaptDeuce

http://www.spacedaily.com/news/hubble-05j.html


http://slashdot.org/article.pl?sid=05/03/05/2020210


Table 1. Top spike words in the science concern corpus.

| r(d) | Word | Description | Where |
|---|---|---|---|
| 0.376 | noise | Standard text in posts in prolific blogs: (percentage of noise in post) | alterslash.org |
| 0.372 | posted | Standard text in posts in prolific blogs: (“posted by…”) | alterslash.org |
| 0.372 | thread | Standard text in posts in prolific blogs | alterslash.org |
| 0.369 | score | Standard text in posts in prolific blogs | alterslash.org |
| 0.367 | writes | Standard text in posts in prolific blog | alterslash.org |
| 0.364 | provide | Coincidence: occurs in the original text of three separate topics spawning threads on the same day 24/4/05 | alterslash.org |
| 0.356 | view | Standard text in posts in prolific blogs: “click to view” | alterslash.org |
| 0.337 | cowboyneal | Name of blog poster: post was extensively replicated and commented | alterslash.org |
| 0.334 | months | Blog thread about a new technical magazine; Linux life expectancy thread | alterslash.org (mainly) |
| 0.333 | decided | Shuttle story blog thread "…decided to cancel the shuttle…" | alterslash.org |
| 0.332 | informative | Standard text in posts in prolific blog: (post rating) | alterslash.org |
| 0.328 | ago | Shuttle story blog thread | alterslash.org |
| 0.325 | instead | Shuttle story blog thread | alterslash.org |
| 0.325 | current | Shuttle story blog thread | alterslash.org |
| 0.317 | didn't | Coincidence: occurs in several blog threads on the same day (e.g., smoking, AIDS) | e.g. deanesmay.com |
| 0.315 | production | Coincidence: two blog threads used this word on the same day | alterslash.org |
| 0.31 | zonk | Name of blog poster: post was extensively replicated and commented | alterslash.org |
| 0.306 | paper | Nanotechnology Blog thread (“nanotechnology paper-like display”) | alterslash.org |
| 0.3 | able | Several blog threads | alterslash.org |

Fig. 1. Science concern co-word time series for “ago” and “current”, sharing a common main spike due to both occurring in the original post for a space shuttle story thread.

Short bursts (3 days)

The top results were dominated by standard text in blog threads from a few sites. Individual stories in alterslash.org did not feature since these tended to last for a maximum of one or two days. In livejournal.com, however, some feeds were very long and cumulative, reposting old content rather than just the new content. This reposting allowed the site to dominate the results for longer periods of time, despite being less active than alterslash.org. Coincidence is also evident in Table 2, with some words being present as a result of being used in multiple unrelated threads.

Table 2. Top three day burst words in the science concern corpus.

r(d,3) | Word | Description | Where
0.199 | posted | Standard text in posts in prolific blogs: (“posted by…”) | alterslash.org
0.175 | march | Month and protest march | Many
0.175 | noise | Standard text in posts in prolific blogs: (percentage of noise in post) | alterslash.org
0.162 | informative | Standard text in posts in prolific blogs: (“informative” post rating) | alterslash.org
0.155 | thread | Standard text in posts in prolific blogs | alterslash.org
0.152 | score | Standard text in posts in prolific blogs | alterslash.org
0.145 | funny | Standard text in posts in prolific blogs: (“funny” post rating) | alterslash.org
0.139 | insightful | Standard text in posts in prolific blogs: (“insightful” post rating) | alterslash.org
0.132 | couldn't | From burst of story posts | Many, e.g. www.livejournal.com/users/mpoetess/
0.116 | april | Month | Many
0.114 | they've | Blog thread (women leaving IT) | alterslash.org
0.111 | sounds | Blog threads (three separate threads, using different meanings of the word “sound”) | alterslash.org
0.104 | february | Date | Many
0.098 | successful | Blog threads (several separate ones) | alterslash.org
0.087 | windows | Multiple Microsoft threads plus stories with glass windows | alterslash.org plus others
0.087 | senate | Repost “threads” (political story) | twistedchick (livejournal.com) plus others
0.086 | allowing | Blog threads (several separate ones) | alterslash.org
0.085 | opposed | Blog threads | rozk and twistedchick (livejournal.com)
0.079 | hostile | Blog threads | rozk and twistedchick (livejournal.com)
0.075 | trouble | Blog threads | rozk and twistedchick (livejournal.com)

Fig. 2. A science concern co-word time series for “wikipedia”, with a burst at the end of February caused by 3 separate consecutive stories.

Figure 2 is a time series for the word wikipedia (rank 25), which was discussed in two unrelated threads (“fud-based encyclopaedias” followed by “interview with Lawrence Lessing”) over three consecutive days at the end of February. In any large data collection, some coincidences are to be expected as statistical phenomena.

Medium bursts (5 days)

The top medium bursts were dominated by alterslash.org, and by twistedchick and rozk from livejournal.com. Prolonged reposting in livejournal.com caused old stories to reappear continually and produced very large RSS items.

Long bursts (9 days)

The top long bursts were dominated by twistedchick and rozk from livejournal.com. The one exception in the top 20 was a prolonged thread, “steps to a quieter PC”, from alterslash.org, with many contributors offering opinions. Even for the longer bursts, nouns were not ubiquitous. For example, the top 10 terms were: unhappy; william; deputy; hostile; remarks; america's; exam; remark; reactionary; irresponsible.

Results scanning

Despite the dominance of useless words at the top of all the word lists, it was possible to scan the word lists manually and test any word that seemed promising as an indicator of a genuine debate. Applying this method to the top 1,000 terms in the 3-day burst list produced only two genuine topic-indicating words: schiavo (rank 164) and ozone (rank 353). Figure 3 illustrates the schiavo topic through a time series in the science concern data set (73 matches with few repeated thread posts) and Figure 4 gives the equivalent series for the full data set (6490 matches). The term “schiavo” refers to the case of Terri Schiavo, which spawned a genuine public debate. The first science concern item illustrates the political side of the case and why it became significant: reading this in conjunction with Figure 4 provides illuminating insights into this debate.

The Terri Schiavo case has transfixed the right wing media while attracting comparatively little attention from the left. […] Michael Schiavo is Terri's legal guardian, a court has found repeatedly that Terri wouldn't want a feeding tube, and Michael asked the doctors to take the tube out. That's really all there is to it. The Terri Schiavo appeal is a vicious and well-funded propaganda campaign. Terri's parents and their allies are using pseudoscience and character assassination to destroy Michael Schiavo. The right wing is eating it up. If progressives don't counter these blatant misrepresentations now, the Terri Schiavo myths will be used against us for years to come. (http://www.pandagon.net/mtarchives/004689.html)

Note that although this item did not use any of the listed science words, it was still selected as a science story because of its science meta-tag <dc:subject>Science</dc:subject> contained within the feed item.
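The selection rule described above can be sketched as follows. This is a minimal illustration rather than the software used in the study: it assumes that an item counts as a science story if either a science word appears in its title or description, or its Dublin Core subject tag is “Science”. The word list SCIENCE_WORDS, the function names and the sample item are hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, which defines the dc:subject element.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical word list for illustration only; the study's list is not reproduced here.
SCIENCE_WORDS = {"science", "scientist", "scientific"}


def has_science_subject(item: ET.Element) -> bool:
    """Return True if any dc:subject of the item is 'science' (case-insensitive)."""
    for subject in item.findall(f"{{{DC_NS}}}subject"):
        if (subject.text or "").strip().lower() == "science":
            return True
    return False


def mentions_science_word(item: ET.Element) -> bool:
    """Return True if the title or description contains a listed science word."""
    text = " ".join((item.findtext(tag) or "") for tag in ("title", "description"))
    words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    return bool(words & SCIENCE_WORDS)


def is_science_item(item: ET.Element) -> bool:
    """Assumed rule: either signal is enough to select the item."""
    return has_science_subject(item) or mentions_science_word(item)


# Made-up feed item: no science words in the text, but a science meta-tag.
SAMPLE_ITEM = """
<item xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>The Terri Schiavo case</title>
  <description>A court has found repeatedly that Terri would not want a feeding tube.</description>
  <dc:subject>Science</dc:subject>
</item>
""".strip()

item = ET.fromstring(SAMPLE_ITEM)
print(mentions_science_word(item), has_science_subject(item), is_science_item(item))
```

In this sketch the sample item is selected through its meta-tag alone, mirroring the situation described for the Schiavo posting.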

Figure 3. A science concern co-word time series for “schiavo” (3-day burst data).

Figure 4. A time series for the term “schiavo” in the full data set.

The comparison of Figures 3 and 4 is interesting for several reasons. First, only about a hundredth of the postings (73 of 6490) were identified as science concern related. This story seems to be primarily medical and political, rather than scientific, in the sense that the discussion concerns a routine decision that does not relate to new technologies. Second, the evolution of the story is better seen from the larger data set: the time series is smoother as a result of the greater amount of data. Nevertheless, the Schiavo case could not have been easily identified from the full data set because it would have been surrounded by many non-science topics, and so would have ranked much lower in a list of general topic bursts. It should be noted that this is an ideal case for word frequency analysis because the term schiavo apparently occurred only in connection with the debate.

Discussion

The results are dominated by words that are not good indicators of science concern debates. Two genuine debates were identified by investigating the top 1000 words in the short burst list, but it seems likely that with data cleansing the results would be significantly better. The main cause of non-useful terms was a small set of blogs with highly repetitive item generation policies, either through active threads or through reposting of old stories. In consequence, some form of data cleansing now seems unavoidable for broad issue scanning, despite the inevitable loss of some data and the extra computing power required.

The method used, employing Boolean expressions to identify postings that are potentially relevant to a broad issue, necessarily shrinks the effective RSS corpus size, affecting data cleansing. This is clear from a comparison of the results with a similar exercise identifying science-relevant items (but not necessarily concerns). In that exercise, the top words for bursts contained more nouns and terms indicative of content, although still mixed with more general terms (resort; untitled; snes; dose; christine; apr; heck; creatures; reasonable; grand; mp3; mood; hunger; insomnia; lowered; intending; handle; yard; resolve; serotonin). In fact there was a distinct medical and technological flavour to the list of top terms. It seems that with a bigger effective corpus size, individual feeds are less likely to dominate the results: data cleansing is most important for smaller corpora.

There are several logical alternative options for data cleansing, some of which are described below. Further experiments are needed to assess the effectiveness of each one.

Excluding spam feeds. Some RSS feeds are spam in the sense of being automatically generated advertising (e.g. thousands of items starting with “have you considered buying…”) and could be eliminated to improve the overall quality of the data set and to speed up the process of data analysis. Spam elimination is common in Internet applications, including search engines and email (Stitt, 2004), and hence it is a logical choice.

Including only specified fields. It would be possible to process only fields thought to contain high quality metadata, such as the title field and perhaps also a description field. This would have the advantage of giving more concise data but the disadvantage that key words (e.g. science, fear) could be omitted from key fields even though they would be relevant to the post, hence reducing the overall number of broad issue-relevant items identified, which is itself a problem.

Limiting the number of words per feed item (e.g., the first or last 100 words). This would stop the domination of very large feeds but, as above, would result in fewer broad issue-relevant items being identified.

Automatic identification of threads in feeds, and removal of previous postings in thread feeds. This is a brute force approach and would be a non-trivial computer science exercise to implement since there will be many different formats of RSS feeds with similar problems. Moreover, any algorithm would probably have to be updated as new feed formats and threading applications emerged.

Counting word frequencies by feed (per day) rather than by feed item. This is an attractive option because it would stop any word having a frequency higher than 1 per day based upon its reappearance in threads in different items but the same RSS feed. This technique would be relatively easy to implement and could be applied to all blog feeds, avoiding the need for continual software maintenance as new problematic feeds are identified. It is probably a second-best option from a data quality perspective because, ideally, it would be desirable to capture interest in a topic expressed in multiple postings within a thread, and this would stop that from being possible. This is a variant of the alternative document models previously used for link analysis (Björneborn, 2001; Henzinger, 2001; Thelwall, 2002).
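The last option, together with the word limit option, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: item-level counting is assumed to count the number of items containing a word on a given day, and feed-level counting the number of distinct feeds containing it. The FeedItem type, its field names, the tokenizer and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Iterable, NamedTuple, Optional


class FeedItem(NamedTuple):
    # Hypothetical record layout, not taken from the paper.
    feed_url: str
    day: str  # e.g. "2005-03-05"
    text: str


def tokenize(text: str, max_words: Optional[int] = None) -> list:
    """Lower-case words with surrounding punctuation stripped, optionally truncated."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    return words if max_words is None else words[:max_words]


def counts_by_item(items: Iterable[FeedItem], max_words: Optional[int] = None) -> dict:
    """Map each day to a Counter of how many items contain each word."""
    daily = defaultdict(Counter)
    for item in items:
        daily[item.day].update(set(tokenize(item.text, max_words)))
    return dict(daily)


def counts_by_feed(items: Iterable[FeedItem], max_words: Optional[int] = None) -> dict:
    """Map each day to a Counter of how many distinct feeds contain each word."""
    seen = defaultdict(set)  # (day, feed_url) -> words seen in that feed on that day
    for item in items:
        seen[(item.day, item.feed_url)].update(tokenize(item.text, max_words))
    daily = defaultdict(Counter)
    for (day, _feed), words in seen.items():
        daily[day].update(words)  # each feed contributes at most 1 per word per day
    return dict(daily)


# Made-up data: one prolific thread feed and one ordinary feed.
items = [
    FeedItem("feedA", "2005-03-05", "Posted by zonk: shuttle launch cancelled"),
    FeedItem("feedA", "2005-03-05", "Posted by cowboyneal: re shuttle launch"),
    FeedItem("feedA", "2005-03-05", "Posted by zonk: score 5 informative"),
    FeedItem("feedB", "2005-03-05", "Ozone hole posted record size"),
]

per_item = counts_by_item(items)["2005-03-05"]
per_feed = counts_by_feed(items)["2005-03-05"]
print(per_item["posted"], per_feed["posted"])
```

With the sample data, the standard thread text “posted” is counted once per item under item-level counting but only once per feed under feed-level counting, which is the effect this option aims for. Passing max_words applies the word limit option before counting.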

Limitations

As with any case study research, there are many limitations on the extent to which the findings can reliably be generalised. The science concerns case study was able to reveal problems that seem likely to occur for any other broad issue, however, although the extent of the data cleansing problem will probably vary by exact choice of broad issue. In particular the extent of discussion of the broad issue is critical: for larger broad issues, it would be more difficult for individual feeds to dominate the results. The following additional limitations are acknowledged.

The RSS feed corpus itself was chosen in an ad-hoc manner and, given the influence of a few individual blogs, a different corpus might not have had any anomalous blog feeds and could have given better results.

Only a limited range of options has been explored and many settings have been determined heuristically. This is common practice in complex computing systems (Gruhl et al., 2004) and unavoidable in a project with many possible variations but is still a limitation.

The modelling assumption that a step change in discussion level will signal an emerging debate rather than a gradual increase in debate has not been verified with real data.

The methodology does not address the issue of recall: the proportion of real debates that were identified. It is possible that debates were missed because they did not cause a step change or occurred around words that were already relatively frequent (e.g., Microsoft). This is partly unavoidable, since if there were a definitive list of public mini-debates on science then the system would not be needed.

Conclusions

This paper sought to assess whether a purist approach to RSS feeds (i.e. using the raw feeds without data cleansing) is suitable for broad issue scanning, using a co-word frequency time series approach. The domination of the results by non-useful terms in the science concern case study showed that data cleansing is necessary for efficient broad issue scanning. Raw RSS feeds are unsuitable because some feeds carry extensive and repetitive content. This is a particular concern for small broad issues that do not attract a large amount of discussion. Whilst commercial companies may be able to reduce necessary data cleansing by maintaining a very large collection of RSS feeds, smaller-scale applications do not have this option. The use of data cleansing techniques should allow future researchers to identify emergent debates with less effort, by producing tables of bursty co-words with a higher proportion of genuine topics. The Terri Schiavo case showed that useful information is available in RSS feeds, once topics have been identified, and also that our broad issue scanning method was only able to identify a fraction of postings on this topic: hence the full set of feeds should be used to investigate topics, once identified. Nevertheless, it is unlikely that a perfect system can be developed that would automatically identify emergent debates relevant to any given broad issue because of coincidences and topic ambiguity. Hence the goal of creating lists of keywords potentially indicating new debates for subsequent manual filtering seems realistic. Finally, it would be interesting to assess the extent to which natural language processing, thesaurus and ontology techniques can be employed to improve results, and whether this would be worth the performance degradation that they would probably introduce.

Acknowledgements

The work was supported by a European Union grant for activity code NEST-2003-Path-1. It is part of the CREEN project (Critical Events in Evolving Networks, contract 012684). We thank the reviewers for their helpful comments.

References

Adar, E., Zhang, L., Adamic, L., & Lukose, R. (2004). Implicit structure and the dynamics of blogspace. Workshop on the Weblogging Ecosystem at the 13th International World Wide Web Conference, http://www.sims.berkeley.edu/~dmb/blogging.html.

Barabási, A. L. (2002). Linked: The new science of networks. Cambridge, Massachusetts: Perseus Publishing.

Bar-Ilan, J. (1997). The 'mad cow disease', Usenet newsgroups and bibliometric laws. Scientometrics, 39(1), 29-55.

Bar-Ilan, J., & Peritz, B. C. (2004). Evolution, continuity, and disappearance of documents on a specific topic on the Web: A longitudinal study of 'informetrics'. Journal of the American Society for Information Science and Technology, 55(11), 980-990.

BBC. (2005). Microsoft makes web feeds easier. http://news.bbc.co.uk/1/hi/technology/4621223.stm.

Björneborn, L. (2001). Necessary data filtering and editing in webometric link structure analysis: Royal School of Library and Information Science.

Blood, R. (2004). How blogging software reshapes the online community. Communications of the ACM, 47(12), 53-55.

Borgman, C. L., & Furner, J. (2002). Scholarly communication and bibliometrics. Annual Review of Information Science and Technology, 36, 3-72.

Burnett, R., & Marshall, P. (2002). Web theory: An introduction. London: Routledge.

Cairns, G., Wright, G., Bradfield, R., van der Heijden, K., & Burt, G. (2004). Exploring e-government futures through the application of scenario planning. Technological Forecasting and Social Change, 71(3), 217-238.

Caldas, A. (2003). Are newsgroups extending 'invisible colleges' into the digital infrastructure of science? Economics of Innovation and New Technology, 12(1), 43-60.

Clifton, C., Cooley, R., & Rennie, J. (2004). TopCat: Data mining for topic identification in a text corpus. IEEE Transactions On Knowledge And Data Engineering, 16(8), 949-964.

Corbett, J. B., & Durfee, J. L. (2004). Testing public (un)certainty of science. Science Communication, 26(2), 129-151.

Cothey, V. (2004). Web-crawling reliability. Journal of the American Society for Information Science and Technology, 55(14), 1228-1238.

Cronin, B. (1984). The citation process: The role and significance of citations in scientific communication. London: Taylor Graham.

Gill, K. E. (2004). How can we measure the influence of the blogosphere? Paper presented at the WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics.

Glance, N. S., Hurst, M., & Tomokiyo, T. (2004). BlogPulse: Automated trend discovery for weblogs. Paper presented at the WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics.

Glänzel, W. (2001). National characteristics in international scientific co-authorship relations. Scientometrics, 51(1), 69-115.

Gruhl, D., Guha, R., Liben-Nowell, D., & Tomkins, A. (2004). Information diffusion through Blogspace. Paper presented at the WWW2004, New York, http://www.www2004.org/proceedings/docs/1p491.pdf.

Hagendijk, R. (2004). Framing GM food: Public participation and liberal democracy. EASST Review, 23(1), 3-7.

Hammersley, B. (2005). Developing feeds with RSS and Atom. Sebastopol, CA: O'Reilly.

Hammond, T., Hannay, T., & Lund, B. (2004). The role of RSS in science publishing: Syndication and annotation on the web. Dlib, 12, http://www.dlib.org/dlib/december04/hammond/12hammond.html.

Hellsten, I. (2003). Focus on metaphors: The case of "Frankenfood" on the web. Journal of Computer Mediated Communication, 8(4), http://www.ascusc.org/jcmc/vol8/issue4/hellsten.html.

Hellsten, I., & Leydesdorff, L. (2004). Measuring the meaning of words in contexts: An automated analysis of controversies about 'Monarch butterflies,' 'Frankenfoods,' and 'stem cells.' Paper presented at the Sixth Intern. Conf. on Social Science Methodology (RC33), Amsterdam, 17-20 August http://users.fmg.uva.nl/lleydesdorff/meaning/measuring%20meaning.pdf.

Hellsten, I., Leydesdorff, L., & Wouters, P. (2006, to appear). Multiple presents: How search engines re-write the past. New Media & Society. Retrieved 12 September 2005 from: http://users.fmg.uva.nl/lleydesdorff/searcheng/

Henzinger, M. R. (2001). Hyperlink analysis for the Web. IEEE Internet Computing, 5(1), 45-50.

Hernandez, M. A., & Stolfo, S. J. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1), 9-37.

Hine, C. (2000). Virtual Ethnography. London: Sage.

Huberman, B. A. (2001). The Laws of the Web: Patterns in the Ecology of Information. Cambridge, MA: The MIT Press.

Huffaker, D. A., & Calvert, S. L. (2005). Gender, identity, and language use in teenage blogs. Journal of Computer-Mediated Communication, 10(2), http://jcmc.indiana.edu/vol10/issue12/huffaker.html.

Karger, D. R., & Quan, D. (2004). What would it mean to blog on the Semantic Web? Lecture Notes in Computer Science, 3298, 214-228.

Kim, W., Choi, B. J., Hong, E. K., Kim, S. K., & Lee, D. (2003). A taxonomy of dirty data. Data Mining and Knowledge Discovery, 7(1), 81-99.

Klintman, M. (2002). The genetically modified (GM) food labelling controversy: Ideological and epistemic crossovers. Social Studies of Science, 32(1), 71-91.

Koehler, W. (2004). A longitudinal study of Web pages continued: a report after six years. Information Research, 9(2), 174.

Koku, E., Nazer, N., & Wellman, B. (2001). Netting scholars: Online and offline. American Behavioral Scientist, 44(10), 1752-1774.

Kumar, R., Novak, J., Raghavan, P., & Tomkins, A. (2003). On the bursty evolution of blogspace. Paper presented at the WWW2003, Budapest, Hungary, http://www2003.org/cdrom/papers/refereed/p477/p477-kumar/p477-kumar.htm.

Kutz, D., & Herring, S. C. (2005). Micro-longitudinal analysis of Web news updates. Proceedings of the Thirty-Eighth Hawai'i International Conference on System Sciences (HICSS-38), http://ella.slis.indiana.edu/~herring/news.pdf.

Lancaster, F. W., & Lee, J. L. (1985). Bibliometric techniques applied to issues management - a case-study. Journal of the American Society for Information Science, 36(6), 389-397.

Leydesdorff, L. (1989). Words and co-words as indicators of intellectual organization. Research Policy, 18, 209-223.

Leydesdorff, L., & Curran, M. (2000). Mapping university-industry-government relations on the Internet: the construction of indicators for a knowledge-based economy. Cybermetrics, 4, http://www.cindoc.csic.es/cybermetrics/articles/v4i1p2.html.

Leydesdorff, L., & Etzkowitz, H. (2003). Can “The Public” be considered as a fourth helix in University-Industry-Government relations? Report of the fourth triple helix conference. Science and Public Policy, 30(1), 55-61.

Leydesdorff, L., & Hellsten, I. (2005, to appear). Metaphors and diaphors in science communication: Mapping the case of ‘stem-cell research’. Science Communication, http://www.leydesdorff.net/stemcells.pdf.

Li, Y. F., Zhang, C. Q., & Zhang, S. C. (2003). Cooperative strategy for Web data mining and cleaning. Applied Artificial Intelligence, 17(5-6), 443-460.

Matheson, D. (2004). Weblogs and the epistemology of the news: Some trends in online journalism. New Media & Society, 6(4), 443-468.

Meibauer, J., Guttropf, A., & Scherer, C. (2004). Dynamic aspects of German -er-nominals: a probe into the interrelation of language change and language acquisition. Linguistics, 42(1), 155-193.

Nardi, B. A., Schiano, D. J., Gumbrecht, M., & Swartz, L. (2004). Why we blog. Communications of the ACM, 47(12), 41-46.

Notess, G. R. (2002). RSS, aggregators, and reading the blog fantastic. Online, 26(6), 52-54.

Ozmutlu, S., & Cavdur, F. (2005). Neural network applications for automatic new topic identification. Online Information Review, 29(1), 34-53.

Pikas, C. K. (2005). Blog searching for competitive intelligence, brand image, and reputation management. Online, 29(4), 16-21.

Price, D. J. de Solla (1963). Little science, big science. New York: Columbia University Press.

Pyle, D. (1999). Data preparation for data mining. San Francisco, CA: Morgan Kaufmann.

Rousseau, R. (1999). Daily time series of common single word searches in AltaVista and NorthernLight. Cybermetrics, 2/3, http://www.cindoc.csic.es/cybermetrics/articles/v2i1p2.html.

Schaap, F. (2004). Multimodal interactions and singular selves: Dutch weblogs and home pages in the context of everyday life. Paper presented at the AoIR 5.0, Brighton, UK.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.

Shahri, H. H., & Barforush, A. A. (2004). A flexible fuzzy expert system for fuzzy duplicate elimination in data cleaning. Lecture Notes in Computer Science, 3180, 161-170.

Small, H. (1999). Visualising science through citation mapping. Journal of American Society for Information Science, 50(9), 799-813.

Smith, S. (2005). Tapping the feed: In search of an RSS money trail. Econtent, 28(3), 30-34.

Spink, A., Wolfram, D., Jansen, B. J., & Saracevic, T. (2001). Searching the web: The public and their queries. Journal of American Society for Information Science, 53(2), 226-234.

Stitt, R. (2004). Curbing the Spam problem. IEEE Computer, 37(12), 8.

Stokes, D. E. (1997). Pasteur's quadrant: Basic science and technological innovation. Washington, D.C.: Brookings Institution.

Sunstein, C. R. (2004). Democracy and filtering. Communications of the ACM, 47(12), 57-59.

Thelwall, M. (2002). Conceptualizing documentation on the Web: An evaluation of different heuristic-based models for counting links between university web sites. Journal of American Society for Information Science and Technology, 53(12), 995-1005.

Thelwall, M. (2004). Link analysis: An information science approach. San Diego: Academic Press.

Thelwall, M., Vann, K., & Fairclough, R. (2006, to appear). Web issue analysis: An Integrated Water Resource Management case study. Journal of American Society for Information Science and Technology.

Vander Beken, T. (2004). Risky business: A risk-based methodology to measure organized crime. Crime, Law and Social Change, 41(5), 471-516.

Wei, C. P., & Lee, Y. H. (2004). Event detection from online news documents for supporting environmental scanning. Decision Support Systems, 36(4), 385-401.

Wheeldon, R., & Levene, M. (2003). The best trail algorithm for assisted navigation of Web sites. Paper presented at the 1st Latin American Web Congress (LA-WEB 2003), Santiago, Chile.

White, H. D., & Griffith, B. C. (1982). Author co-citation: a literature measure of intellectual structure. Journal of American Society for Information Science, 32(3), 163-172.

Wormell, I. (2000). Critical aspects of the Danish Welfare State - as revealed by issue tracking. Scientometrics, 48(2), 237-250.

Wouters, P. (2000). Cyberscience: The informational turn in science. Paper presented at the Lecture at the Free University, Amsterdam.

Wouters, P., Hellsten, I., & Leydesdorff, L. (2004). Internet time and the reliability of search engines. First Monday, 9(10). Retrieved 12 September 2005 from: http://firstmonday.org/issues/issue9_10/wouters/index.html

Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the 14th international conference on machine learning (ICML 1997) (pp. 412-420). Nashville, TN.
