FeedMe - a semantic RSS aggregator
Nikola Ljubešić, Damir Boras, Mislav Cimperšak, Marija Tkalec
Faculty of Humanities and Social Sciences, University of Zagreb
8 June 2010
Overview
1. The basic idea
2. Our system
3. Statistical analysis of collected data
4. Usage examples
1. The basic idea
Aggregating news
• collecting news from different information sources and publishing them as a single source
• manual or automated
• automated aggregation repeats information, so the collected news has to be analysed and organized
Existing aggregators
• Google News
• EMM NewsExplorer
• MondoPress
RSS
• RSS (Really Simple Syndication) - family of web feed formats used to publish frequently updated works
• XML file - readable by humans and machines
• RSS is structured, while (X)HTML pages mostly still are not, so data is easier to harvest through RSS
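To illustrate why a structured feed is easy to harvest, here is a minimal sketch that reads the items of an RSS 2.0 document with Python's standard library. The sample feed, its URLs and the function name `parse_rss_items` are hypothetical and only serve the illustration; this is not the FeedMe collector itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document, used only for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example portal</title>
    <link>http://portal.example/</link>
    <description>Hypothetical feed</description>
    <item>
      <title>First story</title>
      <link>http://portal.example/1</link>
      <description>Short summary of the first story.</description>
      <pubDate>Mon, 10 May 2010 08:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>http://portal.example/2</link>
      <description>Short summary of the second story.</description>
      <pubDate>Mon, 10 May 2010 08:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

FIELDS = ("title", "link", "description", "pubDate")


def parse_rss_items(xml_text):
    """Return one dict per <item>, with the standard RSS 2.0 item fields."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        # findtext returns the default when a field is missing from the item.
        items.append({field: item.findtext(field, default="") for field in FIELDS})
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_RSS):
        print(entry["pubDate"], entry["title"], entry["link"])
```

Every field sits in its own named element, so no page-specific scraping rules are needed.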
Google Reader
• on-line RSS aggregator
• problems
  • loss of information
  • repeating information
  • unwanted information
Our idea
• collect RSS server-side - no loss of entries
• cluster RSS entries by their content - composite entries, no duplicates
• enable users to filter information - “affirm” or “negate” specific feeds
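The slides do not say which clustering method is used, so the following is only a sketch of the general idea of grouping entries by content. It assumes a greedy single-pass scheme with Jaccard similarity over title words and a threshold of 0.5; the function names, the `title` and `feed` keys and the threshold are all assumptions made for illustration.

```python
import re


def tokens(text):
    """Lower-cased set of word tokens in a text."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_entries(entries, threshold=0.5):
    """Greedily assign each entry to the most similar existing cluster.

    Assumed scheme, not the method from the slides: an entry joins the
    best-matching cluster if the similarity reaches the threshold,
    otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"tokens": set, "entries": list}
    for entry in entries:
        entry_tokens = tokens(entry["title"])
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(entry_tokens, cluster["tokens"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["entries"].append(entry)
            best["tokens"] |= entry_tokens
        else:
            clusters.append({"tokens": set(entry_tokens), "entries": [entry]})
    return [cluster["entries"] for cluster in clusters]
```

Each returned cluster plays the role of one composite entry: near-duplicate reports of the same story end up together instead of appearing several times.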
Filtering
• minimum size - publish only clustered entries that contain n or more original feed entries
• “affirm” feeds - publish only clustered entries that contain at least one original entry from every affirmed feed
• “negate” feeds - do not publish clustered entries that contain an original entry from any negated feed
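The three filtering rules above can be sketched as a single predicate. The sketch assumes a cluster is a list of original entries, each carrying a `feed` identifier, and it reads the affirm rule as "every affirmed feed must be present"; the function and parameter names are hypothetical.

```python
def keep_cluster(cluster, min_size=1, affirmed=(), negated=()):
    """Decide whether a clustered entry should be published.

    cluster  -- list of original entries, each a dict with a "feed" key
    min_size -- minimum number of original entries in the cluster
    affirmed -- feeds that must all be represented in the cluster
    negated  -- feeds that must not be represented in the cluster
    """
    feeds = {entry["feed"] for entry in cluster}
    if len(cluster) < min_size:
        return False
    if not set(affirmed) <= feeds:  # some affirmed feed is missing
        return False
    if feeds & set(negated):  # some negated feed is present
        return False
    return True


def filter_clusters(clusters, **rules):
    """Keep only the clusters that satisfy all filtering rules."""
    return [cluster for cluster in clusters if keep_cluster(cluster, **rules)]
```

For example, `filter_clusters(clusters, min_size=3, negated=["feed-x"])` would keep stories reported at least three times and drop any story that `feed-x` took part in.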
2. Our system
FeedMe
• back-end - collects RSS entries every half hour and organizes them into clusters
• front-end - web application for
  • creating groups of feeds (filtering - minimum number of elements, affirming, negating)
  • browsing the compiled groups
  • publishing groups as new RSS feeds
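As an illustration of the last front-end function, publishing a group as a new RSS feed, here is a minimal sketch that serializes clusters into an RSS 2.0 document with the standard library. The function name `build_rss`, the `title` and `link` entry keys and the choice to represent each cluster by its first entry are assumptions, not details taken from FeedMe.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, clusters):
    """Serialize clusters as an RSS 2.0 document (returned as a string).

    Each cluster is a non-empty list of entries with "title" and "link"
    keys. As an assumption of this sketch, the first entry represents the
    cluster and the links of all its entries go into the description.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for cluster in clusters:
        first = cluster[0]
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = first["title"]
        ET.SubElement(item, "link").text = first["link"]
        sources = ", ".join(entry["link"] for entry in cluster)
        ET.SubElement(item, "description").text = "Sources: " + sources
    return ET.tostring(rss, encoding="unicode")
```

The output can be read by any RSS client, so a filtered group behaves like an ordinary feed.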
3. Statistical analysis of collected data
The collected data
• 388 RSS feeds
• 38 different portals
• collected since 2010-05-10
• more than 100,000 entries
• approx. 30,000 clusters
Distribution of documents by cluster size
[Figure: chart with axis ticks 1 to 16 and 0 to 0.80]
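A distribution of this kind can be computed from the list of cluster sizes alone. The sketch below assumes the chart shows, for each cluster size, the share of all documents that fall into clusters of that size; that reading and the function name are assumptions.

```python
from collections import Counter


def document_share_by_cluster_size(cluster_sizes):
    """Map each cluster size to the share of documents in clusters of that size."""
    total_documents = sum(cluster_sizes)
    if total_documents == 0:
        return {}
    documents_per_size = Counter()
    for size in cluster_sizes:
        # A cluster of a given size contributes that many documents.
        documents_per_size[size] += size
    return {
        size: documents / total_documents
        for size, documents in sorted(documents_per_size.items())
    }
```

For the sizes `[1, 1, 1, 2, 3]` there are 8 documents in total: 3 of them in singleton clusters, 2 in the cluster of size two and 3 in the cluster of size three.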
Portals publishing on “large” events (>2)
[Figure: horizontal bar chart, axis 0 to 80. Portals listed: Net.hr, Monitor.hr, Tportal.hr, Index.hr, Dnevnik.hr, Nacional.hr, Jutarnji.hr, HRT.hr, 24sata.hr, Vecernji.hr, SlobodnaDalmacija.hr, RTL.hr. Bar values shown: 16, 19, 24, 27, 30, 45, 49, 54, 64, 66, 68, 77.]
Portals publishing new stories first
[Figure: horizontal bar chart, axis 0 to 200. Portals listed: Index.hr, Net.hr, Monitor.hr, Dnevnik.hr, Nacional.hr, Tportal.hr, Jutarnji.hr, Vecernji.hr, HRT.hr, SlobodnaDalmacija.hr, 24sata.hr, RTL.hr. Bar values shown: 31, 50, 51, 59, 62, 121, 122, 131, 143, 151, 161, 195.]
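A count of this kind can be derived from the clusters by finding the earliest entry in each one. The sketch assumes each entry carries a `portal` name and a `published` timestamp and that only clusters with at least two entries are counted; these names and the size cut-off are assumptions.

```python
from collections import Counter
from datetime import datetime


def first_publisher_counts(clusters, min_size=2):
    """Count, per portal, the clusters in which it published the earliest entry."""
    counts = Counter()
    for cluster in clusters:
        if len(cluster) < min_size:
            continue  # a story reported only once has no "first" among several
        earliest = min(cluster, key=lambda entry: entry["published"])
        counts[earliest["portal"]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical data: portal B reports the story half an hour before A.
    example = [[
        {"portal": "A", "published": datetime(2010, 5, 11, 7, 30)},
        {"portal": "B", "published": datetime(2010, 5, 11, 7, 0)},
    ]]
    print(first_publisher_counts(example))
```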
Portals publishing new stories first (normalized by portal size)
[Figure: horizontal bar chart, axis 0 to 0.39. Portals listed: Tportal.hr, Jutarnji.hr, Net.hr, HRT.hr, Vecernji.hr, Nacional.hr, Dnevnik.hr, Monitor.hr, RTL.hr, Index.hr, 24sata.hr, SlobodnaDalmacija.hr. Bar values shown: 0.31, 0.31, 0.31, 0.32, 0.32, 0.32, 0.32, 0.34, 0.35, 0.38, 0.38, 0.39.]
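The slides do not define "portal size". One plausible reading, assumed in this sketch, is the number of clusters a portal takes part in, so the normalized value is the fraction of its stories that it published first. The function and argument names are hypothetical.

```python
def normalized_first_rate(first_counts, portal_sizes):
    """Divide each portal's first-publication count by its size.

    first_counts -- portal -> number of clusters it published first
    portal_sizes -- portal -> assumed size measure (here: number of
                    clusters the portal appears in); zero sizes are skipped
    """
    return {
        portal: first_counts.get(portal, 0) / size
        for portal, size in portal_sizes.items()
        if size > 0
    }
```

With this normalization a small portal that is often first is no longer outranked by a large portal simply because the large one publishes more.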
Plagiarism?
[Figure: horizontal bar chart, axis 0 to 0.30. Portals listed: Tportal.hr, Dnevnik.hr, Nacional.hr, Net.hr, Jutarnji.hr, Index.hr, Monitor.hr, SlobodnaDalmacija.hr, HRT.hr. Bar values shown: 0.01, 0.01, 0.01, 0.01, 0.02, 0.03, 0.06, 0.09, 0.24.]
4. Usage examples
Filtering by minimum number of elements
Filtering by affirming feeds
Filtering by negating feeds
Future steps
• user-defined RSS sources
• full-text news portals
• different sources - social networks
• topic tracking
• named entity identification
• sentiment analysis and mining
Thank you! Questions?