Readability and Linguistic Subjectivity of News
Ilias Flaounas
University of Bristol
February 22, 2011
I. Flaounas (University of Bristol) February 22, 2011 1 / 21
Our research area
Traditional Research on Media
few outlets per study (< 10)
limited numbers of news-items (a few hundred in the best cases)
restricted time periods (a few days)
news-items from a single country’s media
manual annotation (‘coding’)
commercial databases – and their constraints
hypothesis driven research
Research Focus
In our research we undertake a large-scale mainstream news-media textual content analysis using automated techniques.
‘Mainstream news-media’ since we do not focus on modern online-only means of spreading news, such as blogs or Twitter.
‘Textual’ since we use only the textual information of news rather than analysing e.g. images, videos, or speech.
‘Large-scale’ since we analyse hundreds of outlets, typically for extended periods of time, involving millions of news items.
‘Automated’ in the sense that the analysis is performed by applying Artificial Intelligence techniques rather than using human ‘coders’.
In our research, data management is also a challenge!
NOAM: News Outlets Analysis & Monitoring system
I. Flaounas, O. Ali, M. Turchi, T. Snowsill, F. Nicart, T. De Bie, N. Cristianini: “NOAM: News Outlets Analysis and Monitoring System”, SIGMOD, accepted for publication, 2011.
Current Status
Our corpus in numbers:
> 1300 multilingual news sources
> 3000 news feeds
133 countries
22 languages
> 3 years of continuous monitoring
40K news items per day
> 30M news items in total
Support Vector Machines as Topic Taggers
We trained 15 topic taggers on 5 years of data from:
◮ Reuters
◮ NY Times
Typical text preprocessing: Stemming, stop-words removal, TF-IDF
Two-class SVMs
Cosine similarity
Empirical tuning of C parameter per tagger
We set the decision threshold to maximise the F0.5-Score on the test set
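The threshold selection step can be illustrated with a minimal sketch. It assumes each tagger outputs a real-valued decision score per article and that gold labels (1 = on topic, 0 = off topic) are available. The function names and the toy scores are hypothetical and are not taken from the NOAM system.

```python
from typing import Sequence, Tuple


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(scores: Sequence[float], labels: Sequence[int],
                   beta: float = 0.5) -> Tuple[float, float]:
    """Return (threshold, F-beta) maximising F-beta, where an item is
    predicted positive when its score is >= threshold."""
    best = (float("inf"), 0.0)
    # Only the observed scores need to be tried as candidate thresholds.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = f_beta(precision, recall, beta)
        if score > best[1]:
            best = (t, score)
    return best


# Toy decision scores and labels (made up for illustration).
toy_scores = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.5]
toy_labels = [0, 0, 1, 0, 1, 1]
threshold, score = best_threshold(toy_scores, toy_labels)
print(f"threshold={threshold}, F0.5={score:.4f}")
```

On the toy data the chosen threshold gives up one true positive to remove the last false positive. This is the expected behaviour of F0.5, which rewards precision over recall.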
SVM Taggers
Trained on Reuters & NY Times corpora
Topic             F0.5-Score  F0.5 Std.Dev.  Precision  Recall
CRIME                  78.92           1.51      82.93   66.59
DISASTERS              83.4            3.7       87.69   70.34
ELECTIONS              70.32           8.74      78.99   49.32
FASHION                83.88          18.61      94.61   71.27
INFLATION-PRICES       77.01           3.19      81.45   63.38
MARKETS                92.02           0.32      94.09   84.63
PETROLEUM              70.67           2.78      75.14   58.73
SCIENCE                73.63           5.17      83.72   50.62
SPORTS                 97.78           0.5       98.31   95.75
WEATHER                71.43           3.68      82.91   46.84
ART                    81.67           1.34      84.9    71.38
BUSINESS               81.16           1.19      86.23   65.87
ENVIRONMENT            64.29           4.26      73.48   43.7
POLITICS               73.81           2.29      76.65   64.81
RELIGION               74.95           4.21      83.57   53.59
The experiment
The goal is to measure two writing style properties of news:
Linguistic Subjectivity
Readability
over different topics and outlets.
Corpus for the Experiment
10 months of monitoring (Jan. 1st, 2010 – Oct. 31st, 2010)
498 English-language media
99 different countries
2.5M articles that appeared in the ‘Main’ feed
Articles Annotation
We annotated 926,411 articles with 1,037,359 tags, an average of 1.12 tags per article.
Topic             Articles   Topic      Articles
ART                  42896   MARKETS       24319
BUSINESS            126494   PETROLEUM     21236
CRIME               277626   POLITICS     201776
DISASTERS            83828   RELIGION      34441
ELECTIONS            28656   SCIENCE       10076
ENVIRONMENT          16103   SPORTS       141665
FASHION               1284   WEATHER        8505
INFLATION-PRICES      2331   Total       1037359
Linguistic Subjectivity
We measure the number of sentimental adjectives over the total number of adjectives per article.
We detect adjectives using the Stanford POS tagger.
We measure sentiment using SentiWordNet.
We characterize an adjective as sentimental if either its positive or negative sentiment score is > 0.25.
10K items per topic, randomly selected
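The Linguistic Subjectivity ratio can be sketched in a few lines. This sketch assumes the adjectives of an article have already been extracted by a POS tagger and that each adjective has one positive and one negative score. The toy lexicon and its values are made up. SentiWordNet assigns scores per word sense, and how the senses were combined is not stated on the slide, so the single-score lookup here is an assumption.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical toy lexicon: adjective -> (positive score, negative score).
# Real scores would come from SentiWordNet; these values are illustrative.
TOY_LEXICON: Dict[str, Tuple[float, float]] = {
    "terrible": (0.0, 0.625),
    "wonderful": (0.75, 0.0),
    "annual": (0.0, 0.0),
    "financial": (0.0, 0.0),
    "mild": (0.125, 0.125),
}


def is_sentimental(adjective: str, lexicon: Dict[str, Tuple[float, float]],
                   threshold: float = 0.25) -> bool:
    """An adjective is sentimental if its positive or negative score
    exceeds the threshold; unknown adjectives count as neutral."""
    pos, neg = lexicon.get(adjective.lower(), (0.0, 0.0))
    return pos > threshold or neg > threshold


def linguistic_subjectivity(adjectives: Iterable[str],
                            lexicon: Dict[str, Tuple[float, float]],
                            threshold: float = 0.25) -> float:
    """Fraction of an article's adjectives that are sentimental."""
    adjectives = list(adjectives)
    if not adjectives:
        return 0.0  # assumption: an article without adjectives scores 0
    sentimental = sum(is_sentimental(a, lexicon, threshold) for a in adjectives)
    return sentimental / len(adjectives)


article_adjectives = ["terrible", "annual", "financial", "wonderful", "mild"]
print(linguistic_subjectivity(article_adjectives, TOY_LEXICON))
```

In the toy article two of the five adjectives pass the 0.25 threshold, so the article's Linguistic Subjectivity is 40%.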
Validation of Linguistic Subjectivity?
This is a challenge due to the lack of a gold standard. But we found that:
Editorials and Opinion articles are more linguistically subjective compared to the average.
◮ 5766 Ed/Op articles from 57 different sources.
◮ LS mean value of 26.15% (std. dev. 0.29%).
◮ Articles in the Main feed have a mean LS of 19.45% (std. dev. 0.22%).
Popular articles are more linguistically subjective compared to the average (based on 108,516 popular articles in the period of study).
UK tabloids are more linguistically subjective compared to broadsheets.
Linguistic Subjectivity of Topics
[Figure: bar chart of Linguistic Subjectivity per topic, value axis from 0 to 25. Category labels in chart order: POLITICS, ELECTIONS, BUSINESS, SCIENCE, ENVIRONMENT, RELIGION, PETROLEUM, RANDOM, SPORTS, PRICES, MOST POP., WEATHER, MARKETS, ART, CRIME, DISASTERS, FASHION.]
Readability
We measure readability based on the Flesch Reading Ease Test
FRET(article) = 206.835 − (1.015 · ASL) − (84.6 · ASW)
where ASL is the average sentence length in words and ASW is the average number of syllables per word.
Scores typically range from 0 to 100.
The higher the FRET, the easier the text is to read.
10K items per topic, randomly selected
As validation we checked the readability of CBBC Newsround. It was the most readable set of articles, with a mean score of 62.50.
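A minimal sketch of the Flesch Reading Ease computation follows. The sentence splitter and the vowel-group syllable counter are rough heuristics chosen for illustration. They are assumptions and not the tools used in the study, so absolute scores will differ from those reported.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels and drop a
    trailing silent 'e' (but keep a final '-le' as in 'table')."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015 * ASL - 84.6 * ASW, with ASL = words per sentence
    and ASW = syllables per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)
    asw = sum(count_syllables(w) for w in words) / len(words)
    return 206.835 - 1.015 * asl - 84.6 * asw


simple = "The cat sat on the mat. The dog ran."
dense = ("Intergovernmental negotiations necessitate considerable "
         "institutional coordination.")
print(round(flesch_reading_ease(simple), 4))
print(flesch_reading_ease(simple) > flesch_reading_ease(dense))
```

Very short sentences made of one-syllable words, such as the first toy text, can score above 100, and dense text can score below 0. The 0 to 100 range describes typical prose rather than a hard bound of the formula.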
Readability of Topics
[Figure: bar chart of Readability per topic, value axis from 0 to 50. Category labels in chart order: POLITICS, ENVIRONMENT, PRICES, SCIENCE, BUSINESS, ELECTIONS, RELIGION, PETROLEUM, CRIME, RANDOM, MARKETS, MOST POP., DISASTERS, WEATHER, FASHION, ART, SPORTS.]
Readability vs. Linguistic Subjectivity on Topics
[Figure: scatter plot of topics. Horizontal axis: Linguistic Subjectivity (ticks from 14 to 28). Vertical axis: Readability (ticks from 36 to 50). Points are labelled ART, BUSINESS, ENVIRONMENT, POLITICS, RELIGION, CRIME, DISASTERS, ELECTIONS, FASHION, MARKETS, PETROLEUM, PRICES, SCIENCE, SPORTS, WEATHER.]
Outlets
We compare the Readability and Linguistic Subjectivity of:
8 US newspapers
8 UK newspapers (4 Tabloids / 4 Broadsheets)
Newspaper                Articles   Newspaper        Articles
Chicago Tribune              5477   Daily Mail          24326
Daily News                   2212   Daily Mirror         7731
Los Angeles Times            6696   Daily Star           8946
New York Post               32033   Daily Telegraph     22682
NY Times                    11508   Independent         43557
The Wall Street Journal     12300   The Guardian        15393
The Washington Post          7228   The Sun              9048
USA Today                    6208   The Times            2957
Linguistic Subjectivity of Outlets
[Figure: bar chart of Linguistic Subjectivity per outlet, value axis from 0 to 30. Category labels in chart order: The Wall Street Journal, The Washington Post, USA Today, The Times, Los Angeles Times, NY Times, Daily Telegraph, The Guardian, Chicago Tribune, Daily Star, New York Post, Independent, Daily Mail, Daily News, Daily Mirror, The Sun.]
Readability of Outlets
[Figure: bar chart of Readability per outlet, value axis from 0 to 60. Category labels in chart order: The Guardian, USA Today, Daily Mail, Daily Star, The Washington Post, Los Angeles Times, The Wall Street Journal, Daily News, Daily Telegraph, New York Post, NY Times, The Times, Chicago Tribune, Independent, Daily Mirror, The Sun.]
Readability vs. Linguistic Subjectivity on Outlets
[Figure: scatter plot of outlets. Horizontal axis: Linguistic Subjectivity (ticks from 15 to 30). Vertical axis: Readability (ticks from 30 to 60). Points are labelled Chicago Tribune, Daily Mail, Daily Mirror, Daily News, Daily Star, Daily Telegraph, Independent, Los Angeles Times, New York Post, NY Times, The Guardian, The Sun, The Times, The Wall Street Journal, The Washington Post, USA Today.]
More info and results at: http://mediapatterns.enm.bris.ac.uk
Thank you!