Readability and Linguistic Subjectivity of News


Ilias Flaounas

University of Bristol

February 22, 2011


Our research area


Traditional Research on Media

few outlets per study (< 10)

limited numbers of news-items (a few hundred in the best cases)

restricted time periods (a few days)

news-items from a single country’s media

manual annotation (‘coding’)

commercial databases and their constraints

hypothesis-driven research


Research Focus

In our research we undertake a large-scale analysis of the textual content of mainstream news media, using automated techniques.




‘Mainstream news-media’ since we do not focus on modern online-only means of spreading news, such as blogs or Twitter.





‘Textual’ since we use only the textual information of news rather than analysing e.g. images, videos, or speech.






‘Large-scale’ since we analyse hundreds of outlets, typically for extended periods of time, involving millions of news items.



‘Automated’ in the sense that the analysis is performed by applying Artificial Intelligence techniques rather than using human ‘coders’.

In our research, data management is also a challenge!


NOAM: News Outlets Analysis & Monitoring system

I. Flaounas, O. Ali, M. Turchi, T. Snowsill, F. Nicart, T. De Bie, N. Cristianini: “NOAM: News Outlets Analysis and Monitoring System”, SIGMOD, accepted for publication, 2011.


Current Status

Our corpus in numbers:

> 1300 multilingual news sources

> 3000 news feeds

133 countries

22 languages

> 3 years of continuous monitoring

40K news items per day

> 30M news items in total


Support Vector Machines as Topic Taggers

We trained 15 topic taggers on 5 years of data from:
◮ Reuters
◮ NY Times

Typical text preprocessing: Stemming, stop-words removal, TF-IDF

Two-class SVMs

Cosine similarity

Empirical tuning of C parameter per tagger

We set the decision threshold of each tagger to maximise the F0.5-score on the test set
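The threshold-selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each tagger has already produced a real-valued decision score per article, and the function names and toy data are hypothetical. It uses the standard definition F_beta = (1 + beta^2) · P · R / (beta^2 · P + R), which with beta = 0.5 weights precision more heavily than recall.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(scores, labels, beta=0.5):
    """Return the (threshold, F-beta) pair with the highest F-beta.

    An article is tagged when its score is >= threshold. The candidate
    thresholds are the observed scores themselves (an assumption made
    here for simplicity).
    """
    total_pos = sum(labels)
    best = (None, -1.0)
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        if tp + fp == 0 or total_pos == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_pos
        f = f_beta(precision, recall, beta)
        if f > best[1]:
            best = (t, f)
    return best


# Hypothetical decision scores and true labels (1 = article is on topic).
scores = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.5]
labels = [0, 0, 1, 0, 1, 1]
threshold, score = best_threshold(scores, labels)
print(threshold, round(score, 4))
```

On this toy data the selected threshold is stricter than the one that maximises recall, which is the intended effect of choosing F0.5: taggers prefer to miss an article rather than tag it wrongly.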


SVM Taggers

Trained on Reuters & NY Times corpora

Topic             F0.5-Score  F0.5 Std. Dev.  Precision  Recall
CRIME                  78.92            1.51      82.93   66.59
DISASTERS              83.4             3.7       87.69   70.34
ELECTIONS              70.32            8.74      78.99   49.32
FASHION                83.88           18.61      94.61   71.27
INFLATION-PRICES       77.01            3.19      81.45   63.38
MARKETS                92.02            0.32      94.09   84.63
PETROLEUM              70.67            2.78      75.14   58.73
SCIENCE                73.63            5.17      83.72   50.62
SPORTS                 97.78            0.5       98.31   95.75
WEATHER                71.43            3.68      82.91   46.84
ART                    81.67            1.34      84.9    71.38
BUSINESS               81.16            1.19      86.23   65.87
ENVIRONMENT            64.29            4.26      73.48   43.7
POLITICS               73.81            2.29      76.65   64.81
RELIGION               74.95            4.21      83.57   53.59


The experiment

The goal is to measure two writing style properties of news:

Linguistic Subjectivity

Readability

over different topics and outlets.

Corpus for the Experiment

10 months of monitoring (Jan. 1, 2010 – Oct. 31, 2010)

498 English-language media

99 different countries

2.5M articles that appeared in the ‘Main’ feed


Articles Annotation

We annotated 926,411 articles with 1,037,359 tags, an average of 1.12 tags per article.

Topic             Articles   Topic       Articles
ART                  42896   MARKETS        24319
BUSINESS            126494   PETROLEUM      21236
CRIME               277626   POLITICS      201776
DISASTERS            83828   RELIGION       34441
ELECTIONS            28656   SCIENCE        10076
ENVIRONMENT          16103   SPORTS        141665
FASHION               1284   WEATHER         8505
INFLATION-PRICES      2331   Total        1037359



Linguistic Subjectivity

We measure the number of sentimental adjectives over the total number of adjectives per article.

We detect adjectives using the Stanford POS tagger.

We measure sentiment using SentiWordNet.

We consider an adjective sentimental if either its positive or its negative sentiment score is > 0.25.

10K items per topic randomly selected
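The measure on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the input is taken to be already POS-tagged with Penn Treebank tags (adjectives are JJ, JJR, JJS), and a small hypothetical dictionary of (positive, negative) scores stands in for SentiWordNet. How the word-level scores were derived from SentiWordNet's synset-level scores is not stated in the slides, so that step is left out.

```python
def linguistic_subjectivity(tagged_tokens, sentiment_scores, threshold=0.25):
    """Fraction of adjectives whose positive or negative score exceeds threshold.

    tagged_tokens: list of (word, pos_tag) pairs with Penn Treebank tags.
    sentiment_scores: dict mapping a lowercased adjective to a
        (positive, negative) pair; a hypothetical stand-in for SentiWordNet.
    Returns 0.0 for an article without adjectives (an assumption).
    """
    adjectives = [w.lower() for w, tag in tagged_tokens if tag.startswith("JJ")]
    if not adjectives:
        return 0.0
    sentimental = 0
    for adj in adjectives:
        pos, neg = sentiment_scores.get(adj, (0.0, 0.0))
        if pos > threshold or neg > threshold:
            sentimental += 1
    return sentimental / len(adjectives)


# Toy example: the tags and scores below are made up for illustration.
tagged = [
    ("The", "DT"), ("horrific", "JJ"), ("crash", "NN"), ("closed", "VBD"),
    ("the", "DT"), ("main", "JJ"), ("road", "NN"), ("for", "IN"),
    ("a", "DT"), ("long", "JJ"), ("and", "CC"), ("miserable", "JJ"),
    ("night", "NN"),
]
lexicon = {
    "horrific": (0.0, 0.75),
    "miserable": (0.0, 0.625),
    "main": (0.0, 0.0),
    "long": (0.125, 0.0),
}
ls = linguistic_subjectivity(tagged, lexicon)
print(f"LS = {ls:.2%}")
```

Two of the four adjectives in the toy sentence pass the 0.25 threshold, so half of its adjectives count as sentimental.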


Validation of Linguistic Subjectivity?

This is a challenge because there is no gold standard. But we found that:

Editorials and Opinion articles are more linguistically subjective than average.

◮ 5766 Ed/Op articles from 57 different sources.
◮ LS mean value of 26.15% (std. dev. 0.29%).
◮ Articles in the Main feed have a mean LS of 19.45% (std. dev. 0.22%).






Popular articles are more linguistically subjective than average (based on 108,516 popular articles in the period of study).







UK tabloids are more linguistically subjective than broadsheets.


Linguistic Subjectivity of Topics

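[Figure: bar chart of linguistic subjectivity per topic, axis ticks from 0 to 25. Bars in the order shown on the slide: POLITICS, ELECTIONS, BUSINESS, SCIENCE, ENVIRONMENT, RELIGION, PETROLEUM, RANDOM, SPORTS, PRICES, MOST POP., WEATHER, MARKETS, ART, CRIME, DISASTERS, FASHION.]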


Readability

We measure readability based on the Flesch Reading Ease Test

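FRET(article) = 206.835 − 1.015 · ASL − 84.6 · ASW

where ASL is the average sentence length (words per sentence) and ASW is the average number of syllables per word.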

Scores typically range from 0 to 100.

The higher the FRET, the easier the text is to read.

10K items per topic randomly selected

As validation, we checked the readability of CBBC Newsround, a BBC news service written for children. It was the most readable set of articles, with a mean score of 62.50.
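The Flesch Reading Ease formula from this slide can be sketched as follows. This is a rough illustration, not the authors' pipeline: the sentence splitter, the word pattern, and especially the syllable counter are simple heuristics chosen here as assumptions, and a production system would more likely use a pronunciation dictionary.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, dropping a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    """FRET = 206.835 - 1.015 * ASL - 84.6 * ASW for a piece of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)                          # words per sentence
    asw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw


easy = "The cat sat on the mat. The dog ran to the park."
hard = ("Institutional considerations necessitate comprehensive "
        "organisational reconfiguration.")
easy_score = flesch_reading_ease(easy)
hard_score = flesch_reading_ease(hard)
print(round(easy_score, 1), round(hard_score, 1))
```

Short sentences of one-syllable words score far higher than a single sentence of long polysyllabic words, which is the behaviour the formula is designed to capture.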


Readability of Topics

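[Figure: bar chart of readability (FRET) per topic, axis ticks from 0 to 50. Bars in the order shown on the slide: POLITICS, ENVIRONMENT, PRICES, SCIENCE, BUSINESS, ELECTIONS, RELIGION, PETROLEUM, CRIME, RANDOM, MARKETS, MOST POP., DISASTERS, WEATHER, FASHION, ART, SPORTS.]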


Readability vs. Linguistic Subjectivity on Topics

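[Figure: scatter plot of topics, with Readability on the vertical axis (ticks from 36 to 50) and, going by the slide title, Linguistic Subjectivity on the horizontal axis (ticks from 14 to 28). Labelled points: ART, BUSINESS, ENVIRONMENT, POLITICS, RELIGION, CRIME, DISASTERS, ELECTIONS, FASHION, MARKETS, PETROLEUM, PRICES, SCIENCE, SPORTS, WEATHER.]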


Outlets

We compare the Readability and Linguistic Subjectivity of:

8 US newspapers

8 UK newspapers (4 Tabloids / 4 Broadsheets)

Newspaper                 Articles   Newspaper         Articles
Chicago Tribune               5477   Daily Mail           24326
Daily News                    2212   Daily Mirror          7731
Los Angeles Times             6696   Daily Star            8946
New York Post                32033   Daily Telegraph      22682
NY Times                     11508   Independent          43557
The Wall Street Journal      12300   The Guardian         15393
The Washington Post           7228   The Sun               9048
USA Today                     6208   The Times             2957


Linguistic Subjectivity of Outlets

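[Figure: bar chart of linguistic subjectivity per outlet, axis ticks from 0 to 30. Bars in the order shown on the slide: The Wall Street Journal, The Washington Post, USA Today, The Times, Los Angeles Times, NY Times, Daily Telegraph, The Guardian, Chicago Tribune, Daily Star, New York Post, Independent, Daily Mail, Daily News, Daily Mirror, The Sun.]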


Readability of Outlets

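[Figure: bar chart of readability (FRET) per outlet, axis ticks from 0 to 60. Bars in the order shown on the slide: The Guardian, USA Today, Daily Mail, Daily Star, The Washington Post, Los Angeles Times, The Wall Street Journal, Daily News, Daily Telegraph, New York Post, NY Times, The Times, Chicago Tribune, Independent, Daily Mirror, The Sun.]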


Readability vs. Linguistic Subjectivity on Outlets

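[Figure: scatter plot of outlets, with Readability on the vertical axis (ticks from 30 to 60) and, going by the slide title, Linguistic Subjectivity on the horizontal axis (ticks from 15 to 30). Labelled points: Chicago Tribune, Daily Mail, Daily Mirror, Daily News, Daily Star, Daily Telegraph, Independent, Los Angeles Times, New York Post, NY Times, The Guardian, The Sun, The Times, The Wall Street Journal, The Washington Post, USA Today.]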


More info and results at: http://mediapatterns.enm.bris.ac.uk

Thank you!

