Automatic Detection of Tags for Political Blogs

Khairun-nisa Hassanali, Vasileios Hatzivassiloglou

nisa@hlt.utdallas.edu, vh@hlt.utdallas.edu

The University of Texas at Dallas

1. Summary

More than 22.6 million Americans maintain web sites with regularly updated commentary (blogs), of which at least 38,500 are specifically dedicated to politics.

A tool for automatically tagging political blog posts was introduced.

Political blogs differ from other blogs as they often revolve around named entities (politicians, organizations and places). Therefore, tagging of political blog posts benefits from using basic named entity recognition to improve tagging.

Tag identification using a hybrid approach (statistical and grammatical) yields better results.

Sood et al. report a precision/recall of 13.11%/22.83%, whereas Wang and Davidson report a precision/recall of 45.25%/23.24%. Our recall is higher, perhaps because of the domain.

7. Experimental Results

8. Conclusion

5. Tag Detection using Support Vector Machines

Training of SVM classifiers

Input: Blog URLs

Collect data from several blogs that tag their posts

Preprocess data: parse HTML and rectify errors

Divide data into posts and index them by their tags

Train the SVMs on the training data

Output: One classifier for each tag
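The training steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the poster used Joachims' SVM Light, whereas the sketch substitutes a small pure-Python linear SVM trained with Pegasos-style sub-gradient descent, and it assumes plain bag-of-words features. The names `featurize`, `train_linear_svm` and `train_per_tag`, and the values of `epochs` and `lam`, are hypothetical and do not come from the poster.

```python
import random
from collections import defaultdict


def featurize(text):
    """Bag-of-words counts (an assumed stand-in for the real feature set)."""
    counts = defaultdict(float)
    for token in text.lower().split():
        counts[token] += 1.0
    return dict(counts)


def train_linear_svm(examples, epochs=50, lam=0.01, seed=0):
    """Pegasos-style training of a linear SVM on hinge loss.

    examples: list of (feature_dict, label) pairs with label in {+1, -1}.
    Returns the learned weights as a plain dict.
    """
    rng = random.Random(seed)
    weights = defaultdict(float)
    order = list(examples)
    step = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for features, label in order:
            step += 1
            eta = 1.0 / (lam * step)  # decaying learning rate
            margin = label * sum(weights.get(f, 0.0) * v for f, v in features.items())
            # Regularization: shrink every weight.
            for f in weights:
                weights[f] *= 1.0 - eta * lam
            # Hinge loss: update only when the example violates the margin.
            if margin < 1.0:
                for f, v in features.items():
                    weights[f] += eta * label * v
    return dict(weights)


def train_per_tag(posts):
    """Train one binary classifier per tag. posts: list of (text, set_of_tags)."""
    all_tags = set().union(*(tags for _, tags in posts))
    featurized = [(featurize(text), tags) for text, tags in posts]
    classifiers = {}
    for tag in sorted(all_tags):
        examples = [(feats, 1 if tag in tags else -1) for feats, tags in featurized]
        classifiers[tag] = train_linear_svm(examples)
    return classifiers


# Toy data, invented for illustration only.
posts = [
    ("obama signs health care bill", {"obama", "healthcare"}),
    ("senate debates health care reform", {"healthcare"}),
    ("obama visits iraq troops", {"obama", "iraq"}),
    ("iraq war funding vote", {"iraq"}),
]
classifiers = train_per_tag(posts)
print(sorted(classifiers))
```

Each post serves as a positive example for the tags it carries and as a negative example for every other tag, which is what yields one classifier per tag.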

Detection of Tags

Input: Blog URL

Collect data from the blog

Preprocess data: parse HTML and rectify errors

Divide data into posts

Run all the classifiers on each post

Output: Top five tags associated with each post
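The detection step can be sketched as follows, assuming each tag's classifier is a linear model stored as a dictionary of feature weights and that posts are represented as bag-of-words counts. Both are assumptions made for illustration; `score`, `top_tags` and the toy weights are hypothetical names and values, not taken from the poster.

```python
import heapq


def score(weights, features):
    """Decision value of one linear classifier on one post."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())


def top_tags(classifiers, post_text, k=5):
    """Run every per-tag classifier on a post and keep the k best-scoring tags."""
    features = {}
    for token in post_text.lower().split():
        features[token] = features.get(token, 0.0) + 1.0
    scored = [(score(weights, features), tag) for tag, weights in classifiers.items()]
    return [tag for _, tag in heapq.nlargest(k, scored)]


# Hand-written toy weights, for illustration only.
toy_classifiers = {
    "iraq": {"iraq": 2.0, "war": 1.0},
    "healthcare": {"health": 2.0},
    "obama": {"obama": 1.5},
    "economy": {"jobs": 1.0},
    "senate": {"senate": 1.0},
    "taxes": {"tax": 1.0},
}
print(top_tags(toy_classifiers, "obama speaks on iraq war"))
```

Ranking by raw decision value is one simple choice; the poster does not say how the scores of different classifiers were made comparable.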

Many blogs tag their posts

Tags are representative of the topics discussed

Training data was collected from “Daily Kos” and “Red State”

100,000 posts from Daily Kos (2003-2010)

70,000 posts from Red State (2007-2010)

A total of 787,780 tags

Used Joachims' SVM Light

Use the same SVM based approach with new features based on grammatical knowledge

Proper Nouns are frequently topics

Place a higher weight on proper and common nouns

Identify entities referred to by different names: Barack Obama, Obama and Barack Hussein Obama refer to the same person

Extraction of Tag Nouns

Input: Blog URL

Fetch data from blog

Preprocess data and segment into posts

Perform shallow parsing

Extract noun phrases

Output: Top scoring nouns
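A minimal sketch of the noun-scoring idea, assuming the shallow parser has already produced (token, part-of-speech) pairs with Penn Treebank tags. The weight values and the choice to rank proper nouns above common nouns are assumptions made for illustration; the poster gives no numbers. `TAG_WEIGHTS` and `top_scoring_nouns` are hypothetical names.

```python
from collections import Counter

# Hypothetical weights: the poster only says nouns are weighted more heavily.
TAG_WEIGHTS = {"NNP": 2.0, "NNPS": 2.0, "NN": 1.0, "NNS": 1.0}


def top_scoring_nouns(tagged_tokens, k=5):
    """Score nouns by weighted frequency and return the k highest-scoring ones."""
    scores = Counter()
    for token, pos in tagged_tokens:
        weight = TAG_WEIGHTS.get(pos)
        if weight is not None:  # skip everything that is not a noun
            scores[token] += weight
    return [noun for noun, _ in scores.most_common(k)]


# Toy tagged sentence pair, invented for illustration.
tagged = [
    ("Obama", "NNP"), ("signed", "VBD"), ("the", "DT"), ("bill", "NN"),
    ("Obama", "NNP"), ("praised", "VBD"), ("the", "DT"), ("Senate", "NNP"),
    ("bill", "NN"), ("vote", "NN"),
]
print(top_scoring_nouns(tagged, k=3))
```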

Extraction of Tag Entities using Named Entity Recognition and Co-reference Resolution

Input: Blog URL

Fetch data from blog

Preprocess data and segment into posts

Perform co-reference resolution

Extract entities

Output: Top scoring entities
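To illustrate why merging name variants matters for entity scoring, the sketch below groups mentions with a deliberately simple heuristic: a mention is mapped to the longest mention that contains all of its tokens. This is not the co-reference resolution used in the poster, only a stand-in that assumes the entity mentions have already been extracted. `canonical_entities` and `count_entities` are hypothetical names.

```python
from collections import Counter


def canonical_entities(mentions):
    """Map each mention to the longest mention containing all of its tokens."""
    token_sets = {m: set(m.lower().split()) for m in mentions}
    canonical = {}
    for mention, tokens in token_sets.items():
        candidates = [c for c, c_tokens in token_sets.items() if tokens <= c_tokens]
        # Prefer the candidate with the most tokens; break ties alphabetically.
        canonical[mention] = max(candidates, key=lambda c: (len(token_sets[c]), c))
    return canonical


def count_entities(mentions):
    """Count mentions per canonical entity, most frequent first."""
    canonical = canonical_entities(set(mentions))
    return Counter(canonical[m] for m in mentions).most_common()


# Toy mention list, invented for illustration.
mentions = ["Barack Obama", "Obama", "Barack Hussein Obama", "Obama", "Harry Reid", "Reid"]
print(count_entities(mentions))
```

The heuristic breaks down as soon as two entities share a surname (a bare "Obama" could refer to more than one person), which is one reason real co-reference resolution relies on context rather than string overlap.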

Fig. 1: Tag Detection using Support Vector Machines

Fig. 2: Tag Detection using Grammatical Techniques

3. The Larger Problem

Given multiple texts from two or more blogs/political sources, answer the following questions:

On which subjects do the texts, taken as a whole across each source, agree or disagree?

How similar are the sources’ positions?

What makes them agree/disagree?

Difficult to associate an attitude with a specific topic/subject

Many clues are implicit and appear to require deep semantic analysis

Tags can serve as a basis for bringing together posts about the same topic

Compiling a profile for each political entity: What it talks about and what its position is

Organizing groups of sources according to perspective

Tags for political blogs are automatically detected

Tags are representative of topics

Significant topics are automatically identified using SVM and other NLP techniques

9. Future Work

Political Profile is a summary of a political entity’s (politician, political group) stance on different issues

Extract the top scoring topics along with the “entities’ sentiments” (attitudes towards topic) and select representative sentences that voice sentiments towards these topics

Aggregate information across texts according to specific criteria (poster, source, time) and quantitatively compare signatures and identify which topics are responsible for the differences

2. Political Blogs

6. Tag Detection using Grammatical Techniques

4. Why are Tags Needed?

Method | Precision | Recall | F-Score
Single Word SVM | 27.30% | 60.30% | 37.60%
+ Stemming | 26.10% | 59.50% | 36.30%
+ Proper Nouns | 36.50% | 56.80% | 44.40%
Named Entities | 48.40% | 49.10% | 48.70%
All Combined | 21.10% | 65% | 31.90%
Manual Scoring | 67.00% | 75% | 70.80%

Fig. 3: Results on Daily Kos

Method | Precision | Recall | F-Score
Single Word SVM | 19.00% | 30.00% | 23.30%
+ Stemming | 22.00% | 30.20% | 25.50%
+ Proper Nouns | 46.30% | 54.00% | 49.90%
Named Entities | 60.10% | 41.50% | 49.10%
All Combined | 20.30% | 65.70% | 31.00%
Manual Scoring | 47.00% | 62.00% | 53.50%

Fig. 4: Results on Red State

Test set: 2,681 posts from Daily Kos and 571 posts from Red State

Compared the detected tags to the original tags of each blog post

Manually evaluated the relevance of tags on a small portion of the test set
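The comparison of detected tags against a post's original tags can be sketched as below. The function computes precision, recall and F-score for a single post; how the poster aggregated scores across posts is not stated, so no averaging is shown. `precision_recall_f` and the example tags are hypothetical.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall and F-score of predicted tags against the original tags."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy example: five detected tags, three original tags, two in common.
detected = ["iraq", "obama", "senate", "taxes", "war"]
original = ["iraq", "obama", "troops"]
print(precision_recall_f(detected, original))
```

The F-score here is the harmonic mean of precision and recall, which is consistent with the reported tables: for example, a precision of 27.30% and a recall of 60.30% give 2 × 0.273 × 0.603 / (0.273 + 0.603) ≈ 0.376.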