Opinion Polarity Java short report

Abstract Blogs are the most common medium on the web through which users post their opinions. A blog is a web space where users share their views, beliefs and philosophy. Blogs generally fall into two types: itemized blogs, where the user posts views and opinions in response to a news item, and personal blogs, where users post on topics of their own interest under a heading of their choice. As more and more users publish on the web, it becomes important to detect their opinions on various issues, which helps in understanding the general opinion on topics such as the candidates in an election. The main challenge in mining opinion polarity from blogs is that posts are often not written in grammatically correct language, and short forms of words and sentences are common. Moreover, the mining process must take a significant number of posts into account when building the decision sets. Various approaches have already been proposed in this direction, using natural language processing tools to mine opinions from blog posts. These techniques mainly center on training a classifier on blogs with known opinions; the classifier then classifies new blogs by their closeness to the training data. Machine learning for natural language processing requires a large amount of training data to build the decision rule, which also increases classification time. This work therefore proposes a technique for mining opinion polarity from both RSS feeds and stored blog posts without machine learning, using a forward scanning algorithm. The method first measures the similarity of each blog to a particular topic.
If a blog is closely related to the topic, the presence of opinion words and sentences is detected in it. If such sentences are found, their context-specific meaning is extracted. A scoring technique then extracts the polarity of the opinionated blog. The algorithm is tested on Yahoo posts, and the results show an overall accuracy of about 70% in classifying opinion.



Chapter 1

Introduction

1.1 General Introduction

A blog (a blend of the term web log)[1] is a type of website or part of a website. Blogs are

usually maintained by an individual with regular entries of commentary, descriptions of

events, or other material such as graphics or video. Entries are commonly displayed in

reverse-chronological order. Blog can also be used as a verb, meaning to maintain or add

content to a blog.

Most blogs are interactive, allowing visitors to leave comments and even message each other

via widgets on the blogs and it is this interactivity that distinguishes them from other static

websites.[2]

Many blogs provide commentary or news on a particular subject; others function as more

personal online diaries. A typical blog combines text, images, and links to other blogs, Web

pages, and other media related to its topic. The ability of readers to leave comments in an

interactive format is an important part of many blogs. Most blogs are primarily textual,

although some focus on art (art blog), photographs (photoblog), videos (video blogging),

music (MP3 blog), and audio (podcasting). Microblogging is another type of blogging,

featuring very short posts.

As of 16 February 2011, there were over 156 million public blogs in existence.[3]

The term "weblog" was coined by Jorn Barger on 17 December 1997. The short form, "blog,"

was coined by Peter Merholz, who jokingly broke the word weblog into the phrase we blog in the

sidebar of his blog Peterme.com in April or May 1999. Shortly thereafter, Evan Williams at Pyra

Labs used "blog" as both a noun and verb ("to blog," meaning "to edit one's weblog or to post to

one's weblog") and devised the term "blogger" in connection with Pyra Labs' Blogger product,

leading to the popularization of the terms.


Origins

Before blogging became popular, digital communities took many forms, including Usenet,

commercial online services such as GEnie, BiX and the early CompuServe, e-mail lists[9] and

Bulletin Board Systems (BBS). In the 1990s, Internet forum software created running

conversations with "threads." Threads are topical connections between messages on a virtual

"corkboard."

The modern blog evolved from the online diary, where people would keep a running account of

their personal lives. Most such writers called themselves diarists, journalists, or journalers. Justin

Hall, who began personal blogging in 1994 while a student at Swarthmore College, is generally

recognized as one of the earliest bloggers,[10] as is Jerry Pournelle.[11] Dave Winer's Scripting

News is also credited with being one of the oldest and longest running weblogs. [12][13] Another

early blog was Wearable Wireless Webcam, an online shared diary of a person's personal life

combining text, video, and pictures transmitted live from a wearable computer and EyeTap

device to a web site in 1994. This practice of semi-automated blogging with live video together

with text was referred to as sousveillance, and such journals were also used as evidence in legal

matters.

Early blogs were simply manually updated components of common Web sites. However, the

evolution of tools to facilitate the production and maintenance of Web articles posted in reverse

chronological order made the publishing process feasible to a much larger, less technical,

population. Ultimately, this resulted in the distinct class of online publishing that produces blogs

we recognize today. For instance, the use of some sort of browser-based software is now a

typical aspect of "blogging". Blogs can be hosted by dedicated blog hosting services, or they can

be run using blog software, or on regular web hosting services.

Some early bloggers, such as The Misanthropic Bitch, who began in 1997, actually referred to

their online presence as a zine, before the term blog entered common usage.

Rise in popularity


After a slow start, blogging rapidly gained in popularity. Blog usage spread during 1999 and the

years following, being further popularized by the near-simultaneous arrival of the first hosted

blog tools:

Bruce Ableson launched Open Diary in October 1998, which soon grew to thousands of

online diaries. Open Diary innovated the reader comment, becoming the first blog

community where readers could add comments to other writers' blog entries.

Brad Fitzpatrick started LiveJournal in March 1999.

Andrew Smales created Pitas.com in July 1999 as an easier alternative to maintaining a

"news page" on a Web site, followed by Diaryland in September 1999, focusing more on

a personal diary community.[14]

Evan Williams and Meg Hourihan (Pyra Labs) launched blogger.com in August 1999

(purchased by Google in February 2003)

Political impact


Since 2002, blogs have gained increasing notice and coverage for their role in breaking, shaping,

and spinning news stories. The Iraq war saw bloggers taking measured and passionate points of

view that go beyond the traditional left-right divide of the political spectrum.


On 6 December 2002, Josh Marshall's talkingpointsmemo.com blog called attention to U.S.

Senator Lott's comments regarding Senator Thurmond. Senator Lott was eventually to resign his

Senate leadership position over the matter.

An early milestone in the rise in importance of blogs came in 2002, when many bloggers focused

on comments by U.S. Senate Majority Leader Trent Lott.[16] Senator Lott, at a party honoring

U.S. Senator Strom Thurmond, praised Senator Thurmond by suggesting that the United States

would have been better off had Thurmond been elected president. Lott's critics saw these

comments as a tacit approval of racial segregation, a policy advocated by Thurmond's 1948

presidential campaign. This view was reinforced by documents and recorded interviews dug up

by bloggers. (See Josh Marshall's Talking Points Memo.) Though Lott's comments were made at

a public event attended by the media, no major media organizations reported on his controversial


comments until after blogs broke the story. Blogging helped to create a political crisis that forced

Lott to step down as majority leader.

Similarly, blogs were among the driving forces behind the "Rathergate" scandal. To wit:

(television journalist) Dan Rather presented documents (on the CBS show 60 Minutes) that

conflicted with accepted accounts of President Bush's military service record. Bloggers declared

the documents to be forgeries and presented evidence and arguments in support of that view.

Consequently, CBS apologized for what it said were inadequate reporting techniques (see Little

Green Footballs). Many bloggers view this scandal as the advent of blogs' acceptance by the

mass media, both as a news source and opinion and as means of applying political pressure.

The impact of these stories gave greater credibility to blogs as a medium of news dissemination.

Though often seen as partisan gossips, bloggers sometimes lead the way in bringing

key information to public light, with mainstream media having to follow their lead. More often,

however, news blogs tend to react to material already published by the mainstream media.

Meanwhile, an increasing number of experts blogged, making blogs a source of in-depth

analysis. (See Daniel Drezner, J. Bradford DeLong or Brad Setser.)

Mainstream popularity

By 2004, the role of blogs became increasingly mainstream, as political consultants, news

services, and candidates began using them as tools for outreach and opinion forming. Blogging

was established by politicians and political candidates to express opinions on war and other

issues and cemented blogs' role as a news source. (See Howard Dean and Wesley Clark.) Even

politicians not actively campaigning, such as the UK's Labour Party's MP Tom Watson, began to

blog to bond with constituents.

In January 2005, Fortune magazine listed eight bloggers that business people "could not ignore":

Peter Rojas, Xeni Jardin, Ben Trott, Mena Trott, Jonathan Schwartz, Jason Goldman, Robert

Scoble, and Jason Calacanis.[17]

Israel was among the first national governments to set up an official blog.[18] Under David

Saranga, the Israeli Ministry of Foreign Affairs became active in adopting Web 2.0 initiatives,


including an official video blog[18] and a political blog.[19] The Foreign Ministry also held a

microblogging press conference via Twitter about its war with Hamas, with Saranga answering

questions from the public in common text-messaging abbreviations during a live worldwide

press conference. The questions and answers were later posted on IsraelPolitik, the country's

official political blog.

The impact of blogging upon the mainstream media has also been acknowledged by

governments. In 2009, the presence of the American journalism industry had declined to the

point that several newspaper corporations were filing for bankruptcy, resulting in less direct

competition between newspapers within the same circulation area. Discussion emerged as to

whether the newspaper industry would benefit from a stimulus package by the federal

government. President Barack Obama acknowledged the emerging influence of blogging upon

society by saying "if the direction of the news is all blogosphere, all opinions, with no serious

fact-checking, no serious attempts to put stories in context, that what you will end up getting is

people shouting at each other across the void but not a lot of mutual understanding".

Types

There are many different types of blogs, differing not only in the type of content, but also in the

way that content is delivered or written.

Personal blogs

The personal blog, an ongoing diary or commentary by an individual, is the traditional,

most common blog. Personal bloggers usually take pride in their blog posts, even if their

blog is never read. Blogs often become more than a way to just communicate; they

become a way to reflect on life, or works of art. Blogging can have a sentimental quality.

Few personal blogs rise to fame and the mainstream, but some personal blogs quickly

garner an extensive following. One type of personal blog, referred to as a microblog, is

extremely detailed and seeks to capture a moment in time. Some sites, such as Twitter,

allow bloggers to share thoughts and feelings instantaneously with friends and family,

and are much faster than emailing or writing.

Corporate and organizational blogs


A blog can be private, as in most cases, or it can be for business purposes. Blogs used

internally to enhance the communication and culture in a corporation or externally for

marketing, branding or public relations purposes are called corporate blogs. Similar blogs

for clubs and societies are called club blogs, group blogs, or by similar names; typical use

is to inform members and other interested parties of club and member activities.

By genre

Some blogs focus on a particular subject, such as political blogs, travel blogs (also known

as travelogs), house blogs,[23][24] fashion blogs, project blogs, education blogs, niche

blogs, classical music blogs, quizzing blogs and legal blogs (often referred to as

blawgs) or dreamlogs. Two common types of genre blogs are art blogs and music blogs.

A blog featuring discussions especially about home and family is not uncommonly called

a mom blog.[25][26][27][28][29] While not a legitimate type of blog, one used for the sole

purpose of spamming is known as a Splog.

By media type

A blog comprising videos is called a vlog, one comprising links is called a linklog, a site

containing a portfolio of sketches is called a sketchblog or one comprising photos is

called a photoblog.[30] Blogs with shorter posts and mixed media types are called

tumblelogs. Blogs that are written on typewriters and then scanned are called typecast or

typecast blogs; see typecasting (blogging).

A rare type of blog hosted on the Gopher Protocol is known as a Phlog.

By device

Blogs can also be defined by the type of device used to compose them. A blog written

on a mobile device like a mobile phone or PDA could be called a moblog.[31] One early

blog was Wearable Wireless Webcam, an online shared diary of a person's personal life

combining text, video, and pictures transmitted live from a wearable computer and

EyeTap device to a web site. This practice of semi-automated blogging with live video

together with text was referred to as sousveillance. Such journals have been used as

evidence in legal matters.

Community and cataloging

The Blogosphere


The collective community of all blogs is known as the blogosphere. Since all blogs are on

the internet by definition, they may be seen as interconnected and socially networked,

through blogrolls, comments, linkbacks (refbacks, trackbacks or pingbacks) and

backlinks. Discussions "in the blogosphere" are occasionally used by the media as a

gauge of public opinion on various issues. Because new, untapped communities of

bloggers can emerge in the space of a few years, Internet marketers pay close attention to

"trends in the blogosphere".[32]

BlogDay

Blogday.org[33] was created with the belief that bloggers should have one day dedicated to

getting to know other bloggers from other countries and areas of interest. The designated

date is August 31, because when written 3108, it resembles the word "Blog". On that day,

bloggers recommend five new blogs to their visitors, so that readers discover new,

previously unknown blogs.

Blog search engines

Several blog search engines are used to search blog contents, such as Bloglines,

BlogScope, and Technorati. Technorati, which is among the most popular blog search

engines, provides current information on both popular searches and tags used to

categorize blog postings.[34] The research community is working on going beyond simple

keyword search, by inventing new ways to navigate through huge amounts of information

present in the blogosphere, as demonstrated by projects like BlogScope.

Blogging communities and directories

Several online communities exist that connect people to blogs and bloggers to other

bloggers, including BlogCatalog and MyBlogLog.[35] Interest-specific blogging platforms

are also available. For instance, Blogster has a sizable community of political bloggers

among its members. Global Voices aggregates international bloggers, "with emphasis on

voices that are not ordinarily heard in international mainstream media."[36]

Blogging and advertising

It is common for blogs to feature advertisements either to financially benefit the blogger or to

promote the blogger's favorite causes. The popularity of blogs has also given rise to "fake blogs"

in which a company will create a fictional blog as a marketing tool to promote a product.[37]


1.2 Objective

Consider the following two blog posts.

“I feel that for past several years congress government has not undertaken much of new

development work. Manmohan Singh had many contributions towards Indian economy but don’t

know why he is becoming ineffective.”

“BJPs development activities were good. I wish the current set of central ministers continued the

same work.”

The two posts reflect similar sentiments without any direct word-level similarity between them. The second post mentions the “current set of central ministers” instead of the government, which refers to the same thing. Moreover, both posts concern the same subject matter. Hence extracting the subject associated with a blog, and the polarity of the blogger’s opinion about that subject, is challenging.

The objective of the work is to develop an engine that detects blogs containing user opinion about a particular subject and then classifies each into one of three opinion classes: positive, negative or neutral. The technique uses both syntactic and semantic analysis to mine the opinion and its polarity.

1.3 Statement of the problem

The problem can be defined as to “mine the related blogs of a particular subject and mine the opinion of

the user posts in that specific subject matter.”

1.4 Methodology

Consider the following three statements.

“Even though this government has done some significant good works but their performance cannot be considered as Good”

“even though government could not materialize some key policies their performance is satisfactory.”


“this government has done some good work and some other issues could not be finalized.”

From these it is clear that a sentence may carry some positive opinion, some negative opinion and an overall opinion: the first statement is negative overall and the second is positive overall. Other sentences, like the third, do not carry any overall opinion. For opinion polarity to be detected, polarity itself must first be well defined.

We define positive polarity as a set of keywords:

PP = {Good, Better, Best, Satisfactory, Satisfied, Well, Fantastic, Fabulous, Wonderful, Great, Successful, Tremendous, Effective, …} ---(1)

We define negative polarity as a set of words as below:

NP = {Bad, Poor, Undesirable, Ineffective, Unsatisfactory, Worst, …} ---(2)

We define weight words:

WW = {Very, Too, High, Huge, Exceptional, …} ---(3)

A weight word that appears before a polarity word gives more weight to the opinion.

For example: “I think Governments work is good”

“I think this governments work is exceptionally well.”

The second sentence carries a more heavily weighted polarity than the first.
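A minimal Java sketch of how sets (1) to (3) might be applied in a single forward scan. The class name WeightedLexiconScorer, the method score, and the doubling factor BOOST are illustrative assumptions; the report does not state how much extra weight a weight word contributes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class WeightedLexiconScorer {
    // Word lists taken from sets (1) to (3), lower-cased for matching.
    static final Set<String> PP = new HashSet<>(Arrays.asList(
            "good", "better", "best", "satisfactory", "satisfied", "well", "fantastic",
            "fabulous", "wonderful", "great", "successful", "tremendous", "effective"));
    static final Set<String> NP = new HashSet<>(Arrays.asList(
            "bad", "poor", "undesirable", "ineffective", "unsatisfactory", "worst"));
    static final Set<String> WW = new HashSet<>(Arrays.asList(
            "very", "too", "high", "huge", "exceptional"));

    // Assumption: a weight word doubles the polarity word that directly follows it.
    static final int BOOST = 2;

    // Scans the text left to right and returns the signed polarity total.
    static int score(String text) {
        int total = 0;
        boolean boosted = false;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (WW.contains(token)) {
                boosted = true;      // remember the weight word for the next token
                continue;
            }
            int polarity = PP.contains(token) ? 1 : NP.contains(token) ? -1 : 0;
            total += boosted ? polarity * BOOST : polarity;
            boosted = false;         // the boost applies only to the very next word
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(score("I think the government's work is good"));      // prints 1
        System.out.println(score("I think the government's work is very good")); // prints 2
    }
}
```

Only exact word forms are matched here, so a derived form such as "exceptionally" would need stemming or a synonym lookup before it could count as a weight word.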

We also define sentence fragmentation words:

SF = {‘,’, ‘;’, ‘but’, ‘though’, ‘still’, ‘whereas’, ‘and’} ---(4)

Each of the words or delimiters in (4) has its own significance. If a sentence contains ‘but’, the second part of the sentence carries more weight. If it contains ‘though’, the first part of the sentence is more relevant.
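The ‘but’ and ‘though’ rules can be sketched in Java as follows. This is an illustration under stated assumptions: the names FragmentWeighter, Fragment and split are hypothetical, only the first ‘but’ or ‘though’ in a sentence is treated as the pivot, and the emphasis factor of 2.0 is a placeholder because the report does not give numeric weights.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FragmentWeighter {
    // A piece of a sentence together with the relative weight given to it.
    static final class Fragment {
        final String text;
        final double weight;

        Fragment(String text, double weight) {
            this.text = text;
            this.weight = weight;
        }
    }

    private static final Pattern PIVOT =
            Pattern.compile("\\b(but|though)\\b", Pattern.CASE_INSENSITIVE);

    // Assumption: the emphasized part counts twice as much as the other part.
    static final double EMPHASIS = 2.0;

    // Splits at the first pivot word; 'but' favors the second part, 'though' the first.
    static List<Fragment> split(String sentence) {
        List<Fragment> parts = new ArrayList<>();
        Matcher m = PIVOT.matcher(sentence);
        if (!m.find()) {
            parts.add(new Fragment(sentence.trim(), 1.0));
            return parts;
        }
        boolean isBut = m.group(1).equalsIgnoreCase("but");
        String first = sentence.substring(0, m.start()).trim();
        String second = sentence.substring(m.end()).trim();
        parts.add(new Fragment(first, isBut ? 1.0 : EMPHASIS));
        parts.add(new Fragment(second, isBut ? EMPHASIS : 1.0));
        return parts;
    }

    public static void main(String[] args) {
        for (Fragment f : split("the roads are bad but the schools are good")) {
            System.out.println(f.weight + " : " + f.text);
        }
    }
}
```

A sentence that opens with "Even though" puts the pivot at the start, so the first fragment is nearly empty; a fuller implementation would need a separate rule for that pattern.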

The basic polarity of a sentence depends on the overall polarity of its parts, and the polarity of a blog depends on the average polarity of all the sentences in the blog. Sentences that do not contain any of the words in (1) or (2), or their synonyms, are considered informative or neutral sentences.
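The averaging rule above can be sketched in Java as below. The names BlogPolarity, Label, average and classify are hypothetical, and using zero as the boundary between the three classes is an assumption; the report does not give a threshold.

```java
final class BlogPolarity {
    enum Label { POSITIVE, NEGATIVE, NEUTRAL }

    // Mean of the signed sentence scores; informative sentences contribute 0.
    static double average(double[] sentenceScores) {
        if (sentenceScores.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double s : sentenceScores) {
            sum += s;
        }
        return sum / sentenceScores.length;
    }

    // Assumption: the sign of the average decides the class.
    static Label classify(double[] sentenceScores) {
        double avg = average(sentenceScores);
        if (avg > 0) {
            return Label.POSITIVE;
        }
        if (avg < 0) {
            return Label.NEGATIVE;
        }
        return Label.NEUTRAL;
    }

    public static void main(String[] args) {
        // One positive, one strongly negative and one informative sentence.
        System.out.println(classify(new double[] {1, -2, 0})); // prints NEGATIVE
    }
}
```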

There are also some strong polarity words,

SP = {NOT, Definitely, Certainly, Never}

some of which strengthen polarity (Definitely, Certainly) and some of which negate it (NOT, Never). If NOT or NEVER appears before any PP or NP word in a sentence or a part of a sentence, the polarity of that sentence or part is reversed.
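A short Java sketch of the negation rule, assuming that once NOT or NEVER has been seen, every later polarity word in the same fragment is reversed. The class name NegationScorer and the shortened word lists are illustrative; the lists are drawn from sets (1) and (2).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class NegationScorer {
    // Shortened word lists drawn from sets (1) and (2).
    static final Set<String> PP = new HashSet<>(Arrays.asList(
            "good", "better", "best", "satisfactory", "great", "effective"));
    static final Set<String> NP = new HashSet<>(Arrays.asList(
            "bad", "poor", "ineffective", "unsatisfactory", "worst"));
    static final Set<String> NEGATORS = new HashSet<>(Arrays.asList("not", "never"));

    // Forward scan of one sentence fragment; a negator flips every polarity word after it.
    static int score(String fragment) {
        int total = 0;
        boolean negated = false;
        for (String token : fragment.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (NEGATORS.contains(token)) {
                negated = true;
                continue;
            }
            int polarity = PP.contains(token) ? 1 : NP.contains(token) ? -1 : 0;
            total += negated ? -polarity : polarity;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(score("the work is not good"));      // prints -1
        System.out.println(score("they never did a bad job"));  // prints 1
    }
}
```

Contracted or fused negations such as "cannot" or "don't" are not matched by this exact-word test and would need their own entries.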


Based on this understanding of opinion, the polarity of each blog in a set of blogs related to a specific subject matter is extracted. The algorithms are presented in detail in the System Design chapter.

1.5 Scope of the Work

Opinion Detection is one of the most exciting and challenging applications of text analysis today.

It is the ability to recognize and classify opinionated text within documents (Liu 2007).

This ability is desirable for various tasks, including filtering advertisements, separating the

arguments in online debate or discussions, ranking web documents cited as authorities on

contentious topics, etc. In Opinion Detection, one has to check whether a given text has a factual

nature (i.e. describes a given situation/event without giving any opinion about it) or expresses an

opinion on its subject matter. This task can be performed on different levels of granularity, i.e. on

word level, sentence level or on document level. As a conclusion of this task a given word,

sentence or document can be declared as of opinionated nature (or subjective) or of factual

nature (objective). Text with opinionated nature can further be analyzed for having negative or

positive polarity of opinion and this subtask is called Opinion Polarity Detection. The objective

of the work is to detect opinion polarity on a given subject amongst a set of blog documents

featuring the subject.


System Analysis

3.1 Present System

Various polarity detection techniques have been proposed in the literature, as summarized in the related work section. The main problem with most of them is that they depend on distance analysis and on clustering based on word occurrence. Polarity detection is treated purely as a syntactic outcome of sentence interpretation, and many documents may not have a clear polarity. These techniques do not offer a clear mechanism for extracting the polarity of a given subject. In short, polarity detection is presented as an aggregate of distances between sentences and not as a natural language processing technique. No past work has defined a finite automaton for polarity detection, though numerous tree-based approaches have been proposed. Present polarity detection techniques fall broadly into two categories: 1) techniques based on machine learning and 2) techniques based on clustering. In 1), a machine learning system such as a support vector machine is trained on known blogs with and without opinion. Large databases are used as training samples in such techniques. The classifier assigns the given blogs to groups of opinionated sentences based on various distance measures. Methods of type 2) build a decision tree based on the clustering and occurrence of interrelated words and of the words that express the various opinions. Moreover, these techniques are tested against standard datasets such as the TREC blog dataset.

3.2 Proposed System


The system is modeled on two test sets. First, we extract live blogs from news feeds such as various Yahoo sites. Here the subject matter is the news item itself. The live blogs are extracted and stored offline for analysis. Second, we consider standard blogs to analyze the strength of the algorithm and to verify the correctness of the proposed system.

The main stages and functioning of the system are elaborated below.

1) First, segment the blogs into sentences and the sentences into words. The words are tagged using the WordNet tool for sentence segmentation and tagging.

2) Once the words are tagged, find the similarity of each blog to a specific subject matter based on the tags of the blog.

3) Blogs most similar to the heading are ranked higher and sorted to the top in comparison with the other blogs.

4) The high-ranked blogs are forward scanned for deterministic (opinion-indicating) words such as “I believe”, “I think” and so on. A closeness measure with such words is computed on the high-ranked blogs, which are then categorized into blogs with opinion and blogs without opinion.

5) Blogs that relate to a given heading and possess an opinion are then scanned for the type of opinion.

6) Following Section 1.4, the sentences are weighted from start to end based on the sentence fragments defined there.

7) Based on the positive, negative or zero scores, the blogs are classified as positive, negative or neutral opinion blogs.
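Stages 2 to 4 can be illustrated with a small Java sketch. The report does not specify the similarity measure, so Jaccard overlap of content words is swapped in here purely as an assumption; the class name BlogRelevance, the stop-word list COMMON, and the cue pattern covering only "I think" and "I believe" are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

final class BlogRelevance {
    // Assumption: a tiny list of common words to ignore when comparing texts.
    static final Set<String> COMMON = new HashSet<>(Arrays.asList(
            "a", "an", "the", "of", "is", "are", "to", "in", "on", "and", "for"));

    // Opinion cues named in stage 4, matched as whole words.
    static final Pattern CUE =
            Pattern.compile("\\bi\\s+(think|believe)\\b", Pattern.CASE_INSENSITIVE);

    static Set<String> contentWords(String text) {
        Set<String> words = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !COMMON.contains(t)) {
                words.add(t);
            }
        }
        return words;
    }

    // Jaccard overlap of content words: |intersection| / |union|.
    static double similarity(String heading, String blog) {
        Set<String> h = contentWords(heading);
        Set<String> b = contentWords(blog);
        Set<String> union = new HashSet<>(h);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(h);
        common.retainAll(b);
        return (double) common.size() / union.size();
    }

    // Returns the blogs sorted so that the most similar to the heading come first.
    static List<String> rank(String heading, List<String> blogs) {
        List<String> sorted = new ArrayList<>(blogs);
        sorted.sort(Comparator.comparingDouble((String b) -> similarity(heading, b)).reversed());
        return sorted;
    }

    static boolean hasOpinionCue(String blog) {
        return CUE.matcher(blog).find();
    }

    public static void main(String[] args) {
        List<String> blogs = Arrays.asList(
                "Cricket scores today", "I think the budget is fine");
        for (String blog : rank("budget", blogs)) {
            System.out.println(hasOpinionCue(blog) + " : " + blog);
        }
    }
}
```

A tag-based measure as described in stage 2 would replace contentWords with the output of a part-of-speech tagger; the ranking and cue-scanning steps would stay the same.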

Chapter 4


System Requirement

Software Requirement

IDE: Netbeans 6.3

Language: Java

Tool: OpenNLP

Dataset:

Operating System: Preferably Windows 7/ Vista ( 32/64 bit)

Hardware Requirement

Internet Connection

Minimum 2 GB RAM

Preferably a Core 2 Duo or higher processor (a dual-core processor can also run the

model, at lower speed)

Display Resolution: minimum 1048x1048

Hard Disk Space: 30 GB for the installation and the blog database storage.


Chapter 5

System Design

5.1 Blog Similarity Measure

5.1.1 Data Flow Diagram

5.1.1.1 1st Level DFD

[Figure: 1st level DFD. Subject and User Blogs feed Similarity Measurement, which outputs Blogs that are Close to the Subject.]


5.1.1.2 2nd Level DFD

[Figure: 2nd level DFD. Elements: Blog Reader from Web, Blog Reader from Database, Format Blog, Blog Data (Description, Heading), Tokenizer, Tokens, Non-Common Words, Parts of Speech Tagging, Similarity Measure, Blog Ranking.]


5.1.1.3 Class Diagram


5.2 Opinion Mining from Similar Blogs

5.2.1 Block Diagram

[Figure: block diagram. Elements: Similar Blog Document, Sentence Extraction, Sentence division based on WW, PP / NP, Opinion Weighting, K-Means Clustering, Polarity Detection.]


5.2.2 Flow Chart of the Process

[Figure: flow chart of the process, with the following steps.]

1) Start.

2) Extract sentences from the blog.

3) Consider each independent sentence and subdivide it based on WW.

4) Let Wi = {w1i, w2i, …, wni} be the word metric of each sentence containing NP or PP; the rest of the sentences are omitted. Set i = 0.

5) Let the total number of sentences be S.

6) If i >= S, obtain the aggregated polarity and stop.

7) Otherwise, compute TWip = sum of the weights of positive polarity and TWin = sum of the weights of negative polarity.

8) If TWip < TWin, the sentence has negative polarity; otherwise it has positive polarity. Increment i and return to step 6.

Chapter 6


Conclusion

Opinion polarity is an important aspect of web mining and web data analysis. As most present-day news is debated online and user opinions are presented online, it becomes important to develop tools that not only extract correlated blogs but also provide an overview of the individual blogs and, in turn, a generalized overview of all of them. Many algorithms have been proposed in this direction. Most of these papers address only the detection of opinion in blogs and do not present a comprehensive overview of the entire process of fetching RSS blog data and analyzing it on the fly. In this work we developed the entire lifecycle of fetching and analyzing blogs for opinion. The technique is based on the similarity of a blog to its subject matter and the presence of opinion in such correlated blogs. The results show significant agreement with human perception. The technique can be further improved by incorporating machine learning into the current algorithm for better learning of the opinions in blogs.