
A method of extracting malicious expressions in bulletin board systems by using context analysis

    Hiroshi Hanafusa, Kazuhiro Morita, Masao Fuketa, Jun-ichi Aoe

    Department of Information Science and Intelligent Systems, University of Tokushima, Tokushima 770-8506, Japan

Article info

    Article history:

    Received 29 September 2009

    Received in revised form 17 August 2010

    Accepted 17 August 2010

    Keywords:

    Malicious expressions

    Bulletin board systems

    Filtering systems

    Context analysis

    Multi-attribute rules

    Separate co-occurrence expressions

Abstract

Bulletin board systems are well-known basic services on the Internet for frequent information exchange. The convenience of bulletin boards enables us to communicate with other persons and to read the communication contents at any time. However, malicious postings about crimes are serious problems for serving companies and users. The extracting scheme of the traditional methods depends on words or sequences of words without considering the context of articles, and it therefore takes a lot of human effort to alert malicious articles. In order to reduce this human effort, this paper presents a new filtering algorithm that can reduce the false positive rate for non-malicious articles by using context analysis. The presented scheme builds detection knowledge by introducing multi-attribute rules. Experimental results for 11,019 test data show that the specificity and sensitivity of the presented method are 38.7 and 24.1 percentage points higher than those of the traditional method, respectively.

© 2010 Elsevier Ltd. All rights reserved.

    1. Introduction

Bulletin board systems (BBS) have been used as basic services on the Internet for frequent information exchange. Representative examples of BBS are 2channel ⟨2 channel⟩ in Japan and Yahoo! Bulletin Board ⟨Yahoo! BBS⟩ worldwide. Social networking services (SNS) and blog services can be considered applications of BBS because they provide a communicating place on the Internet (Claypool, Brown, LE, & Waseda, 2001). Mixi ⟨Mixi⟩ and Yokoku-In ⟨Yokoku.in⟩ are well-known services in Japan, and myspace ⟨myspace⟩ and Livejournal ⟨Livejournal⟩ are representative services in the world. Therefore, the BBS discussed in this paper include the above SNS applications.

The convenience of bulletin boards is that the anonymity of the services lets us communicate casually with other persons and read the communication contents at any time. However, malicious postings in bulletin boards, namely postings about mental abuse and warnings of crimes, are serious problems for users.

Each country has taken action against these problems. In America, the Children's Internet Protection Act (CIPA) ⟨Children Internet Protection Act⟩ was established in 2000, which requires public schools and libraries to use filtering software. In Germany, the Youth Protection Act ⟨Youth Protection Act⟩ was established in 2002, which requires providers to filter harmful content. In France, the Digital Economic Act ⟨Digital Economic Act⟩ was established in 2004, which requires explanation of filtering for accessing online communication services to the public. In Japan, the Provider Liability Act ⟨Provider Liability Act⟩ was established in May 2002; the action plan for the dissemination and enlightenment of filtering was started in March 2006, and the Internet Environmental Improvement Act was established in June 2008. Moreover, there were cabinet decisions on comprehensive measures for suicide established in June 2007, and the prohibition of harm to third persons was appended in a revised version in December 2008. Considering these laws and measures, legislative preparations in Japan are more advanced than in other countries, and there are many court cases and case examples.

In order to solve these serious problems for serving companies and users, typical filtering schemes use a URL filtering system ⟨Anichiva⟩, which relies on a knowledge base of malicious sites. This scheme can completely filter sites that include malicious expressions; however, the URL knowledge must be built by humans against the rapidly growing volume of Web information. Moreover, it cannot be applied to filter only the malicious articles within the same site.

More useful schemes (Goldberg, Nichols, Oki, & Terry, 1992; Goldberg, Roeder, Guptra, & Perkins, 2001; Good et al., 1996; Heckerman, Chickering, Meek, Rounthwite, & Kadie, 2000; Herlocker, Konstan, & Riedl, 2002; Kim, Min, Jeon, Man Ro, & Han, 2009; Lee, Lee, Chung, & An, 2007; Pennock, Horvitz, Lawrence, & Giled, 2000; Reddy, Kitsuregawa, Sreekanth, & Rao, 2002; Wang, Arjen, & Marcel, 2006) are automatic detection filtering systems. In these studies, content analysis (Kim et al., 2009; Lee et al., 2007) was introduced for malicious domains such as pornography, drugs, violence and crime, but the basic scheme uses two steps of classification to detect harmful words based on SVM and does not use context analysis. In these methodologies, there are two types of detecting knowledge bases: rule-based schemes (Francis, Frantz, & Mathieu, 2000; Landau, Sillion, & Vichot, 1993) and statistic-based schemes (Gharieb, 2000; Yoohwan, Wing, Mooi, & Chao, 2006). Rule-based techniques can detect the precise locations of expected expressions and it is easy to improve parts of the rules, but they require practical matching engines and expert persons to build the knowledge. The typical statistic-based scheme is the Support Vector Machine (SVM) (Larry & Malik, 2001), for which it is easy to build detection knowledge through its strong learning technique and to control decision engines, but it is weak at locating the expected expressions precisely and at improving parts of the detecting knowledge, because of its automatic learning.

These automatic filtering techniques only act as support systems in current applications. Although the current techniques can detect a sequence of words, there are no practical techniques that can consider the context of articles, which is their biggest limitation. Therefore, in current automatic filtering systems, the rate of false positives (Shiraki et al., 2004; Xu, Chong, Lu, & Zhou, 2004) for non-malicious articles is high, and humans must check a lot of articles in the application systems. For example, the DeNA BBS service (3.14 million articles per day) needs 300 persons and 2000 million yen for checking, but critical problems remain because not all articles can be checked, given the large amount of inappropriate Web articles, even though automatic filtering techniques are used.

In order to solve these problems, this paper presents a new context filtering algorithm that reduces human effort and improves the false positive rate without degrading the false negative rate. First, the presented method defines separate co-occurrence (SC) expressions, which cannot be detected by the word sequences of the traditional methods. Context analysis for SC expressions is then proposed by introducing multi-attribute rules, which are suited to extracting expressions on the changing Web, especially inappropriate Web contents, together with a common hierarchy method. The presented method is evaluated on 11,019 test data including malicious and non-malicious articles. It is verified that the presented method can improve the false positive rate of the traditional method without degrading the false negative rate.

Section 2 describes the outline of the context analysis and introduces the presented system by classifying malicious expressions into inadequate and crime expressions. Section 3 proposes multi-attribute rules for these expressions. In Section 4, a context analysis algorithm is presented. Section 5 evaluates the accuracy and time performance of the presented method on the experimental data. Section 6 presents conclusions and possible further work.

    2. The presented system

    2.1. The outline of context analysis

Fig. 1 shows examples of malicious and non-malicious expressions, where the abbreviation RPG in (b1) means a role playing game.

Fig. 1. Examples of malicious and non-malicious expressions.
(a) Malicious expression texts:
(a1) I get a strong sword. Bring your company to Tokyo station tomorrow. I will kill them.
(a2) This machine is very heavy because there are many muzzles. I will try to kill them soon.
(a3) You are ugly in any jacket, always lying in the meeting at work and you speak about BBW.
(b) Non-malicious expression texts:
(b1) I have a strong sword. Bring your company in the next scene of RPG tomorrow. I kill them.
(b2) This machine is very heavy because there are many processes. I will try to kill them soon.
(b3) Do not write malicious articles in BBS. For example, "You are ugly in any jacket" or "this woman is BBW".

Texts (a1), (a2) and (a3) in Fig. 1a are malicious because the underlined expressions are crime and inadequate expressions. However, texts (b1), (b2) and (b3) in Fig. 1b are not malicious because the bold expressions are negative expressions that deny the malicious expressions. In (b1), the malicious expressions "I have a strong sword" and "I kill them" are denied by a game field identified by the SC expression "RPG". In (b2), there are negative expressions given by a computer field identified by the SC expressions "machine" and "processes". In (b3), there are negative expressions given by the warning SC expression "Do not write malicious articles", which denies the malicious expressions "You are ugly" and "this woman is BBW".


The difficulty of the traditional method is that it has no scheme for detecting separate co-occurrence (SC) expressions formed by the combinations ("I have a strong sword", "RPG") and ("RPG", "I kill them") in (b1), ("machine and processes", "I will try to kill them") in (b2), and ("Do not write malicious articles", "this woman is BBW") in (b3).

In order to achieve the above solution, this paper presents a new filtering algorithm to detect SC expressions by introducing context analysis and multi-attribute rules. Note that the expressions to be detected by the presented method combine sequences of expressions, as in the traditional method, with SC expressions handled by context analysis.
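As a minimal illustration of the idea (not the authors' implementation), an SC expression can be thought of as a pair of cues that may occur far apart in the same article; the helper below simply tests whether both cues are present. The function name and the example pair are assumptions for illustration only.

def has_sc_expression(article, sc_pair):
    """Return True if both parts of the SC pair occur somewhere in the article."""
    first, second = sc_pair
    return first in article and second in article

b1 = ("I have a strong sword. Bring your company in the next scene of RPG "
      "tomorrow. I kill them.")
# (RPG, "I kill them") is an SC pair signalling a game context that denies the threat.
print(has_sc_expression(b1, ("RPG", "I kill them")))   # True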

    2.2. Inadequate and crime expressions

    In this paper, expressions to be detected are classified into two categories of inadequate and crime expressions.

    2.2.1. Inadequate expressions

Inadequate expressions, which make people feel irritated, have four main categories as follows:

(a) ⟨⟨ABUSE⟩⟩ expressions involve violent or insulting comments towards someone, or cause the psychological state of being annoyed by someone, as follows:

(1) You are ugly in any jacket.
(2) This woman is BBW.
(3) You are always lying in the meeting at work.
(4) Everybody in the company says you are stupid.
(5) Are you crazy?

(b) ⟨⟨DISCRIMINATION⟩⟩ means treating people differently through prejudice: unfair treatment of one person or group, usually because of prejudice about race and ethnicity, as follows:

(1) I think she is deaf because she can't understand what I say all the time.
(2) He is a bad man as all his talk is about BBW.
(3) Yellow monkeys can't use this room because this is for white people.

(c) ⟨⟨DATING SERVICE WEBSITE⟩⟩ refers to dating systems which allow individuals, couples and groups to make contact and to communicate with each other over the Internet, as follows:

(1) I'm a 16-year-old girl. I can go out with guys at 3 (where 3 means the amount of money).

(d) ⟨⟨OBSCENITY⟩⟩ means the trait of behaving in an obscene manner, as follows:

(1) I want to see you naked.
(2) I want to buy kid porn.
(3) He will visit that building to buy some kid porn.

Although there are also overlap expressions, i.e., successive postings with the same content (trolling), and ungrammatical expressions, neither is included in the discussion of this paper because they have no SC expressions.

    2.3. Crime expressions

Bulletin boards include expressions which warn of crimes. They are very important expressions to detect because such terribly malicious postings can seriously affect people and organizations. As some incidents have actually arisen from postings which warned of crimes, such postings should not be permitted even if they are fake. There are four categories of expressions with warnings of crimes as follows:

    (a) hhMURDER&VIOLENCEii, as defined in common law countries, is the unlawful killing of another human being with

    intent as follows:

    (1) Your friends are immediately killed,

    (2) killed some people at Tokyo station last week.

    (b) hhEXPLOSION&ARSONii means the crime of deliberately and maliciously destroying or setting fire to structures or

    wildland as follows:

    (1) A strange boy set fire to his grandparents house last night.

    (2) A Female terrorist destroyed a big shopping mall with dynamites in Wakayama last Monday.

    (c) hhCRIME MATERIALii means the tools which are used in the crime processing as follows:

    (1) I get a strong sword, I will kill them by it.

    (2) This machine is very heavy because there are many muzzles.

    (d) hhDRUGii means a chemical substance that affects the processes of the mind or body as follows:

    (1) S crystal, high quality, 0.0002 g.

    (2) White and clear SS, high quality, ice ice ice.


    2.4. The construction of the presented system

Fig. 2 shows the construction of the presented system.

Fig. 2. The construction of the presented system: acquisition of posting identification information (article number/posting title), morpheme and named entity analysis using a morpheme dictionary, extraction rules for inadequate and crime expressions, and risk judgment.

In Fig. 2, the article number, posting title and text are acquired from the bulletin board (text data), and then the input sentence is segmented by morphological analysis and named entity processing. In the next step, the two context phases for inadequate and crime expressions are carried out in parallel by their respective extraction rules. Finally, risk judgment is conducted according to the above results.
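The flow of Fig. 2 can be summarized by the following structural sketch. Every function body is a placeholder, since the paper specifies only the roles of the components; the function names and the toy word lists are assumptions for illustration.

def acquire_posting(raw_article):
    """Acquisition of posting identification information: keep the text body."""
    return raw_article["body"]

def morpheme_and_ne_analysis(text):
    """Stand-in for morphological analysis and named entity processing."""
    return text.lower().split()          # the real system uses a morpheme dictionary

def extract_inadequate(segments):
    """Placeholder for the extraction rules for inadequate expressions."""
    return [s for s in segments if s in {"ugly", "stupid", "liar"}]

def extract_crime(segments):
    """Placeholder for the extraction rules for crime expressions."""
    return [s for s in segments if s in {"kill", "sword", "dynamite"}]

def risk_judgment(inadequate, crime):
    """Combine the two parallel extraction results into a final alert."""
    if crime:
        return "alert: crime expression"
    if inadequate:
        return "alert: inadequate expression"
    return "no alert"

article = {"number": 1, "title": "test", "body": "I get a strong sword and will kill them"}
segments = morpheme_and_ne_analysis(acquire_posting(article))
print(risk_judgment(extract_inadequate(segments), extract_crime(segments)))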

    3. Rule-based extracting knowledge

    3.1. Definition of multi-attribute rules

For the extraction of expected expressions in natural language processing, it is important to introduce an efficient algorithm that can match multi-attribute rules over morphological, syntactic and semantic information. In order to build efficient detection rules, or knowledge, the fundamental concept was proposed by Ando, Mizobuchi, Shishibori, and Aoe (1998) and has been utilized for the target-based approach to sentence classification (Kadoya et al., 2005). Moreover, this approach has been applied to the classification of medical reports (Kiyoi, Atlam, Fuketa, Yoshinari, & Aoe, 2008) and to emotion analysis (Yoshinari, Atlam, Morita, Kiyoi, & Aoe, 2008). Generally, these attributes include strings (words), parts of speech (categories) and concepts (semantics, or meanings). Suppose that A_NAME represents an attribute name and let A_VALUE represent an attribute value. Then let R be a finite set of pairs (A_NAME, A_VALUE); R is called a rule structure, with the following attributes:

STR: string, or word spelling.

CAT: category given by general concepts, or a part of speech.

SEM: semantic information as defined in this paper.

The formal definition follows the description by Kadoya et al. (2005), but here all rules correspond to inadequate and crime expressions. For example, using these attributes, the input structures of the sentence "He kills someone" are defined as follows:

    N1 = {(STR, He), (CAT, HUMAN)}.

    N2 = {(STR, kills), (CAT, VERB), (SEM, MURDER&VIOLENCE)}.

    N3 = {(STR, someone), (CAT, HUMAN)}.


In the above examples, (SEM, MURDER&VIOLENCE) denotes the semantics for crime expressions, (CAT, HUMAN) denotes a category given by general concepts, and (CAT, VERB) denotes a part of speech. A huge number of crime expressions can be represented by multi-attribute rules using these semantics.

The p-th multi-attribute matching rule Rule (p) is defined as follows:

Rule(p) = Rp1 Rp2 ... Rpm, where m = np and 0 < np.

A rule that matches the above input expression "He kills someone" can be defined as Rule (1) as follows:

Rule(1) = R11 R12 R13, where
R11 = {(CAT, HUMAN)},
R12 = {(SEM, MURDER&VIOLENCE)},
R13 = {(CAT, HUMAN)}.
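To make the set-based notation concrete, the following short Python sketch represents each input structure N as a set of (A_NAME, A_VALUE) pairs and checks Rule (1) against the sentence "He kills someone" by positionwise set inclusion. The helper name and data layout are illustrative assumptions, not part of the paper.

# Input structures: sets of (A_NAME, A_VALUE) pairs, as defined in Section 3.1.
N1 = {("STR", "He"), ("CAT", "HUMAN")}
N2 = {("STR", "kills"), ("CAT", "VERB"), ("SEM", "MURDER&VIOLENCE")}
N3 = {("STR", "someone"), ("CAT", "HUMAN")}
sentence = [N1, N2, N3]

# Rule (1): a sequence of required pair sets R11, R12, R13.
RULE_1 = [
    {("CAT", "HUMAN")},
    {("SEM", "MURDER&VIOLENCE")},
    {("CAT", "HUMAN")},
]

def rule_matches(rule, inputs):
    """Rule (p) matches when every R_pi is a subset of the aligned input N_i."""
    return len(rule) == len(inputs) and all(
        r.issubset(n) for r, n in zip(rule, inputs)
    )

print(rule_matches(RULE_1, sentence))   # True: "He kills someone" triggers Rule (1)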

    3.2. Multi-attribute descriptions

    3.2.1. Semantic information

Semantic information (SEM) follows the classification of Section 2, so only the typical semantics are explained here. Inadequate and crime expressions are described by combining a variety of words, phrases, categories and semantics as follows:

(1) Abuse expressions: (SEM, ABUSE) is used for violent or insulting comments towards someone. For example, abuse expressions are "ugly", "liar", "stupid" and "crazy".

(2) Discrimination expressions: (SEM, DISCRIMINATION) is used for treating people differently through prejudice. For example, a discrimination expression is "deaf".

(3) Obscenity expressions: (SEM, OBSCENITY) is used for the trait of behaving in an obscene manner. For example, obscenity expressions are "naked" and "kid porn".

(4) Murder & violence expressions: (SEM, MURDER&VIOLENCE) is used for the unlawful killing of another human being with intent. For example, murder and violence expressions are "kill" and "shoot".

(5) Explosion & arson expressions: (SEM, EXPLOSION&ARSON) is used for destroying or setting fire to structures or wildland areas. For example, explosion and arson expressions are "terrorist", "destroy" and "set fire".

(6) Crime material expressions: (SEM, CRIME MATERIAL) is used for the tools which are used in committing a crime. For example, crime material expressions are "sword" and "muzzles".

    3.2.2. Multi-attribute rules

Context analysis with multi-attribute expressions (MULTI) is carried out in two stages. The first stage determines candidates of inadequate and crime expressions in the text and produces results (CON, x), where CON and x represent features for the context analysis of the second stage. The second stage determines the final result, or risk judgment, by using (CON, x), and produces (FIX, y) if the result is fixed and (NON, y) otherwise. The detailed method is discussed in the next section.

    Tables 1 and 2 show rule-based knowledge for the first and the second stages, respectively.

Table 1 uses general concepts such as CLOTHES, JOB and DOCUMENT. (CON, NEGATION) negates inadequate and crime decisions. Negation can also be performed by concepts denoting special fields, (CON, GAME) and (CON, COMPUTER). In Table 1, Rule (8) matches the input crime expression "He kills someone" and produces (CON, MURDER&VIOLENCE) for the context analysis of the second stage.

In Table 2, Rule (18) = {{(CON, CRIME MATERIAL)} {(CON, PLACE)} {(CON, TIME)} {(CON, MURDER&VIOLENCE)}} is the decision rule for ⟨⟨CRIME MATERIAL⟩⟩ and ⟨⟨MURDER&VIOLENCE⟩⟩, where ⟨⟨ ⟩⟩ corresponds to the types of inadequate and crime classes in Section 2. This rule takes the features produced by the context analysis of the first stage in Table 1 and produces the final judgment (FIX, ⟨⟨CRIME MATERIAL⟩⟩, ⟨⟨MURDER&VIOLENCE⟩⟩), where this notation means (FIX, ⟨⟨CRIME MATERIAL⟩⟩) and (FIX, ⟨⟨MURDER&VIOLENCE⟩⟩). If the final judgment is not FIX, then the output becomes (NON, ⟨⟨CRIME MATERIAL⟩⟩), (NON, ⟨⟨MURDER&VIOLENCE⟩⟩) or (NON, ⟨⟨ABUSE⟩⟩), as in Table 2.

    4. Multi-attribute matching

    4.1. Construction of machines

For multi-attribute matching, Ando et al. (1998) proposed a set matching algorithm whose implementation was developed in the C programming language. Kadoya et al. (2005) used this approach for sentence classification, Kiyoi et al. (2008) used it for medical reports, and Yoshinari et al. (2008) used it for emotion analysis.


Suppose that a is a sequence of input structures. The multi-attribute pattern-matching (MAPM) machine of this method takes a as the input and produces matching results as the output corresponding to the rules. Formally, the machine MAPM consists of a set of states, and each state is represented by a number. The matching operation of the machine MAPM is similar to the multi-keyword string pattern-matching method of Aho-Corasick (Aho & Corasick, 1975; Ando et al., 1998), but it has the following distinctive features:

    4.1.1. goto and output functions

Let T be a set of states and let L be a set of rule structures R; then the behaviour of the machine MAPM is defined by the following two functions:

(a) goto function: goto: T × L → T ∪ {fail}, where the function goto maps a pair consisting of a state and a rule structure into a state or the message fail. A transition label of the goto function is extended to a set notation. Therefore, in the machine MAPM, a transition is decided by the inclusion relationship, i.e., whether the input structure N includes the rule structure R or not.

(b) output function: output: T → A, where A is a set of pairs (p, (x, y)) for rule number p and matching results (x, y). For Rule (1) in Table 1, the matching result becomes (CON, ABUSE), and {(1, (CON, ABUSE))} is the proper representation.

The input structures to be matched by the matching rules are defined by the same set representation. N is used as the notation for input structures to distinguish them from R. In order to allow abstraction of the rule structure, matching of the rule structure R and the input structure N is decided by the inclusion relationship that N includes R, denoted by N ⊇ R. Therefore, the machine MAPM is also called a set matching machine.

    Let R_SET be a set of Rule (p) for inadequate and crime expressions. Consider the following Rule (7) and Rule (8) in Table 1.

R71 = {(SEM, MURDER&VIOLENCE)}, R72 = {(CAT, HUMAN)}.
R81 = {(CAT, HUMAN)}, R82 = {(SEM, MURDER&VIOLENCE)}, R83 = {(CAT, HUMAN)}.

Table 1
Examples of extracting knowledge of the first stage.

Output: (CON, ABUSE)
  Rule (1) = {{(SEM, ABUSE)} {(CAT, CLOTHES)}}; example: "ugly in any jacket"
  Rule (2) = {{(SEM, ABUSE)} {(CAT, JOB)}}; example: "liar at work"
  Rule (3) = {{(SEM, ABUSE)} {(CAT, DOCUMENT)}}; example: "malicious articles"
Output: (CON, DISCRIMINATION)
  Rule (4) = {{(CAT, HUMAN)} {(CAT, VERB)} {(SEM, DISCRIMINATION)}}; example: "she is deaf"
Output: (CON, OBSCENITY)
  Rule (5) = {{(CAT, VERB)} {(CAT, HUMAN)} {(SEM, OBSCENITY)}}; example: "see you naked"
  Rule (6) = {{(CAT, HUMAN)} {(CAT, VERB)} {(SEM, OBSCENITY)}}; example: "I buy kid porn"
Output: (CON, MURDER&VIOLENCE)
  Rule (7) = {{(SEM, MURDER&VIOLENCE)} {(CAT, HUMAN)}}; example: "kill people"
  Rule (8) = {{(CAT, HUMAN)} {(SEM, MURDER&VIOLENCE)} {(CAT, HUMAN)}}; example: "He kills someone"
Output: (CON, EXPLOSION&ARSON)
  Rule (9) = {{(CAT, HUMAN)} {(SEM, EXPLOSION&ARSON)}}; example: "strange boy set fire"
  Rule (10) = {{(SEM, EXPLOSION&ARSON)} {(CAT, ORGANIZATION)}}; example: "destroy a high school"
Output: (CON, CRIME MATERIAL)
  Rule (11) = {{(CAT, HUMAN)} {(CAT, VERB)} {(SEM, CRIME MATERIAL)}}; example: "I get a strong sword"
  Rule (12) = {{(CAT, MACHINE)} {(SEM, CRIME MATERIAL)}}; example: "machine has many muzzles"
Output: (CON, GAME)
  Rule (13) = {{(CAT, VERB)} {(CAT, GAME)}}; example: "play RPG"
Output: (CON, COMPUTER)
  Rule (14) = {{(CAT, MACHINE)} {(CAT, VERB)} {(CAT, PROCESS)}}; example: "machine has many processes"
Output: (CON, NEGATION)
  Rule (15) = {{(SEM, NEGATIVE)}}; example: "Do not write malicious articles"
Output: (CON, PLACE)
  Rule (16) = {{(CAT, NAME)} {(CAT, STATION)}}; example: "Tokyo station"
Output: (CON, TIME)
  Rule (17) = {{(CAT, TIME)}}; example: "tomorrow"

Table 2
Examples of decision knowledge of the second stage (context analysis).

Output: (FIX, ⟨⟨CRIME MATERIAL⟩⟩, ⟨⟨MURDER&VIOLENCE⟩⟩)
  Rule (18) = {{(CON, CRIME MATERIAL)} {(CON, PLACE)} {(CON, TIME)} {(CON, MURDER&VIOLENCE)}}
  Example: Fig. 1 (a1) "I get a strong sword. Bring your company to Tokyo station tomorrow. I will kill them."
Output: (NON, ⟨⟨CRIME MATERIAL⟩⟩)
  Rule (19) = {{(CON, CRIME MATERIAL)} {(CON, GAME)}}
  Example: Fig. 1 (b1) "I get a strong sword. Bring your company to play RPG tomorrow. I will kill them."
Output: (FIX, ⟨⟨CRIME MATERIAL⟩⟩, ⟨⟨MURDER&VIOLENCE⟩⟩)
  Rule (20) = {{(CON, CRIME MATERIAL)} {(CON, MURDER&VIOLENCE)}}
  Example: Fig. 1 (a2) "This machine has many muzzles. I will try to kill them soon."
Output: (NON, ⟨⟨MURDER&VIOLENCE⟩⟩)
  Rule (21) = {{(CON, COMPUTER)} {(CON, MURDER&VIOLENCE)}}
  Example: Fig. 1 (b2) "This machine has many processes. I will try to kill them soon."
Output: (FIX, ⟨⟨ABUSE⟩⟩)
  Rule (22) = {{(CON, ABUSE)} {(CON, ABUSE)}}
  Example: Fig. 1 (a3) "You are ugly in any jacket, always lying in the meeting at work and you speak about BBW."
Output: (NON, ⟨⟨ABUSE⟩⟩)
  Rule (23) = {{(CON, NEGATION)} {(CON, ABUSE)}}
  Example: Fig. 1 (b3) "Do not write malicious articles in BBS. For example, 'You are ugly in any jacket' or 'this woman is BBW'."


Suppose that the input "he kills someone" has the following structures:

N1 = {(STR, he), (CAT, HUMAN)}.
N2 = {(STR, kills), (CAT, VERB), (SEM, MURDER&VIOLENCE)}.
N3 = {(STR, someone), (CAT, NOUN), (CAT, HUMAN)}.

Each input structure includes the corresponding rule structure as follows:

N1 ⊇ R71, N2 ⊇ R72.
N1 ⊇ R81, N2 ⊇ R82, N3 ⊇ R83.

The machine MAPM becomes non-deterministic if there are two or more rules that can match the input structure. This ambiguity is resolved by selecting the longest applicable rule with high priority.

Figs. 3 and 4 show the goto and output functions for Tables 1 and 2, respectively. In these figures, Rules 10, 15, 16, 17 and 19 are omitted because only some sample rules are shown.

    4.2. Multi-stage matching

The context analysis of MULTI is carried out in two stages, where the rules of Table 1 are used for the first stage matching and those of Table 2 for the second stage. The following procedure summarizes the behaviour of the machine MAPM as the procedure MAPM(a, M); the context analysis of the proposed method MULTI is carried out by calling this procedure MAPM(a, M) twice (Kiyoi et al., 2008).

    4.2.1. Procedure MAPM(a, M)

A sequence a of input structures is N1, N2, ..., Nn, where each Ni (0 < i < n + 1) is an input structure. M is a machine MAPM defined by the goto and output functions. Note that the input of the first stage is the result of named entity processing, and the input of the second stage is the sequence of outputs with the notation (CON, x) from the first stage matching. The function NEXT(a) returns the first structure N1 and modifies a to N2 ... Nn, i.e., the first structure N1 is removed.

Fig. 3. The goto and output functions for some sample rules in Table 1. Note that (CAT, CLOTHES), (CAT, JOB) and (CAT, DOCUMENT) are merged into the same transition, and output(16) merges Rules 1, 2 and 3.


begin
  STATE := 0;
  while a ≠ NULL do
  begin
    N := NEXT(a);
    while STATE ≠ fail do STATE := goto(STATE, R) such that N ⊇ R;
    MAPM(a, M);
    Output := output(p) for matched Rule (p);
    N := NEXT(a);
  end
end
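Below is a simplified, runnable rendering of the matching loop in Python. The goto and output tables encode only Rule (11) of Table 1, the state numbering is arbitrary, and a failed transition simply resets the machine to the initial state instead of following Aho-Corasick failure links; all of this is an illustrative assumption rather than the authors' C implementation.

# (state, required rule structure) -> next state; encodes Rule (11) only.
GOTO = {
    (0, frozenset({("CAT", "HUMAN")})): 1,
    (1, frozenset({("CAT", "VERB")})): 2,
    (2, frozenset({("SEM", "CRIME MATERIAL")})): 3,
}
OUTPUT = {3: [(11, ("CON", "CRIME MATERIAL"))]}   # Rule (11) of Table 1

def mapm(sequence):
    """Scan input structures, following goto transitions decided by set inclusion."""
    state, results = 0, []
    for n in sequence:
        moved = False
        for (src, required), dst in GOTO.items():
            if src == state and required <= n:    # N includes R
                state, moved = dst, True
                break
        if not moved:
            state = 0                             # simplistic restart, no failure links
        results.extend(OUTPUT.get(state, []))
    return results

sentence = [
    {("STR", "I"), ("CAT", "HUMAN")},
    {("STR", "get"), ("CAT", "VERB")},
    {("STR", "a strong sword"), ("CAT", "NOUN"), ("SEM", "CRIME MATERIAL")},
]
print(mapm(sentence))   # [(11, ('CON', 'CRIME MATERIAL'))]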

Consider the input sentence "I get a strong sword" in (a1) of Fig. 1, with the following sequence of structures:

N1 = {(STR, I), (CAT, HUMAN)}.
N2 = {(STR, get), (CAT, VERB)}.
N3 = {(STR, a strong sword), (CAT, NOUN), (SEM, CRIME MATERIAL)}.

Table 3 shows the matching flow of the first stage for the above input structures. The state transitions are 1, 2, 5 and 8, and then state 8 produces output(8) = {(11, (CON, CRIME MATERIAL))}, identifying {(CON, CRIME MATERIAL)}, which becomes the input of the second stage matching.

Table 4 shows the matching flow of the second stage for (a1) in Fig. 1. Suppose that the following results are obtained from the first stage:

"I get a strong sword." is N1 = {(CON, CRIME MATERIAL)}.
"Bring your company to Tokyo station tomorrow." is N2 = {(CON, PLACE)} and N3 = {(CON, TIME)}.
"I will kill them." is N4 = {(CON, MURDER&VIOLENCE)}.

The state transitions are 1, 4, 5, 6 and 7, and then output(7) gives the final decision of (a1) as ⟨⟨CRIME MATERIAL⟩⟩ and ⟨⟨MURDER&VIOLENCE⟩⟩.

Table 3
Examples of the matching process in the first stage.

STATE   N    R                           goto/output
1       N1   {(CAT, HUMAN)}              2
2       N2   {(CAT, VERB)}               5
5       N3   {(SEM, CRIME MATERIAL)}     output (8) = {(11, (CON, CRIME MATERIAL))}

Fig. 4. The goto and output functions for the rules in Table 2.

In the same manner, Table 5 shows the matching flow of the second stage for part of (b1) in Fig. 1. Suppose that the following results are obtained from the first stage:

"Bring your company in the next scene of RPG tomorrow." is N1 = {(CON, GAME)}.
"I kill them." is N2 = {(CON, MURDER&VIOLENCE)}.

The state transitions are 1, 11 and 12. Then, output(12) gives the final decision that the expression is not ⟨⟨MURDER&VIOLENCE⟩⟩.
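As a rough illustration of how the second stage consumes the (CON, x) features of the first stage, the following sketch encodes a small subset of the decision knowledge (Rules (18) and (19) of Table 2 only); the function name, the subsequence test and the longest-rule tie-breaking are simplifying assumptions, not the authors' implementation.

SECOND_STAGE_RULES = [
    # (rule number, required CON features in order, decision)
    (18, ["CRIME MATERIAL", "PLACE", "TIME", "MURDER&VIOLENCE"],
     ("FIX", ("CRIME MATERIAL", "MURDER&VIOLENCE"))),
    (19, ["CRIME MATERIAL", "GAME"], ("NON", ("CRIME MATERIAL",))),
]

def second_stage(con_features):
    """Return the decision of the longest second-stage rule embedded in the features."""
    best = None
    for number, pattern, decision in SECOND_STAGE_RULES:
        it = iter(con_features)
        if all(p in it for p in pattern):          # pattern occurs as a subsequence
            if best is None or len(pattern) > len(best[1]):
                best = (number, pattern, decision)
    return best[2] if best else ("NON", ())

# (a1): sword ... Tokyo station ... tomorrow ... kill -> fixed as malicious
print(second_stage(["CRIME MATERIAL", "PLACE", "TIME", "MURDER&VIOLENCE"]))
# (b1): sword ... RPG -> the game context denies the crime reading
print(second_stage(["CRIME MATERIAL", "GAME", "MURDER&VIOLENCE"]))

Run on the features of (a1) the sketch fixes the article as malicious, while the GAME feature of (b1) leads to a NON decision, mirroring the flows of Tables 4 and 5.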

    5. Experimental results

    5.1. Basic detection knowledge and experimental data

Basic detection knowledge has been built to detect expressions for abuse, obscenity, drugs and crime. Table 6 shows the contents of the detection knowledge for each expression type, where the following abbreviations are used:

NUM-WORD: The number of word expressions.
NUM-RULE: The number of multi-attribute rules.
NUM-PAT: The number of surface patterns.

Experimental data have been collected from 22 bulletin boards, and the number of articles with possibly inappropriate expressions is 8450. For the main sites 2channel ⟨2 channel⟩, Yahoo! ⟨Yahoo! BBS⟩, Gakkou-Ura ⟨Gakkou-Ura⟩ and Yokoku-In ⟨Yokoku.in⟩, the fields of the above articles include criticism, requests, hospitals, educational problems, cartoons, betrayal, dirt, comics, rumors, arrest, notices and crimes. 1525 inadequate (I) and 388 crime (C) expressions to be evaluated have been obtained from the 8450 malicious articles. For the non-inadequate (NI) and non-crime (NC) test data, 2569 non-malicious expressions (1277 non-inadequate (NI) and 1382 non-crime (NC) expressions) have been prepared from Web pages, like Fig. 1(b), including basic single words such as "kill", "sword", "sex", "adults" and so on. That is to say, test data I and C are malicious data, and NI and NC are non-malicious data.

    5.2. Experimental results

The presented method, based on multi-attribute expressions in context, is called MULTI, and the traditional method, based on sequences of morphemes or words, is called SINGLE in contrast to MULTI. To evaluate MULTI and SINGLE, specificity and sensitivity (Altman & Bland, 1994) are used, defined as follows:

True positive (TP): Malicious expressions correctly determined as malicious.
False positive (FP): Non-malicious expressions incorrectly determined as malicious.
True negative (TN): Non-malicious expressions correctly determined as non-malicious.
False negative (FN): Malicious expressions incorrectly determined as non-malicious.

Let NUM_TP, NUM_FP, NUM_TN and NUM_FN be the numbers of TP, FP, TN and FN, respectively.

SPECIFICITY: The rate (%) of specificity, calculated as NUM_TN/(NUM_TN + NUM_FP).
SENSITIVITY: The rate (%) of sensitivity, calculated as NUM_TP/(NUM_TP + NUM_FN).

These two rates are written out in the short sketch below.
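A direct transcription of the two measures; the counts passed in the example are toy numbers, not the experimental figures of Tables 7-9.

def specificity(num_tn, num_fp):
    """SPECIFICITY = NUM_TN / (NUM_TN + NUM_FP), as a percentage."""
    return 100.0 * num_tn / (num_tn + num_fp)

def sensitivity(num_tp, num_fn):
    """SENSITIVITY = NUM_TP / (NUM_TP + NUM_FN), as a percentage."""
    return 100.0 * num_tp / (num_tp + num_fn)

print(f"specificity = {specificity(90, 10):.1f}%")   # 90.0% (toy counts)
print(f"sensitivity = {sensitivity(80, 20):.1f}%")   # 80.0% (toy counts)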

Table 4
Examples of the matching process in the second stage.

STATE   N    R                            goto/output
1       N1   {(CON, CRIME MATERIAL)}      4
4       N2   {(CON, PLACE)}               5
5       N3   {(CON, TIME)}                6
6       N4   {(CON, MURDER&VIOLENCE)}     output (7) = {(18, (FIX, ⟨⟨CRIME MATERIAL⟩⟩, ⟨⟨MURDER&VIOLENCE⟩⟩))}

Table 5
Examples of the matching process in the second stage.

STATE   N    R                            goto/output
1       N1   {(CON, GAME)}                11
11      N2   {(CON, MURDER&VIOLENCE)}     output (12) = {(21, (NON, ⟨⟨MURDER&VIOLENCE⟩⟩))}

Table 6
The contents of the basic detection knowledge.

             NUM-WORD   NUM-RULE   NUM-PAT
Inadequate   16,239     1281       12,875,138
Crime        12,681     1378       9,486,523
Total        28,920     2659       22,361,661


A specificity of 100% means that all non-malicious expressions are detected as non-malicious, and a sensitivity of 100% means that all malicious expressions are detected as malicious. We can say that the presented method has a low error rate and can reduce the human effort spent on a large number of malicious candidates.

Other notation used for the malicious test data (I and C) is as follows:

ALL_DATA: The number of all data to be evaluated.
ALL_CORR: The number of all correct expressions to be extracted.
NUM_EXTR: The number of extracted expressions.

Table 7 shows the experimental results for malicious expressions, where SINGLE(I) and SINGLE(C) represent SINGLE for I and C, respectively; MULTI(I) and MULTI(C) are defined in the same way. Table 8 shows the experimental results for the non-malicious test data (NI and NC), where SINGLE(NI) and SINGLE(NC) represent SINGLE for NI and NC, respectively; MULTI(NI) and MULTI(NC) are defined in the same way.

Table 9 shows the evaluation results in terms of SPECIFICITY and SENSITIVITY, obtained from the TP, FN, TN and FP counts in Tables 7 and 8. From the simulation results in Table 9, it turns out that the SPECIFICITY and SENSITIVITY of MULTI are improved by 38.7 and 24.1 percentage points over those of SINGLE, respectively. The main reason is that the false positive (FP) error rate of MULTI is much lower than that of SINGLE in Table 8. The high TP and TN of MULTI are also related to this improvement. In general, the number of malicious expressions is much larger than that of non-malicious expressions; therefore, the presented method contributes to a reduction of human effort.

Moreover, the presented rule-based method has two important advantages, as follows:

(a) Unknown words: The presented rule-based method is well suited to extracting expressions on the changing Web, especially inappropriate Web contents. The reason is the set matching ability given by N ⊇ R. Suppose that N contains (CAT, UNKNOWN) when the input has unknown expressions. Then R is replaced by {(CAT, UNKNOWN)} and N ⊇ R is confirmed.

Table 7
Simulation results for inadequate (I) and crime (C) expressions.

                       ALL_CORR   NUM_EXTR   NUM_TP   NUM_FN   TP (%)   FN (%)
SINGLE(I)              1525       920        891      634      96.8     58.4
MULTI(I)               1525       1453       1247     252      85.8     81.8
MULTI(I)-SINGLE(I)     NON        533        356      382      11.0     23.3
SINGLE(C)              388        229        228      160      99.6     58.8
MULTI(C)               388        324        312      76       96.3     80.4
MULTI(C)-SINGLE(C)     NON        95         84       84       3.3      21.6
SINGLE                 1913       1149       1119     794      97.4     58.5
MULTI                  1913       1777       1559     328      87.7     81.5
MULTI-SINGLE           NON        628        440      466      9.7      23.0

NON represents empty data.

Table 8
Simulation results for non-inadequate (NI) and non-crime (NC) expressions.

                         ALL_NON   NUM_TN   NUM_FP   TN (%)   FP (%)
SINGLE(NI)               1277      526      751      41.1     58.8
MULTI(NI)                1277      1078     199      84.4     15.6
MULTI(NI)-SINGLE(NI)     NON       552      552      43.3     43.2
SINGLE(NC)               1382      228      160      56.7     43.3
MULTI(NC)                1382      312      76       91.2     8.8
MULTI(NC)-SINGLE(NC)     NON       84       84       34.5     34.5
SINGLE                   2659      754      911      49.2     50.8
MULTI                    2659      1390     275      87.9     12.1
MULTI-SINGLE             NON       636      636      37.8     38.7

NON represents empty data.

Table 9
Evaluation results by SPECIFICITY and SENSITIVITY.

                 SPECIFICITY (%)   SENSITIVITY (%)
SINGLE(I + NI)   41.2              58.4
MULTI(I + NI)    84.4              83.2
SINGLE(C + NC)   56.7              58.8
MULTI(C + NC)    91.2              80.4
SINGLE           49.2              58.5
MULTI            88.0              82.6


Consider the following input and rule structures:

N1 = {(STR, killed), (SEM, MURDER&VIOLENCE)}.
N2 = {(STR, ABC), (CAT, UNKNOWN)}.
N3 = {(STR, Tokyo Station), (CAT, PLACE)}.

Rule (7) = R71 R72 R73, where
R71 = {(SEM, MURDER&VIOLENCE)}, R72 = {(CAT, HUMAN)}, R73 = {(CAT, PLACE)}.

In this case, N1 ⊇ R71, N2 ⊇ R72 and N3 ⊇ R73 are satisfied because R72 = {(CAT, HUMAN)} is replaced by R72 = {(CAT, UNKNOWN)}. Therefore, transitions always succeed under this error recovery.

Although this error recovery produces many reachable transitions, it is a very practical and robust scheme because it is easy to restrict the upper bound of possible transitions in a practical system.
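A small sketch of this recovery, assuming the same set representation as before: when the input structure carries (CAT, UNKNOWN), the required rule structure is treated as if it were relaxed to {(CAT, UNKNOWN)}, so the inclusion test still succeeds. The helper name is illustrative.

UNKNOWN = ("CAT", "UNKNOWN")

def includes_with_recovery(n, r):
    """N includes R, or N is an unknown word and R is relaxed to {(CAT, UNKNOWN)}."""
    if r <= n:                      # ordinary set inclusion N ⊇ R
        return True
    return UNKNOWN in n             # relax R to {(CAT, UNKNOWN)}

n2 = {("STR", "ABC"), ("CAT", "UNKNOWN")}      # unknown user ID in the posting
r72 = {("CAT", "HUMAN")}
print(includes_with_recovery(n2, r72))          # True: the transition still succeeds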

It is a difficult task to register new words and expressions in dictionaries together with their categories and semantics. For this problem, it is clear that the above method does not need to be extended along with such registration. The important point of the robustness issue is how to extract possible candidates from malicious expressions containing many syntax errors and argotic words.

(b) Hierarchical concept matching: Consider "delete people at Tokyo station" with the following structures:

N1 = {(STR, delete), (SEM, DELETE)}.
N2 = {(STR, people), (CAT, NOUN), (CAT, HUMAN)}.
N3 = {(STR, Tokyo Station), (CAT, PLACE)}.

Suppose that the semantic meaning of the verb "delete" is the super-category DELETE of the category MURDER&VIOLENCE, and suppose that R71 of the above Rule (7) is {(SEM, DELETE\MURDER&VIOLENCE)}, where \ denotes a hierarchical notation. Hierarchical concept matching succeeds if DELETE of N1 is equal to the super-category DELETE of DELETE\MURDER&VIOLENCE in R71. This matching is weaker because it is not exact, but the extended matching is practical as an error recovery in which similar expressions can be extracted. That is to say, it enables us to support rule-based knowledge using concepts.
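A sketch of this hierarchical matching under the DELETE\MURDER&VIOLENCE notation; the backslash-separated path encoding and the helper name are assumptions for illustration.

def sem_matches(input_sem, rule_sem):
    """rule_sem is a backslash-separated path from super-category to category."""
    path = rule_sem.split("\\")
    # Exact category match, or the input carries one of the super-categories.
    return input_sem == path[-1] or input_sem in path[:-1]

print(sem_matches("MURDER&VIOLENCE", "DELETE\\MURDER&VIOLENCE"))  # exact category
print(sem_matches("DELETE", "DELETE\\MURDER&VIOLENCE"))           # super-category match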

The rule bases of the presented method MULTI are being built for frequent expressions step by step, but there remain difficult problems, as shown in the following example:

"RQJmcf2O kill Aaaaqqqbbb", where RQJmcf2O and Aaaaqqqbbb are user IDs.

Context analysis over a sequence of articles including past information should be proposed, but the current system has no way to describe the applicable rules. This technique depends on discourse analysis and remains for future research.

Moreover, there are some ungrammatical expressions, such as the following, where a word is written vertically one letter per line:

D
R
U
G
S

To solve this problem, special frozen analysis must be introduced case by case; this also remains for future research.

The Support Vector Machine (SVM) is a well-known approach. SVMs depend on words or sequences of words without considering the context of articles, which has several disadvantages (Burgess, 1998), as follows:

(1) The biggest limitation of the support vector approach lies in the choice of the kernel.
(2) The second limitation is speed and size, both in training and testing.
(3) Discrete data present another problem.
(4) The most serious problem with SVMs is the high algorithmic complexity and extensive memory requirements of large-scale tasks (Horvath, 2003).

However, the results detected by the presented method can be used by SVM schemes as learning features, because SVMs require a lot of correct training data. That is to say, SVMs and the presented method can work in a coordinated manner.
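The coordination suggested here could look roughly as follows: articles labelled by the rule-based method serve as training data for an SVM text classifier. This sketch assumes scikit-learn is available; the tiny corpus, the labels and the pipeline settings are placeholders, not the authors' setup.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Labels as produced by the rule-based filter: 1 = malicious, 0 = non-malicious.
texts = [
    "I get a strong sword. I will kill them.",
    "I have a strong sword in the next scene of RPG.",
    "This machine has many muzzles.",
    "This machine has many processes.",
]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)
print(model.predict(["Bring a sword to the station, I will kill them."]))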


    5.3. Time evaluation and error analysis

The detecting ability of the presented method is excellent, as described above, but it is also important to evaluate the time performance, together with essential error analysis, for the whole system, which consists of the following modules.

The first module, FOCUS, determines the essential text by removing redundant texts (advertising parts) from Web pages. This module is carried out by HTML tag processing. Highly frequent pages with the same tag format, such as product recommendations and news, are removed in this module to reduce the false positive error rate. In the second module, the morphological analysis MOPH determines parts of speech and fundamental concepts; for error analysis, unknown expressions are detected in this module. The module KW determines keywords consisting of sequential expressions from the results of morphological analysis. The module FIELD determines document fields (Atlam, Elmarhomy, Fuketa, Morita, & Aoe, 2006; Atlam, Fuketa, Morita, & Aoe, 2003; Fuketa, Lee, Tsuji, Okada, & Aoe, 2000; Fuketa et al., 2005). This module is carried out by matching field association words to the results of morphological analysis, and its results can be used to reduce the false positive error rate; examples are the game and computer fields for (b1) and (b2) in Fig. 1 and for Rules (13) and (14) in Table 1, respectively. The next module, NE, determines named entities such as names, organizations, places and so on; it is carried out on the results of keyword analysis (Asahara & Matsumoto, 2003; Wright & Budin, 1997). For example, "ABC Station" is a station name and "Nagoya company" is a company name. In the error analysis of this module, the unknown word "ABC" and the ambiguous name "Nagoya" from the module MOPH can be resolved. The module ATTR determines SC expressions by using the presented multi-attribute method.

SINGLE uses FOCUS, MOPH, KW and NE. MULTI uses FOCUS, MOPH, NE and ATTR, where NE is included in ATTR.
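A structural sketch of the module chain (FOCUS, MOPH, KW, FIELD, NE, ATTR) follows; every body below is a placeholder, since the paper only names the modules and their roles, and the SINGLE/MULTI compositions follow the sentence above.

def FOCUS(html):
    """Remove redundant parts (advertising) by HTML tag processing (placeholder)."""
    return html

def MOPH(text):
    """Morphological analysis: parts of speech and fundamental concepts (placeholder)."""
    return text.split()

def KW(morphs):
    """Keywords as sequential expressions (used by SINGLE)."""
    return morphs

def FIELD(morphs):
    """Document field decision via field association words (not in the chains below)."""
    return "game" if "RPG" in morphs else "unknown"

def NE(morphs):
    """Named entities: names, organizations, places (placeholder)."""
    return morphs

def ATTR(morphs):
    """SC expressions via the multi-attribute method; NE is carried out inside ATTR."""
    return NE(morphs)

def single(html):
    return NE(KW(MOPH(FOCUS(html))))

def multi(html):
    return ATTR(MOPH(FOCUS(html)))

print(multi("I get a strong sword"))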

The presented system has been developed on a Windows 2003 server with two Intel Xeon E5440 CPUs (2.83 GHz) and 2 GB of main memory. Fig. 5 shows the time expenses of the above modules, where the analysis time is estimated for 100 articles in HTML and in plain text (TEXT); the sizes of the HTML and TEXT data are 1 MB and 60 KB, respectively.

Fig. 5. Time evaluation of the presented method and the traditional method.

For the HTML documents in Fig. 5, it turns out that the time of the presented method is practical, although MULTI is about 1.28 times slower than SINGLE. In fact, MOPH and FOCUS can be performed by preprocessing servers, so the analysis time of the main module ATTR of MULTI becomes 20 ms per text article.

    6. Conclusion

The extracting scheme of traditional methods depends on words or sequences of words without considering the context of articles. Therefore, many irrelevant candidates for possible malicious expressions are extracted: although the current filtering schemes can alert malicious articles, many non-malicious articles are alerted as well. In order to solve these problems, this paper has presented a new filtering algorithm that detects SC expressions by introducing multi-attribute rules (MULTI). For 11,019 articles, it has been verified that the presented method can improve the false positive rate of the traditional method without degrading the false negative rate. Therefore, we can say that the presented method MULTI is a very useful approach for filtering services for inadequate expressions.

In future work, rule-based knowledge needs to be built for many more types of malicious postings, together with the error recovery.

    References

2 channel. <http://www.2ch.net/>.
Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6), 333-340.
Altman, D. G., & Bland, J. M. (1994). Diagnostic tests: Sensitivity and specificity. BMJ, 308, 1552.
Ando, K., Mizobuchi, S., Shishibori, M., & Aoe, J. (1998). Efficient multi-attribute pattern matching. International Journal of Computer Mathematics, 66(1-2), 21-38.
Anichiva. <http://cn.anchiva.com/download/Commtouch%20URL%20Filtering%20White%20Paper_Anichiva_En.pdf>.
Asahara, M., & Matsumoto, Y. (2003). Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 03 (pp. 8-15).
Atlam, E.-S., Elmarhomy, G., Fuketa, M., Morita, K., & Aoe, J. (2006). Automatic building of new field association word candidates using search engine. Information Processing & Management, 42(4), 951-962.
Atlam, E.-S., Fuketa, M., Morita, K., & Aoe, J. (2003). Documents similarity measurement using field association terms. Information Processing and Management, 39(6), 809-824.
Burgess (1998). <http://www.svms.org/disadvantages.html>.
Children Internet Protection Act. <http://en.wikipedia.org/wiki/Children's_Internet_Protection_Act>.
Claypool, M., Brown, D., LE, P., & Waseda, M. (2001). Inferring user interest. IEEE Internet Computing, 5, 32-39.
Digital Economic Act. <http://www.legifrance.gouv.fr/affichTexte.do?cidTexte=JORFTEXT000000801164&dateTexte>.
Francis, W., Frantz, V., & Mathieu, S. (2000). Using learning-based filters to detect rule-based filtering obsolescence. In Proceedings of RIAO 2000, Paris.
Fuketa, M., Kadoya, Y., Atlam, E.-S., Kunikata, T., Morita, K., Kashiji, S., et al. (2005). A method of extracting and evaluating good and bad reputations for natural language expressions. Information Technology & Decision Making, 4(2), 77-196.
Fuketa, M., Lee, S., Tsuji, T., Okada, M., & Aoe, J. (2000). A document classification method by using field association words. Information Sciences, 126(1), 57-70.
Gakkou-Ura. <http://schecker.jp/> (in Japanese).
Gharieb, R. R. (2000). Higher order statistics based IIR notch filtering scheme for enhancing sinusoids in colored noise. IEE Proceedings - Vision, Image and Signal Processing, 147(2), 115-121.
Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35, 61-70.
Goldberg, K., Roeder, T., Guptra, D., & Perkins, C. (2001). Eigentaste: A constant-time collaborative filtering algorithm. Information Retrieval, 4, 133-151.
Good, N., Schafer, J. B., Konstan, J. A., Borchers, A., Sarwar, B. M., & Harter, S. P. (1996). Variations in relevance assessments and the measurement of retrieval effectiveness. Journal of the American Society for Information Science, 47, 37-49.
Heckerman, D., Chickering, D. M., Meek, C., Rounthwite, R., & Kadie, C. (2000). Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1, 49-75.
Herlocker, J. L., Konstan, J. A., & Riedl, J. (2002). An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms. Information Retrieval, 5, 287-310.
Horvath (2003). In Suykens et al., p. 392.
Kadoya, Y., Morita, K., Fuketa, M., Ohono, M., Atlam, E.-S., Sumitomo, T., et al. (2005). A sentence classification technique by using intention association expressions. International Journal of Computer Mathematics, 82(7), 777-792.
Kim, S., Min, H., Jeon, J., Man Ro, Y., & Han, S. (2009). Malicious content filtering based on semantic features. In Proceedings of the ACM international conference on interaction sciences: Information technology, culture and human (pp. 802-806), Seoul, Korea.
Kiyoi, K., Atlam, E.-S., Fuketa, M., Yoshinari, T., & Aoe, J. (2008). A method for extracting knowledge from medical texts including numerical representation. International Journal of Computer Applications in Technology, 33(2/3), 226-236.
Landau, M. C., Sillion, F., & Vichot, F. (1993). Exoseme: A thematic document filtering system. In Intelligence Artificielle, Avignon, France.
Larry, M., & Malik, Y. (2001). One-class SVMs for document classification. Journal of Machine Learning Research, 139-154.
Lee, W., Lee, S., Chung, S., & An, D. (2007). Harmful contents classification using the harmful word filtering and SVM. In Proceedings of the 7th international conference on computational science, Part III: ICCS 2007 (pp. 18-25), May 27-30, 2007, Beijing, China.
Livejournal. <http://www.livejournal.com/>.
Mixi. <http://mixi.jp/> (in Japanese).
myspace. <http://us.myspace.com/>.
Pennock, D. M., Horvitz, E., Lawrence, S., & Giled, C. L. (2000). Collaborative filtering by personality diagnosis: A hybrid memory- and model-based approach. In Proceedings of the sixteenth annual conference on uncertainty in artificial intelligence (UAI-2000) (pp. 473-480), Morgan Kaufmann, San Francisco.
Provider Liability Act. <http://law.e-gov.go.jp/htmldata/H13/H13HO137.html>.
Reddy, P. K., Kitsuregawa, P., Sreekanth, P., & Rao, S. S. (2002). A graph based approach to extract a neighborhood customer community for collaborative filtering. In Lecture notes in computer science: Databases in networked information systems, second international workshop (pp. 188-200), Springer.
Shiraki, N., Hara, M., Ogino, H., Shibamoto, Y., Iida, A., Tamaki, T., et al. (2004). False-positive and true-negative hilar and mediastinal lymph nodes on FDG-PET: Radiological-pathological correlation. Annals of Nuclear Medicine, 18(1), 23-28.
Wang, J., Arjen, P., & Marcel, J. T. (2006). Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In Proceedings of SIGIR 2006, August 6-11, 2006, Seattle, Washington, USA.
Wright, S. E., & Budin, G. (1997). Handbook of terminology management. Basic aspects of terminology management (Vol. 1). Amsterdam, Philadelphia: John Benjamins.
Xu, J., Chong, Z., Lu, H., & Zhou, A. (2004). False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In Proceedings of the 30th VLDB (pp. 204-215), Toronto, Canada.
Yahoo! BBS. <http://messages.yahoo.co.jp/index.html>.
Yokoku.in. <http://yokoku.in/> (in Japanese).
Yoohwan, K., Wing, C., Mooi, C., & Chao, H. J. (2006). PacketScore: A statistics-based packet filtering scheme against distributed denial-of-service attacks. IEEE Transactions on Dependable and Secure Computing, 3(2), 141-155.
Yoshinari, T., Atlam, E.-S., Morita, K., Kiyoi, K., & Aoe, J. (2008). Automatic acquisition for sensibility knowledge using co-occurrence relation. International Journal of Computer Applications in Technology, 33(2/3), 218-225.
Youth Protection Act. <http://www.wien.gv.at/recht/landesrecht-wien/landesgesetzblatt/jahrgang/2002/html/lg2002017.htm>.