Predicting associated statutes for legal problems


Yi-Hung Liu a,*, Yen-Liang Chen a, Wu-Liang Ho b

a Department of Information Management, National Central University, Chung-Li 320, Taiwan, ROC
b Department of Legal Service, Straits Exchange Foundation, Taipei 105, Taiwan, ROC

Article info

Article history:
Received 16 October 2013
Received in revised form 29 May 2014
Accepted 9 July 2014
Available online xxxx

Keywords:
Text mining
Statute
Criminal judgment
Normalized Google Distance (NGD)
Support vector machines (SVM)
Apriori algorithm

Abstract

Applying text mining techniques to legal issues has been an emerging research topic in recent years. Although a few previous studies focused on assisting professionals in the retrieval of related legal documents, to our knowledge, no previous studies could provide relevant statutes to the general public using problem statements. In this work, we design a text mining based method, the three-phase prediction (TPP) algorithm, which allows the general public to use everyday vocabulary to describe their problems and find pertinent statutes for their cases. The experimental results indicate that our approach can help the general public, who are not familiar with professional legal terms, to acquire relevant statutes more accurately and effectively.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

The law represents social norms, which protect civilian rights and maintain social order. To achieve these objectives, the legislature enacts legal provisions preserving people's rights to life, property, and so on. As an increasing number of individuals' rights have been violated, more and more litigation has occurred.

When people have legal issues, it is critical to know which statutes are involved. Usually, because of a lack of legal knowledge, they may seek help from legal experts, such as attorneys, or from automatic systems. Since legal consultation is very costly, automatic systems are a much more affordable form of legal support. They involve utilizing search engines on the Internet or searching legal databases, such as Westlaw International (2013) and LexisNexis (2013). Although these automatic systems provide query methods, users cannot obtain pertinent statutes through simple case statements. This motivated us to propose a new approach that will help laypeople obtain relevant statutes by simply stating their problem or case using daily vocabulary, and without the help of legal experts. In Fig. 1, we show the proposed statute retrieval approach.

The purpose of this research is to provide a statute retrieval method that will help people deal with their legal problems more effectively. It can be helpful to at least two types of people. First, for specialists, this method can reduce workloads and be used as a reference when dealing with legal cases. As the sheer number of legal cases increases, legal experts need to expend more time and energy in their work. This research provides an aid for efficiently processing cases. Second, for laypeople, this approach can reduce searching and consultation needs. When laypeople have legal issues, it is difficult for them to acquire related statutes through existing automatic systems because of their insufficient knowledge of professional legal terms. This results in incorrect search results and lengthens the searching process. With the help of our approach, this insufficient knowledge problem can be alleviated.


* Corresponding author. E-mail address: [email protected] (Y.-H. Liu).


Fig. 1. Framework of relevant statute retrieval.


In recent years, text mining research has received increasing attention. Basically, text mining is the procedure of uncovering salient features and information from textual data. Since most human knowledge is stored in text, abundant text mining applications and methods have been developed. Examples of these text mining applications include patent retrieval (Chen & Chiu, 2011; Tikk, Biró, & Törcsvári, 2007), e-mail security (Bergholz et al., 2010; Wei, Chen, & Cheng, 2008), news categorization (Calvo, 2001; Zheng, Milios, & Watters, 2002), authorship identification (Stamatatos, 2009; Zheng, Li, Chen, & Huang, 2006), scientific document retrieval (Kaur, Yusof, Boursier, & Ogier, 2010), document sentiment analysis (Li & Wu, 2010; Schumaker, Zhang, Huang, & Chen, 2012), document summarization (Goldstein, Mittal, Carbonell, & Kantrowitz, 2000; Li, Du, & Shen, 2013; Wang, Zhu, Li, & Gong, 2009), online advertisement recommendations (Thomaidou & Vazirgiannis, 2011; Wang, Wang, Duan, Tian, & Lu, 2011), search engines (Kawai, Jatowt, Tanaka, Kunieda, & Yamada, 2011; Yin, 2007), etc.

Text mining has been applied in various areas. Although a few past studies applied text mining techniques to the legal domain (Chen & Chi, 2010; Chou & Hsing, 2010; Conrad & Schilder, 2007; Moens, 2001), all of them focused only on helping professional users retrieve or classify legal documents. None of them considered how to help laypeople retrieve relevant statutes from a case statement using daily customary terms. Therefore, this research aims to develop a framework for statute prediction that will remedy this problem. This framework is built with judgments and statutes. Judgments are included in the framework because they contain the facts of the crime and the cited statutes from the judge's adjudication. From them, we can find the connections between the problem and the cited statutes. In turn, these connections help us to determine the most relevant statutes with respect to the user's problem.

The prediction method process is shown in Figs. 2 and 3, Batch and Online, respectively. In the Batch process, three outputs are generated to be used in the Online process. The first output is a classification model that classifies cases to statutes, and is produced by adopting a SVM (support vector machine) classifier. In the second output, all statutes are represented as statute vectors. The last output is a set of association rules, which show what statutes frequently occur together, and is generated from the training collection of judgments. In the Online process, the classification model is adopted to acquire the prediction of the top k1 statutes for the user query. Then, the NGD (Normalized Google Distance) method (Cilibrasi & Vitanyi, 2007; Evangelista & Kjos-Hanssen, 2006) is used to perform terms transformation between the statutes and the user query, and the top k2 most similar statutes are selected. Finally, by applying associative statute rules to the top k2 statutes, the statute weight computation metric is defined, so as to obtain the most relevant statutes for the user query.

The advantages of our approach are that (1) ordinary users can express their cases using daily vocabulary, (2) a bridge is created between laypeople and legal statutes, and (3) the most pertinent statutes are recommended to users. This work acquires relevant statutes by developing a three-stage algorithm. In the first stage, we utilize the multi-label SVM text

Fig. 2. Batch process of the prediction approach.


classifier to classify the cases into k1 statutes. Then, from these k1 statutes, the second stage selects the most similar k2 statutes, with respect to the user query, by employing the semantic relatedness measure (i.e., the Normalized Google Distance method). Finally, from these k2 statutes, we select the final set of statutes by considering the associations among statutes. The contributions of this research are as follows.

1. It is motivated by a real-world societal need arising from the general public. To our knowledge, this innovative approach to statute prediction in the legal domain has not been attempted in any previous research.

2. It proposes an innovative approach, called TPP (Three Phase Prediction), to predict relevant statutes for the problem described by the user. The core of the approach is to bridge the gap between lay terms and legal terms without using a synopsis.

3. An evaluation metric (i.e., coverage) was adopted to confirm the performance of the proposed TPP approach. Experimental results show that it performs accurately and effectively.

The rest of this paper is organized as follows. The relevant literature is first reviewed and described in Section 2. The research design is then introduced in Section 3. In Section 4, we discuss the experiment results and evaluations. Finally, conclusions and future directions are presented in Section 5.

2. Literature review

2.1. Background

The legal system in Taiwan is based upon the written law, which is enacted by the legislature or the congress, which also develops a variety of codes and regulations. In court cases, the judge's sentence must be in accordance with the stipulations and statutes of the law. Furthermore, to determine a clear sentence, the judge must interpret the statutes and the factual circumstances of the case. The statutes and judgments play an important role in the judge's verdict in governing society.

A judgment is the final decision by a court in a lawsuit to resolve the controversial issues and terminate the lawsuit. The court may also make a range of court orders, such as imposing a sentence upon a guilty defendant in a criminal matter, or providing a remedy for the plaintiff in a civil matter. In addition, a judgment also signifies the end of the court's jurisdiction in the case. The form of the judgment is generally revealed in a combination of segments; some are necessary, and others are optional, arranged in a fixed or partially fixed order. A judgment consists of the file number, the accused, the counsel, the cause of action, the main body of a court verdict, the facts and reasons, the cited statutes, the date of judgment, and the judge.

In our study, we utilize a collection of judgments as training documents. Two relevant parts of a judgment are employed for processing, the facts and the cited statutes. Representative keywords for a judgment are extracted from the facts, while the cited statutes can be seen as a classification label for the judgment. Both the keywords and the cited statutes are used to train a text classifier. Additionally, a statute's contents and its relationships are two important traits needed to accurately acquire relevant statutes.

Fig. 3. Online process of the prediction approach.


2.2. An overview of text mining

Text mining is an analysis process that discovers hidden features and extracts salient information from the sheer volume of documents for further processing. It uses techniques from information retrieval (IR), information extraction (IE), and natural language processing (NLP), and connects them using data mining, machine learning, and statistics.

In general, an IR system is composed of three components: documents, queries, and matching/ranking functions. Most prior studies are based on the idea that text documents can be represented as vectors in multi-dimensional space, called the Vector Space Model (VSM) (Salton, Wong, & Yang, 1975). In VSM, documents and queries are represented as vectors of terms, and its accuracy and performance rely on the selected vector base. To determine the matching function, the similarity between queries and documents can be computed using the distance or correlation between the corresponding vectors (Baeza-Yates & Ribeiro-Neto, 1999; Hotho, Nürnberger, & Paaß, 2005; Salton et al., 1975). Some examples of these methods include Euclidean distance, the cosine measure, and the Pearson coefficient. According to Trappey and Trappey (2008), several weighting schemes have been proposed to compute term weights, such as TF (Baeza-Yates & Ribeiro-Neto, 1999), TF-IDF (Salton, Allan, & Buckley, 1994; Salton & McGill, 1983), entropy (Hotho et al., 2005; Lochbaum & Streeter, 1989), and chi-square (χ²) (Feldman & Sanger, 2007; Li, Luo, & Chung, 2008). Among these schemes, TF-IDF is a widely used term weighting scheme introduced in VSM. Various evaluation studies (Salton, 1989; Salton & Buckley, 1988) have shown that VSM was one of the most successful models, and most existing information retrieval systems were designed based on it.

Multiple IR systems have been developed and put to use in different fields. Every IR system has its own traits, architecture, and limitations, but contributes to specific or open domains. In the late 1960s, the SMART information retrieval system was implemented and presented many important research concepts, including the vector space model, relevance feedback, and Rocchio classification (Buckley, 1985). Over the past few years, a wide variety of IR systems were proposed, including search engines (e.g., Google and Yahoo! Search), digital libraries (e.g., the Digital Public Library of America), and information filtering (e.g., spam filters). In particular, a flexible and functional API, Lucene (lucene.apache.org), was introduced, which is a Java-based indexing and search engine library and a technology appropriate for applications that require full-text search. It provides several features, including many query types (e.g., phrase queries, wildcard queries, and more), fielded searching, multiple-index searching with merged results, and two ranking models: VSM and BM25 (Robertson & Zaragoza, 2009).

The main concern of a question-answering (QA) system is to automatically and accurately answer questions posed by humans in natural language. Typically, QA systems can be divided into two categories: open-domain and domain-specific. An open-domain QA system aims at returning an answer to a user's question with short texts rather than a list of relevant documents. START (start.csail.mit.edu) is one of the earliest web-based systems, publicly accessible since 1993 (Katz, 1997), and it plays a notable role in the evolution of QA systems. In 1999, the Text REtrieval Conference (TREC) began to supply a standard QA evaluation track based on a large collection of documents. A famous open-domain QA system, IBM's Watson system (Ferrucci et al., 2010), was developed by the IBM DeepQA research team. The main design idea behind Watson is that it synthesized information retrieval, natural language processing, knowledge representation and reasoning, machine learning, and computer-human interfaces. The information resources include encyclopedias, dictionaries, thesauri, newswire articles, and literary works. Besides, sources are identified and collected in the content-acquisition process, including databases, taxonomies, and ontologies, such as dbPedia (dbpedia.org), WordNet (Miller, 1995), and the Yago (www.mpi-inf.mpg.de/yago-naga/yago/) ontology. Although Watson provides accurate answers to questions, it is not available to the public.

For domain-specific question answering, fewer recent works have been developed, such as a medical-domain QA system, MedQA (Lee et al., 2006), which employs a supervised machine learning approach to perform question classification based on an evidence taxonomy built up by physicians. In the engineering education domain, Diekema, Yilmazel, and Liddy (2004) developed the Knowledge Acquisition and Access System (KAAS) QA system using a user-oriented approach in a collaborative learning environment. López-Moreno et al. (2007) proposed an analysis of the problems and challenges of QA systems in the academic domain. Despite those promising QA systems proposed recently, to the best of our knowledge, there is still no well-built QA system in the legal domain.

2.3. Applications of text mining

Text mining has been successfully applied in many diverse areas, with each of the applications exhibiting specific traits. When text appears in new document types, it usually leads to novel text mining applications. For example, text mining methods can be used in news texts to determine the news category (Calvo, 2001; Zheng et al., 2002). For patent documents, patent retrieval methods can be developed according to text similarities (Chen & Chiu, 2011; Tikk et al., 2007). In e-mail, we can determine if the mail is spam by analyzing its text (Bergholz et al., 2010; Wei et al., 2008). For scientific documents, intelligent search methods can be designed by comparing text similarities and other attributes (Kaur et al., 2010). In a commentary, text mining methods can help discover the main themes, the authors' attitudes, and sentiments (Hsu, Chen, Lin, Hsieh, & Shih, 2012; Li & Wu, 2010; Reyes, Rosso, & Buscaldi, 2012; Schumaker et al., 2012).

Additionally, text mining methods can be used to guess the identities of anonymous authors (Stamatatos, 2009; Zheng et al., 2006). They can also be used to automatically generate an abstract or summary of long documents based on sentence features (Goldstein et al., 2000; Li et al., 2013; Wang et al., 2009). Furthermore, commercial web sites can use text mining


methods to recommend appropriate advertisements (Thomaidou & Vazirgiannis, 2011; Wang et al., 2011). As one of the most important contributions, the modern search engine was developed by utilizing text mining techniques (Kawai et al., 2011; Yin, 2007). In the last few years, QA has received more attention. Community-based QA (CQA) is an emerging focus of prior studies on online information seeking, and it retrieves the most appropriate answers by segmenting multi-sentence user questions (Wang, Ming, Hu, & Tat-Seng Chua, 2010). For question retrieval, the analysis of many questions was not only based on factual knowledge, but also incorporated sentiment analysis of user intent (Chen, Zhang, & Levene, 2013). A multimedia content analysis approach was proposed to help text QA acquire relevant answers by adopting multimedia information such as images and video (Nie, Wang, Zha, Li, & Chua, 2011). It is impossible to list every text mining application, but the above discussion should illustrate the impact text mining has had in countless situations.

2.4. Related academic research on text mining in the legal domain

Text mining in the legal domain has been an emerging research topic in recent years. So far, only several studies have been done on this topic. Moens (2001) gave an overview of text mining methods and discussed their potential to help improve legal document retrieval. Moens (2005) also proposed several XML retrieval models that exploit structured and unstructured legislative document information in response to a query. EgoIR (Gomez-Perez, Ortiz-Rodriguez, & Villazon-Terrazas, 2007) is a system that provides an efficient way to retrieve legal information from e-Government documents based on a legal ontology. Conrad and Schilder (2007) presented an opinion mining application on legal web blogs evaluated by sentiment analysis. In Chen and Chi (2010), the aim was to retrieve the most similar historical judgments for the prosecutor using the police's criminal investigation documents. Chou and Hsing (2010) developed a legal document classification, clustering, and search methodology based on neural network technology, helping law enforcement to manage criminal written judgments. Chen, Liu and Ho (2013) introduced an approach to assist the general public in retrieving the most relevant judgments using ordinary terminology or statements as queries.

All of the above studies were designed to support users in retrieving or managing related legal documents. They did not, however, consider providing relevant statutes with respect to legal problems. This deficiency motivated us to propose a statute prediction system that is custom-designed for the general public.

3. Research design

Since the criminal code includes many kinds of crime, and the judgment for a specific offense is usually dependent on the articles and the judge's perspective, there has been no absolute standard for court decisions. Therefore, designing automatic methods to determine the exact statutes for a target judgment has been extremely difficult. Intuitively, a legal document can be regarded as a text document. However, several differences exist between legal documents and normal documents.

As Table 1 illustrates, there are four major differences. These four unique characteristics of legal judgments suggest a need to develop a brand new approach that can address such differences.

(1) The terms used by laypeople differ from the ones that appear in legal documents, such as judgments and statutes. Problems arise when we search for relevant statutes using laypeople's queries due to keyword differences.

(2) In statute prediction, a statute is regarded as a label in the classification model. Since each judgment has at least one cited statute, judgments have multi-labels.

(3) In traditional document classification, a label is a tag without content. In statute prediction, however, a statute is a paragraph of text or a sentence. In other words, a label is no longer just a tag, but content-rich text.

(4) In traditional document classification, labels are independent of each other. In judgments, however, some statutes are more likely to appear together than others. Association relations may exist between statutes.

In order to more accurately determine statutes, our method must take into account these four characteristics. Therefore, we propose a three-phase prediction (TPP) approach that can be used to improve the accuracy of traditional text classifiers for judgments. There are two collections of criminal judgments, a training collection and a test collection, where the TPP approach can perform automated statute prediction. Before classifying the judgments, a text preprocessing procedure is performed on all documents in the training collection. In the preprocessing procedure, we conduct CKIP (2013) for

Table 1
Summary of differences between traditional texts and legal judgments.

Traditional texts                        Legal judgments
Query keywords = Document keywords       Query keywords ≠ Document keywords
Single labels                            Multi-labels
Label without content                    Label with content
Labels are independent                   Labels are dependent


segmentation, Chinese stop word elimination, and POS filtering operations. Most studies have traditionally selected nouns to represent documents. In criminal case documents, however, other POS tags are also meaningful, such as adjectives (共同, together), nouns (罪犯, criminal), and verbs (破壞, damage). Therefore, we revised the traditional POS filtering rule to keep nouns, verbs, and adjectives as candidate terms.
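As a rough illustration of this preprocessing step, a minimal sketch is given below. The paper uses the CKIP segmenter; the open-source jieba tagger stands in here purely as an illustrative substitute, and the stop-word set and kept POS prefixes are placeholder assumptions.

```python
# Illustrative preprocessing sketch (jieba stands in for the CKIP segmenter used
# in the paper; the stop-word set and kept POS prefixes are placeholder assumptions).
import jieba.posseg as pseg

STOP_WORDS = {"的", "了", "在"}        # placeholder stop-word list
KEPT_POS_PREFIXES = ("n", "v", "a")    # keep nouns, verbs, and adjectives

def preprocess(text):
    """Segment a Chinese text and keep noun/verb/adjective terms that are not stop words."""
    terms = []
    for token in pseg.cut(text):
        if token.word not in STOP_WORDS and token.flag.startswith(KEPT_POS_PREFIXES):
            terms.append(token.word)
    return terms
```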

For simplicity, we introduce our method based on the three phases of the online process; we do not discuss the batch process separately. However, whenever needed in discussing the online process, we will explain the procedures from the batch process.

The TPP approach is depicted in Fig. 4. The design framework is separated into three phases: (1) select the top k1 statutes, (2) select the top k2 statutes, and (3) select the final predicted statutes. Fig. 4 shows which characteristics are addressed in each phase. These three phases are described and explained in detail below.

3.1. Phase 1: Select the top k1 statutes

The SVM algorithm is a well-known multi-label classification algorithm. It has several specific advantages in processing nonlinear and high dimensional model identification problems, and small data samples. When classifying, it also minimizes ranking loss and generates a better classification model for prediction (Tsoumakas, Katakis, & Vlahavas, 2010).

In this phase, the SVM classifier is applied to the training collection of judgments to generate a classification model. This classification model is used to predict statutes for the input query. Fig. 5 shows the process of classification model generation. After preprocessing, feature selection is necessary to reduce the document dimensions; then, the Vector Space Document (VSD) model is used to represent the documents.

This research attempts to generate representative vectors for each criminal case in the gathered document set. Throughout this paper, each document (i.e., criminal judgment) is denoted by the term set {t1, t2, ..., tm}, where ti is a term and m is the number of terms that occur in the document. The relevance of each ti in document dj is given a weight, and we use the following weighting scheme for assigning weights to terms, w(ti, dj) = tf(ti,j) × idf(ti):

tf(ti,j) = freq(ti,j): the normalized frequency of term ti in document dj

idf(ti) = log(n / ni): n is the number of documents; ni is the frequency of term ti in the document collection

w(ti, dj) = tf(ti,j) × idf(ti): the weight of term ti in document dj
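A minimal sketch of this weighting scheme is shown below; it assumes tokenised documents and reads ni as the number of documents containing the term (document frequency), which is one plausible interpretation of the definition above.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute w(ti, dj) = tf(ti,j) * idf(ti) for a list of tokenised documents.

    tf is the term frequency normalised by the most frequent term in the document;
    idf(ti) = log(n / ni), where n is the number of documents and ni is read here
    as the number of documents containing ti (an assumption about the definition).
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values()) if freq else 1
        vectors.append({t: (f / max_freq) * math.log(n / df[t]) for t, f in freq.items()})
    return vectors
```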

Since not all features in a document collection are helpful for classification, features that are less discriminative must be eliminated in the classification process. Feature selection is a necessary step in reducing document dimensions. Dimensionality reduction has been widely explored in single-label data, but these feature selection methods (Liu & Yu, 2005; Ribeiro, Neto, & Prudêncio, 2008; Rogati & Yang, 2002) cannot be directly applied to multi-label data. Therefore, a multi-label entropy method (Clare & King, 2001) is employed, which calculates the entropy of each term ti across statutes as follows:

Fig. 4. Three-phase prediction approach.

Fig. 5. Process of classification model generation.


Entropy(ti) = −Σ_{k=1..r} [ p(Sk|ti) log p(Sk|ti) + q(Sk|ti) log q(Sk|ti) ]

p(Sk|ti) denotes the probability of statute Sk given that term ti appears; q(Sk|ti) = 1 − p(Sk|ti) denotes the probability of statute Sk given that term ti is absent. The smaller the term entropy value, the greater its distinguishing ability. This is because if a given term ti has a small entropy value, then term ti is concentrated in fewer specific statutes, instead of being widely distributed over numerous statutes.
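The sketch below shows how this entropy score could be computed per term; estimating p(Sk|ti) as the fraction of term-containing training judgments that cite Sk is our assumption, not a detail stated in the paper.

```python
import math

def term_entropy(term, docs, labels, statutes):
    """Multi-label entropy of a term across statutes (smaller = more discriminative).

    docs: list of token lists; labels: list of sets of cited statute ids (one per doc);
    statutes: all statute ids. p(Sk|ti) is estimated as the fraction of documents
    containing the term that cite Sk (an assumed estimator).
    """
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    if not with_term:
        return float("inf")  # a term that never appears carries no information
    entropy = 0.0
    for s in statutes:
        p = sum(1 for lab in with_term if s in lab) / len(with_term)
        q = 1.0 - p
        if 0.0 < p < 1.0:
            entropy -= p * math.log(p) + q * math.log(q)
    return entropy
```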

After feature selection, the training judgments are transformed into vector space documents (VSDs) according to TF-IDF. Additionally, each VSD is assigned a set of labels, where each label is a statute cited in the judgment. Let BASE denote the set of judgment terms, i.e., all terms selected from the training judgments, used as the vector base in Phase 1. Then, these VSDs, as well as their labels, are analyzed by the SVM algorithm and the classification model is generated. After the classification model is built, the SVM prediction is applied to the user query, as shown in Fig. 6, and the probabilities of all statutes for the user query are produced. All the statutes are sorted by probability, and the top k1 statutes are selected as the output.
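The paper trains this classifier with libSVM; the following sketch shows an equivalent Phase 1 set-up using scikit-learn, where a one-vs-rest SVM is fitted on the TF-IDF judgment vectors and statutes are ranked for a query by predicted probability. Function and variable names are ours.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

def train_statute_classifier(X_train, statute_labels):
    """X_train: (n_docs, n_features) TF-IDF matrix; statute_labels: list of sets of cited statutes."""
    binarizer = MultiLabelBinarizer()
    Y = binarizer.fit_transform(statute_labels)
    model = OneVsRestClassifier(SVC(kernel="linear", probability=True))
    model.fit(X_train, Y)
    return model, binarizer

def top_k1_statutes(model, binarizer, query_vector, k1):
    """Rank all statutes by predicted probability for one query vector and keep the top k1."""
    probs = model.predict_proba(query_vector.reshape(1, -1))[0]
    ranked = np.argsort(probs)[::-1][:k1]
    return [binarizer.classes_[i] for i in ranked]
```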

To apply the classification model, in this phase, the terms in the user query must be transformed into the judgment terms in BASE. There are a couple of difficulties, however, involved in this process. First, a query term may be related to multiple judgment terms, each of which is associated with a different degree of strength. Secondly, it is difficult, if not impossible, to set up a dictionary or an ontology model that can store all these mapping relations.

To overcome these difficulties, Normalized Google Distance (NGD) is employed in this phase. It utilizes a well-known search engine, Google, which can return aggregate page-count estimates for a query or keyword. Normalized Google Distance (Cilibrasi & Vitanyi, 2007; Evangelista & Kjos-Hanssen, 2006) relies on the number of web pages, found using the Google search engine, that contain a given word; the page count is used to correlate one word or phrase with another word's meaning. In our study, we applied the NGD method to terms transformation.

Fig. 7 demonstrates the process of query terms transformation. The preprocessing procedure is used to segment a user query into multiple terms. In order to transform query terms into the judgment term set (i.e., BASE), we utilize the NGD formula, which is defined as follows:

NGD(ti, uj) = [ max{log f(ti), log f(uj)} − log f(ti, uj) ] / [ log M − min{log f(ti), log f(uj)} ]

M denotes the total number of web pages indexed by Google Search, f(ti) denotes the number of web pages containing term ti, f(uj) denotes the number of web pages containing judgment term uj, and f(ti, uj) denotes the number of web pages containing both term ti and judgment term uj. The value of the NGD function indicates the degree of similarity between term ti and judgment term uj on a zero to one scale. A distance value of zero indicates that two terms are practically the same; two independent terms have a distance value of one.
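A direct transcription of this formula, given page counts obtained from a search engine, could look as follows; the handling of a zero co-occurrence count is our own convention.

```python
import math

def ngd(f_ti, f_uj, f_both, M):
    """Normalized Google Distance from page counts.

    f_ti, f_uj: page counts for each term alone; f_both: pages containing both terms;
    M: total number of indexed pages. Counts would normally be fetched from a search
    engine; here they are plain arguments.
    """
    if f_both == 0:
        return 1.0  # convention: terms that never co-occur are treated as unrelated
    log_x, log_y = math.log(f_ti), math.log(f_uj)
    numerator = max(log_x, log_y) - math.log(f_both)
    denominator = math.log(M) - min(log_x, log_y)
    return numerator / denominator
```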

Furthermore, we can associate a query term ti with k judgment terms u1, u2, ..., uk using different degrees. The vector gi = (gi,1, ..., gi,k) is a k-dimensional vector, where gi,j represents the similarity between term ti and judgment term uj. The definition of gi,j is as follows:

gi,j = 1 − NGD(ti, uj)

Fig. 6. SVM Prediction procedure.

Fig. 7. Process of transforming query terms into judgment terms.


The higher the value of gi,j, the greater the similarity between term ti and judgment term uj. Based on the computed values, the following method is proposed to transform a query term into multiple judgment terms. This method is divided into four steps.

Step 1: Calculate the similarity value gi,j between query term ti and each of the k judgment terms uj.

Step 2: Let pcti,j be the proportion of gi,j to the sum of gi,j over all judgment terms. Additionally, let lwi,q be the weight of term ti in query q. The formulas of pcti,j and lwi,q are defined as follows:

pcti,j = gi,j / Σj gi,j

lwi,q = freqi,q / maxl freql,q

where maxl freql,q is computed over all terms that occur within query q.

Step 3: Distribute the weight lwi,q to the k judgment terms uj according to pcti,j, which is defined as twi,j:

twi,j = lwi,q × pcti,j

Step 4: Sum up the weights twi,j of each judgment term.

Figs. 8 and 9 show an example of how to distribute the weight of a query term to judgment terms and how to add together the weights of each judgment term. Assume that the similarity values of query term T1 to terms U1, U3, and U4 are 0.7, 0.2, and 0.4, respectively, while those of T2 to terms U2, U3, and U4 are 0.8, 0.5, and 0.3, respectively. In step 2, we found that the percentages of U1, U3, and U4 with respect to T1 are 0.54, 0.15, and 0.31, respectively, while those of U2, U3, and U4 to T2 are 0.5, 0.31, and 0.19, respectively. Therefore, the weights of term T1 distributed to terms U1, U3, and U4 are 0.54, 0.15, and 0.31, respectively, while the weights of term T2 distributed to terms U2, U3, and U4 are 0.3, 0.186, and 0.114, respectively. Finally, the weights of judgment terms U1, U2, U3, and U4 are 0.54, 0.3, 0.336, and 0.424, respectively.
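The four steps can be condensed into a few lines of code; the sketch below reproduces the worked example above. The lw values 1.0 and 0.6 for T1 and T2 are inferred from the figures (an assumption), and small differences from the quoted numbers come from rounding the percentages.

```python
from collections import defaultdict

def distribute_query_weights(similarities, query_weights):
    """similarities: {query_term: {judgment_term: g_ij}}; query_weights: {query_term: lw_iq}."""
    judgment_weights = defaultdict(float)
    for t, sims in similarities.items():
        total = sum(sims.values())
        for u, g in sims.items():
            pct = g / total                                   # Step 2: proportion pct_ij
            judgment_weights[u] += query_weights[t] * pct     # Steps 3 and 4: distribute and sum
    return dict(judgment_weights)

weights = distribute_query_weights(
    {"T1": {"U1": 0.7, "U3": 0.2, "U4": 0.4},
     "T2": {"U2": 0.8, "U3": 0.5, "U4": 0.3}},
    {"T1": 1.0, "T2": 0.6},   # lw values assumed from the example
)
# weights is approximately {"U1": 0.54, "U2": 0.30, "U3": 0.34, "U4": 0.42}
```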

3.2. Phase 2: Select the top k2 statutes

Using the statute probabilities from the first phase, we identified the top k1 statutes for further selection. A statute is a content-rich text that consists of prerequisites and statute terms. In order to calculate the weights of statute terms for the top k1 statutes, the following formula, called the TF-ISF (term frequency-inverse statute frequency) formula, is used. The relevance of each statute term li in statute sj is measured by a weight, using the following weighting scheme for assigning weights to terms, w(li, sj) = TF(li,j) × ISF(li):

Fig. 8. Relationship between terms and judgment terms.

Fig. 9. Computation of the judgment terms’ weights.


TF(li,j) = freq(li,j): the normalized frequency of statute term li in statute sj

ISF(li) = log2(N / ni): N is the number of statutes; ni is the number of statutes containing li in the statute collection

w(li, sj) = TF(li,j) × ISF(li): the weight of statute term li in statute sj

In this phase, the main objective is to transform the terms in the user query into the statute terms in the statute. However, we encounter the same difficulties as when transforming query terms into judgment terms. Therefore, we applied the same NGD transformation method as described in Phase 1 to solve these problems.

Fig. 10 illustrates the process of transforming query terms into statute terms. First, the preprocessing procedure is used to segment a user query into multiple terms. Then, in order to transform the terms, we utilize the NGD method.

Similarly, by applying the transformation method described in Phase 1, we can transform a vector of r terms in a user query into a list of statute terms and retrieve the most similar k2 statutes from the top k1 statutes. "Closeness" is defined in terms of the cosine similarity metric, which is used to compute the similarity values between a user query and those statutes. The cosine similarity function measures how close user query q is to statute sj on a scale from zero to one. A value of zero indicates complete irrelevance. A higher similarity value implies greater similarity between the user query and those statutes. In this way, we can obtain the top k2 ranking statutes. The cosine similarity function is defined by the following equation:

Simcosine(sj, q) = (sj · q) / (|sj| × |q|) = Σ_{i=1..n} wi,j × wi,q / ( sqrt(Σ_{i=1..n} wi,j²) × sqrt(Σ_{i=1..n} wi,q²) )

where 0 ≤ Simcosine(sj, q) ≤ 1.
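For sparse term-weight dictionaries (the query vector on one side, a statute vector on the other), this similarity can be computed as in the short sketch below; names are ours.

```python
import math

def cosine_similarity(query_vec, statute_vec):
    """Cosine similarity between two sparse {term: weight} dictionaries; in [0, 1] for non-negative weights."""
    shared = set(query_vec) & set(statute_vec)
    dot = sum(query_vec[t] * statute_vec[t] for t in shared)
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_s = math.sqrt(sum(w * w for w in statute_vec.values()))
    if norm_q == 0.0 or norm_s == 0.0:
        return 0.0
    return dot / (norm_q * norm_s)
```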

3.3. Phase 3: Select the final predicted statutes

After Phase 2, the top k2 statutes are acquired. The Apriori association algorithm finds further association rules among the statutes in order to retrieve more precise statutes. The main purpose is to discover relationships between statutes (i.e., one statute is cited with another statute). This phase is divided into two steps: (1) mine associative statute rules and (2) determine the final predicted statutes.

When mining associative statute rules, a set of association rules is generated from the training collection. The Apriori mining algorithm is employed to discover associative statute rules. Let S = {s1, s2, ..., sr} be the statute set in all training data, where variable r is the total number of statutes in the statute set S. The definition of the associative statute rule is as follows:

{si | si ∈ S} → {sj | sj ∈ S} [confidence %]

where si is one statute, sj is another statute, and si ∩ sj = ∅. For example, {s1} → {s2} [70%] means that if statute s1 is cited in the judgment, then we are 70% confident that s2 is also cited.
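The paper mines these rules with the Apriori implementation in the Weka package; the sketch below shows an equivalent set-up with the mlxtend library, treating the statutes cited by each training judgment as one transaction. The support and confidence thresholds are placeholders.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

def mine_statute_rules(cited_statutes, min_support=0.01, min_confidence=0.5):
    """cited_statutes: list of statute-id lists, one per training judgment."""
    encoder = TransactionEncoder()
    onehot = pd.DataFrame(encoder.fit(cited_statutes).transform(cited_statutes),
                          columns=encoder.columns_)
    frequent = apriori(onehot, min_support=min_support, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=min_confidence)
    # keep only single-statute antecedents and consequents, matching the rule form above
    return rules[(rules["antecedents"].apply(len) == 1) & (rules["consequents"].apply(len) == 1)]
```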

A set of associative statute rules can be found using the Apriori mining algorithm. Let wi be the weight of statute si. We set wi as the similarity of si, with respect to the query, obtained in Phase 2. To find the final predicted statutes among the top k2 statutes, the following formula, called the SFW (statute final weight) formula, is used:

SFWj = wj + log(2M) × ( Σ_{i=1..M} wi × conf(i, j) ) / M

M denotes the number of statute rules with consequent sj, and conf(i, j) denotes the confidence of an associative statute rule si → sj.

In accordance with the SFW formula, the SFW values of statutes are computed and ranked. Then, the top candidate predicted statutes can be selected as the final predicted statutes. Suppose we want to obtain the SFW value of a statute, as shown in Fig. 11. Assume that the weight values of s1, s2, s3, s4, and s5 are 0.6, 0.5, 0.3, 0.4, and 0.7, respectively. Also assume that the confidences of the association rules of s1, s3, s4, and s5 with respect to s2 are 65%, 85%, 75%, and 70%, respectively. The final weights of s1, s3, s4, and s5 with respect to s2 are 0.39, 0.255, 0.3, and 0.49, respectively. In this way, the SFW value of s2 (0.823984) is acquired through the SFW formula.
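The small sketch below reproduces this worked example; a base-10 logarithm is assumed because it matches the quoted value of 0.823984, and the function name is ours.

```python
import math

def statute_final_weight(w_j, rule_terms):
    """SFW_j = w_j + log(2M) * (sum_i w_i * conf(i, j)) / M.

    rule_terms: list of (w_i, conf(i, j)) pairs for the M rules with consequent s_j.
    A base-10 logarithm is assumed here, as it reproduces the worked example above.
    """
    M = len(rule_terms)
    if M == 0:
        return w_j
    total = sum(w_i * conf for w_i, conf in rule_terms)
    return w_j + math.log10(2 * M) * total / M

sfw_s2 = statute_final_weight(0.5, [(0.6, 0.65), (0.3, 0.85), (0.4, 0.75), (0.7, 0.70)])
# sfw_s2 is approximately 0.824
```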

Fig. 10. Process of transforming query terms into statute terms.


4. Experimental study

4.1. Testbed

To evaluate our TPP approach, we conducted experiments on the Chinese criminal judgments stored in the Law and Regulations Retrieving System (Judicial Yuan, 2013). For each judgment, we collected data related to the fact and the cited statutes fields. The fact field describes the criminal facts and processes of the defendants, while the cited statutes are required at the time of the judge's sentence. The experiment data used in our research were gathered from the ten most common types of crime in 2012, as reported by the Judicial Yuan. In total, the data set contains 1518 different criminal judgments and took about 3 months to gather manually. Table 2 shows the distribution of criminal judgments over the ten types of crimes.

To test the performance of the TPP approach, we selected 70 examples from metropolitan civil news stories (Udn News Net, 2013) as queries. Udn News Net was selected for the query data because it is a well-known news website in Taiwan, containing various types of news stories related to people's everyday lives and social events. We sent these 70 queries to legal professionals, including two lawyers and two prosecutors. We expected them to provide at least two and no more than three associated statutes for each query. Based on their suggestions, 20 queries were eliminated because of an insufficient number of recommended related statutes. Most queries contain no more than three pertinent statutes. As a result, in our experiment, 50 queries were used for testing. The length of these 50 query examples varied from 41 to 321 words. The following is an example of a query:

Query: 新竹警方查獲 KTV 羅姓女店員暗中抄下多名客人信用卡號和卡片背面認證碼到網路購物盜刷。(HsinChu Police seized a female KTV clerk named Luo, who secretly wrote down guests' credit card numbers and authentication codes to conduct misappropriated online shopping.) From this query, we obtain the following useful terms:

Terms: 查獲 (seize), 暗中 (secretly), 信用卡 (credit card), 網路購物 (online shopping), 盜刷 (misappropriated).

After all the query and training data were collected, text preprocessing tasks were executed, including word segmentation, stop-word elimination, and POS tagging. These outcomes were then treated as the input data for the next step.

4.2. Details of implementation

Experiments were conducted to examine the performance of the TPP approach. First, preprocessing was executed for the document set. A stop word list containing 7321 words was developed to remove useless words, and 8306 terms were extracted from the document set. The TF-IDF weighting schemes were then used to determine the feature weights. For feature selection, all feature terms in the term list were screened using the multi-label entropy method (Clare & King, 2001).

Table 2
Distribution among training data collection.

Type                                        # Training docs
Offenses Against Public Safety              158
Offenses of Larceny                         155
Offenses of Fraud                           157
Offenses of Causing Bodily Harm             155
Offenses of Forging Instruments or Seals    156
Offenses Against Sexual Autonomy            150
Offenses of Gambling                        149
Offenses of Homicide                        150
Offenses of Misappropriation                146
Offenses Against Personal Liberty           142
Total                                       1518

Fig. 11. Computation of the final weight of a statute.


After feature selection, the selected feature set consisted of 507 features. In the next step, all document vectors were analyzed using the SVM algorithm and the classification model was generated. We chose libSVM (Chang & Lin, 2001), an open source multi-label SVM package, to execute the SVM learning and prediction processes.

To select optimal k1 and k2 values, we used 50 query examples as testing documents. Since the TPP approach has three phases, we must evaluate each phase's performance. Performance was evaluated according to the rankings of the statutes that our experts recommended as the output result. In the first phase, the best ranked and the worst ranked experts' statutes were 2 and 28, respectively. In other words, in the best case, the experts' statute was ranked second, while in the worst case, it was ranked 28th. Accordingly, we set k1 = 28. Next, in the second phase, the top 28 ranked statutes for each query were chosen for processing. The statute terms were extracted from these top 28 statutes for each query, such as 放火 (set fire), 致人於死 (resulted in death), and 財產 (property). Every query term was transformed into a different number of statute terms using the NGD transformation method. The TF-ISF weighting schemes were then used to determine the statute term weights. After calculating the similarities between the queries and the statute vectors, we found that the best ranked and the worst ranked experts' statutes were 2 and 16, respectively. Therefore, we set k2 = 16. Finally, the third phase's results are determined according to the selection of the top 16 statutes for each query, and applying the associative statute rules discovered by using the Apriori algorithm included in the Weka Data Mining Package (Witten & Frank, 2011). In total, 1082 rules were produced. Then, we applied the SFW formula to acquire the final predicted statutes. The best and worst ranking statutes among these 50 queries were 1 and 13, respectively. By aggregating the results from these three phases, we see that the associated statutes recommended by legal professionals are covered by selecting the top 13 statutes among all queries with k1 = 28 and k2 = 16. We also found that the performance continuously improved from the first phase to the third phase.

To see how k1 and k2 influence the performance of our TPP approach, we had to test various combinations of k1 and k2. We set k1 = 28 and k2 = 16 (denoted by TPPBASE) as the baseline combination. Table 3 illustrates the combinations of k1 and k2 that we compared with TPPBASE.

4.3. Experimental results and evaluation

In general, the most popular metrics to evaluate performance in information retrieval are precision and recall. For the following reason, our experiments only apply the concept of recall, called coverage in this paper, to evaluate the result. For each query, the number of suitable statutes selected by the legal professionals is usually not more than three. Since the answer set for each query is very small, the precision rates, which are the number of statutes recommended by experts divided by the number of output statutes, are very low and make it difficult to articulate the relative performance of the methods. Therefore, this work uses coverage to measure performance. Basically, coverage is the percentage of the experts' recommended statutes that are included within the top N statutes. The following equation is used to compute the coverage:

Coverage = (number of recommended statutes within the top N statutes) / (total number of statutes recommended by legal professionals)

N denotes the threshold on the number of output statutes.
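Aggregated over all test queries, the coverage for a cut-off N can be computed as in the sketch below (names are ours).

```python
def coverage(expert_statutes, predicted_rankings, N):
    """Coverage (%) of expert-recommended statutes within the top N predicted statutes.

    expert_statutes: list of sets of statutes recommended by the legal professionals,
    one per query; predicted_rankings: list of ranked statute lists, one per query.
    """
    hits = total = 0
    for recommended, ranking in zip(expert_statutes, predicted_rankings):
        top_n = set(ranking[:N])
        hits += sum(1 for s in recommended if s in top_n)
        total += len(recommended)
    return 100.0 * hits / total
```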

4.3.1. Find the optimal combination

Fig. 12 illustrates the coverage of TPPBASE. As can be seen from the graph, the third phase outperforms the first two phases. In the beginning, within the top 3 statutes, the coverage was at 15.9% in the first phase, 34.1% in the second phase, and 50% in the third phase. When N was set to 13, the third phase reached the maximum 100%, while the first phase only reached 63.6%. Coverage reached 100% for all phases, however, when N = 28.

In Figs. 13–15, we present the coverage values for combinations with k1 = 40, 30, and 20, respectively. In general, performance improves when the value of k1 decreases. Fig. 13 clearly shows that by fixing k1 = 40 and varying the value of k2, a smaller k2 value results in a higher coverage rate.

Similarly, Fig. 14 reveals that by fixing k1 = 30 and varying the value of k2, a smaller k2 value results in a higher coverage rate. With k2 = 15, however, it could only reach a peak of 97.7% coverage in the third phase. This is because there was one

Table 3
Combinations of k1 and k2.

k1    k2
40    30, 25, 20
30    25, 20, 15
20    15, 10


recommended statute ranked behind the top 15 in the two queries. Consequently, the coverage could not be further improved. Finally, as shown in Fig. 15, we checked the coverage of testing queries by fixing k1 = 20 and varying the value of k2. As expected, the smaller the value of k2, the greater the coverage in the third phase. This trend changes after N > 8, however, when the coverage for a large k2 becomes greater than that of a small k2. The reason is that when N is large, it is easy for the third phase to find recommended statutes from the input statute set. Thus, to improve coverage, we must

Fig. 12. The coverage of TPPBASE. (Coverage (%) versus N, the number of output statutes; series: Phase 1, Phase 2 (k1 = 28), Phase 3 (k2 = 16).)

Fig. 13. The coverage of k1 = 40. (Coverage (%) versus N, the number of output statutes; series: Phase 1, Phase 2 (k1 = 40), Phase 3 (k2 = 30), Phase 3 (k2 = 25), Phase 3 (k2 = 20).)

Fig. 14. The coverage of k1 = 30. (Coverage (%) versus N, the number of output statutes; series: Phase 1, Phase 2 (k1 = 30), Phase 3 (k2 = 25), Phase 3 (k2 = 20), Phase 3 (k2 = 15).)


make the input statute set of the third phase as complete as possible. Since a larger k2 means a larger statute set for phase 3, a larger k2 will result in better coverage than a smaller k2 when N is large.

In this experiment, we tested many combinations of k1 and k2 values to find the optimal combination. Table 4 summarizes the results of k1 and k2 using the TPP approach. After comparing these combinations, we find that k1 = 28 and k2 = 16 results in the most complete information, since it reaches a 100% coverage ratio for the top 13 statutes. On the other hand, although other combinations cannot reach 100% coverage, they are more compact and can reduce user workload when applying the suggested statutes. One of these combinations is (k1 = 30, k2 = 15) with N = 11, which has a coverage of 97%.

4.3.2. Comparison

To assess the performance of our custom-designed approach, the comparison methods used a classic TF-IDF scheme to build vectors for the testing documents. In accordance with the three phases of TPP, we built two groups of algorithms for comparison, where the second group contains the NGD transformation while the first group does not. In each group, there are three algorithms, corresponding to executing phase 1, phase 1 + phase 2, and phase 1 + phase 2 + phase 3, respectively. Executing only phase 1 corresponds to executing the SVM algorithm alone. Executing phase 1 + phase 2 corresponds to first executing the SVM algorithm and then the N-nearest neighbor algorithm. Executing all phases corresponds to first executing the SVM algorithm, then the N-nearest neighbor algorithm, and finally the association mining algorithm. For ease of reference, these six algorithms are denoted as: (1) TF-IDF without NGD: Phase 1 (denoted by F1), Phase 1 + Phase 2 (denoted by F2), and Phase 1 + Phase 2 + Phase 3 (denoted by F3); and (2) with NGD: Phase 1 (denoted by F4), Phase 1 + Phase 2 (denoted by F5), and Phase 1 + Phase 2 + Phase 3 (our TPP approach, denoted by F6). In addition, from Table 4, we chose the top 3 combinations of k1 and k2 to evaluate these six algorithms. Table 5 shows the performance of the various algorithms on statute prediction. Comparing the algorithms under the different combinations of k1 and k2 for the top 3 statutes, the coverage values of F6 (i.e., TPP) were 52.3, 49.2, and 50.8, respectively. It is obvious that F6 outperforms the other algorithms in the different combinations of k1 and k2. Similarly, when N is between 5 and 10, F6 is also the best of these algorithms. From the results in Table 5, we see that our proposed method is better than all the comparative methods.

Besides, we evaluated the performance of the TPP method by comparing its results with three state-of-the-art retrieval functions: cosine similarity, the Pearson correlation coefficient, and Spearman's correlation coefficient. Table 6 summarizes the results of the different methods. Compared to the three state-of-the-art functions, our proposed TPP method achieves the highest coverage rate when N is between 3 and 10. As we can see, the performance of the TPP method surpasses the other three methods. Therefore, we can conclude that the TPP approach provides an innovative method of statute prediction.

Fig. 15. The coverage of k1 = 20. (Coverage (%) versus N, the number of output statutes; series: Phase 1, Phase 2 (k1 = 20), Phase 3 (k2 = 15), Phase 3 (k2 = 10).)

Table 4
Reported performances (coverage %) of k1 and k2.

k1  k2  Top N statutes
        3     5     6     7     8     9     10    11    12    13    14    15    20    25    30
28  16  52.3  60.6  65.2  67.4  74.2  83.3  91.7  97    99.2  100   100   100   100   100   100
40  30  21.2  43.9  50    53.8  60.6  65.2  65.9  74.2  83.3  87.9  95.5  98.5  100   100   100
40  25  34.1  54.5  56.8  62.1  65.9  68.2  75.8  85.6  90.2  97.7  99.2  100   100   100   100
40  20  38.6  56.1  59.1  65.9  68.9  76.5  87.1  92.4  95.5  98.5  100   100   100   100   100
30  25  37.9  54.5  56.8  62.1  65.9  68.2  75.8  85.6  90.2  97.7  99.2  100   100   100   100
30  20  38.6  56.1  59.8  65.9  68.9  78.8  87.9  92.4  95.5  97.7  100   100   100   100   100
30  15  49.2  56.8  64.4  66.7  72.7  82.6  91.7  97    97    97.7  97.7  97.7  97.7  97.7  97.7
20  15  48.5  57.6  63.6  68.2  70.5  81.8  85.6  86.4  86.4  87.1  87.1  87.1  87.1  87.1  87.1
20  10  51.5  61.4  63.6  68.2  71.2  72.7  72.7  72.7  72.7  72.7  72.7  72.7  72.7  72.7  72.7


5. Conclusions and future directions

With the increasing amount of social interaction in people's daily lives, understanding related laws and regulations has become more and more important. Owing to their lack of legal knowledge and background, the general public finds it difficult to navigate automatic law consulting systems. To remedy this gap, this paper proposed an innovative method named TPP (Three-Phase Prediction) to assist the general public in the automatic retrieval of associated statutes. The training data in the experiments were historical criminal judgments and statutes taken from the laws and regulations retrieval system of the Judicial Yuan in Taiwan. We selected 13 news stories from metropolitan civil news to use as queries. In addition, we proposed an evaluation procedure with various parameter combinations to choose proper k1 and k2 values for the TPP algorithm. The experimental results demonstrated that TPP can predict pertinent statutes with a coverage rate of 52.3% for the top 3 statutes, 61.4% for the top 5 statutes, and 74.2% for the top 8 statutes. The three-phase approach provides a solid framework for developing efficient statute prediction algorithms. This paper marks just the beginning of this research line.

Future work can improve upon the performance of the current study by using data or text mining techniques to increase the accuracy of the first or second phase. It can also improve the performance of the third phase, which selects an accurate target from a number of candidates. Moreover, we plan to extend the system further so that it will also provide possible adjudications for the user's legal problem. Another interesting issue is that, since similar factual scenarios might appear in many different areas of law, mediating between the factual specifics of a case and a particular area of the law is a potential future research topic. Finally, the two parameters k1 and k2 are query independent; that is, their values are determined before the query is issued. If their values could instead be determined dynamically at retrieval time, the retrieval performance could be improved. To this end, our current solution approach must be extended so that it can decide the best values of k1 and k2 based on the characteristics and features of the query before performing the current TPP method.

Table 5
Evaluation of statute prediction (coverage %).

k1  k2  Phase set   Top N statutes
                    3     5     6     7     8     9     10
28  16  F1          1.5   12.1  17.4  22    24.2  28    34.8
        F2          3     16.7  22    26.5  29.5  34.8  38.6
        F3          3.8   22    26.5  30.3  33.3  37.9  41.7
        F4          15.9  29.5  36.4  40.2  46.2  52.3  59.1
        F5          34.1  52.3  55.3  61.4  63.6  65.9  69.7
        F6          52.3  60.6  65.2  67.4  74.2  83.3  91.7
30  15  F1          1.5   12.1  17.4  22    24.2  28    34.8
        F2          1.5   16.7  21.2  25.8  27.3  32.6  37.9
        F3          3     21.2  25.8  28.8  32.6  36.4  40.2
        F4          15.9  29.5  36.4  40.2  46.2  52.3  59.1
        F5          33.3  50.8  54.5  62.1  62.9  65.9  69.7
        F6          49.2  56.8  64.4  66.7  72.7  82.6  91.7
20  10  F1          1.5   12.1  17.4  22    24.2  28    34.8
        F2          6.1   19.7  23.5  30.3  35.6  37.1  40.2
        F3          10.6  24.2  29.5  36.4  40.2  40.2  40.2
        F4          15.9  29.5  36.4  40.2  46.2  52.3  59.1
        F5          35.6  53.8  57.6  62.1  65.9  68.2  72.7
        F6          50.8  61.4  63.6  68.2  71.2  72.7  72.7

Table 6
Evaluation of TPP and three state-of-the-art retrieval functions (coverage %).

Algorithm                             Top N
                                      3     5     6     7     8     9     10
TPP (k1 = 28, k2 = 16)                52.3  60.6  65.2  67.4  74.2  83.3  91.7
Cosine similarity                     15.2  28.8  31.1  33.3  37.9  40.2  41.7
Pearson correlation coefficient       15.9  28.8  31.1  32.6  35.6  38.6  40.2
Spearman's correlation coefficient    8.3   20.5  26.5  28.8  30.3  34.1  38.6


Appendix A. An example demonstrating the TPP method

We use the following query as an example to present our TPP method. Suppose we have ten statutes, and we set k1 to 7 in Phase 1 and k2 to 5 in Phase 2, and finally retrieve the Top 3 statutes in Phase 3. Fig. A1 shows how the TPP approach works for this query.

Query: 新竹警方查獲 KTV 羅姓女店員暗中抄下多名客人信用卡號和卡片背面認證碼到網路購物盜刷。 (Hsinchu police caught a female KTV clerk surnamed Luo who had secretly copied down several customers' credit card numbers and the verification codes on the backs of their cards to make fraudulent online purchases.)

From this query, we obtain five query terms: 查獲 (seize), 暗中 (secretly), 信用卡 (credit card), 網路購物 (online shopping), and 盜刷 (fraudulent card use). Let t1 denote 查獲, t2 denote 暗中, t3 denote 信用卡, t4 denote 網路購物, and t5 denote 盜刷.

A.1. Phase 1: Select top k1 statutes

In this phase, we first need to perform the transformation between the five query terms and the judgment term set (i.e., BASE), which is selected from the training judgments. Suppose there are ten judgment terms u1, u2, ..., u10 in BASE. As defined in Section 3.1, gi,j is the similarity between query term ti and judgment term uj. Table A.1 shows the similarities between the query terms and the judgment terms. Also, assume that the weights of query terms t1, t2, t3, t4, and t5 are all equal to 1. To transform a query term into multiple judgment terms, four steps are performed, as demonstrated in Figs. 8 and 9 in Section 3.1. After the transformation, the weights of judgment terms u1, u2, ..., u10 are 0.6, 0.4, 0.2, 0.1, 0.55, 0.8, 0.8, 0.2, 0.65, and 0.45, respectively.
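The four transformation steps themselves are defined in Section 3.1 and are not repeated here. Purely as an illustration of the idea, the sketch below aggregates the similarity-weighted contributions of the query terms for each judgment term, averaging over the query terms that have a similarity to that term; this simplified rule is an assumption for illustration only and is not guaranteed to reproduce the weights 0.6, 0.4, ..., 0.45 listed above.

import numpy as np

def transform_query_terms(query_weights, similarity):
    # query_weights: shape (n_query_terms,), here all equal to 1.
    # similarity:    shape (n_query_terms, n_judgment_terms), entries g[i, j]
    #                between query term t_i and judgment term u_j (0 where no value).
    # Illustrative rule only: each judgment term receives the average
    # similarity-weighted contribution of the query terms related to it.
    sim = np.asarray(similarity, dtype=float)
    contrib = np.asarray(query_weights, dtype=float)[:, None] * sim
    related = np.maximum((sim > 0).sum(axis=0), 1)
    return contrib.sum(axis=0) / related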

Fig. A1. Batch and online process of the example. In the batch process, the training judgments and statutes pass through the preprocessing procedure and feature selection to form BASE (the judgment terms); SVM training produces the classification model, the statute transformation produces the statute vectors, and mining of associative statute rules produces the rule set. In the online process, Phase 1 applies the NGD transformation to the query and predicts the Top 7 statutes, Phase 2 builds the query vector and computes the cosine similarity against the 7 statute vectors to obtain the Top 5 statutes, and Phase 3 computes the statute weights with the associative statute rules to output the Top 3 statutes.


Next, through the SVM classification model, the probabilities of the ten statutes can be determined. Assume that the probabilities of the 10 statutes S1, S2, S3, S4, S5, S6, S7, S8, S9, and S10 produced by the SVM model are 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, and 0.4, respectively.
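Phase 1 then simply keeps the k1 statutes with the highest SVM probabilities; with the probabilities assumed above and k1 = 7, this yields S1 through S7, as the short sketch below shows.

# Phase 1 sketch: rank the statutes by the SVM probabilities assumed above and keep the top k1.
statute_probs = {
    "S1": 0.85, "S2": 0.80, "S3": 0.75, "S4": 0.70, "S5": 0.65,
    "S6": 0.60, "S7": 0.55, "S8": 0.50, "S9": 0.45, "S10": 0.40,
}
k1 = 7
top_k1 = sorted(statute_probs, key=statute_probs.get, reverse=True)[:k1]
print(top_k1)   # ['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7']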

A.2. Phase 2: Select the top k2 statutes

From Phase 1, in accordance with the probabilities, the sequence of the Top 7 statutes is S1, S2, ..., S7. Assume that 12 statute terms l1, l2, ..., l12 are generated from the Top 7 statutes. Likewise, the query terms can be mapped onto the 12 statute terms of the Top 7 statutes using the transformation method described in Section 3.2. After the transformation, the query vector is built from the weights of the statute terms. Afterward, we use the cosine similarity formula to compute the similarity between the query vector and the statute vectors. Table A.2 shows the 7 statute vectors and the query vector. The cosine similarity values between the query vector and the Top 7 statute vectors S1, S2, S3, S4, S5, S6, and S7 are 0.605228, 0.771517, 0.62361, 0.377964, 0.600538, 0.251976, and 0.679921, respectively. Consequently, the new order of the statutes is S2, S7, S3, S1, S5, S4, and S6.
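For instance, the similarity 0.771517 reported for S2 can be reproduced directly from the vectors in Table A.2 with a standard cosine computation; the minimal sketch below shows this for S2, and the full run would score all seven statute vectors the same way.

import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Query vector and statute vector S2 over the 12 statute terms l1..l12 (Table A.2).
query = [0, 0.30103, 0.60206, 0.60206, 0.30103, 0,
         0.30103, 0.30103, 0.60206, 0.60206, 0.30103, 0]
s2 = [0, 0.30103, 0.30103, 0.30103, 0, 0.30103,
      0, 0.30103, 0.30103, 0.30103, 0, 0.30103]

print(round(cosine(query, s2), 6))   # 0.771517, as in Table A.2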

A.3. Phase 3: Select the final predicted statutes

Since k2 is 5, the output statutes of Phase 2 are S2, S7, S3, S1, and S5. To acquire more precise relevant statutes, this phase employs the Apriori association algorithm. The SFW (Statute Final Weight) formula described in Section 3.3 is applied to determine the final sequence of statutes. Assume that we have the confidence matrix shown in Table A.3, which gives the confidence of each associative statute rule si → sj. After the computation, the SFW values of the Top 5 statutes S2, S7, S3, S1, and S5 are 0.887866, 1.074924, 0.898879, 0.923447, and 1.019136, respectively. From this result, the sequence of the final Top 3 statutes is S2, S7, and S5.
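The exact SFW formula is the one defined in Section 3.3 and is not reproduced here. Purely as a hypothetical illustration of re-ranking with associative statute rules, the sketch below scores each Phase-2 candidate by the average confidence of the rules pointing to it from the other candidates, using the values of Table A.3; under this simplified scoring the selected set also happens to be {S2, S5, S7}, although the scores and ordering differ from the SFW values above.

# Hypothetical Phase 3 illustration (not the paper's SFW formula): score each Phase-2
# candidate by the average confidence of associative rules s_j -> s_i from the other
# candidates, using the confidences of Table A.3 (in %).
candidates = ["S2", "S7", "S3", "S1", "S5"]          # Phase 2 output, in order

confidence = {                                        # confidence[(antecedent, consequent)]
    ("S1", "S2"): 80, ("S1", "S3"): 70, ("S1", "S5"): 70, ("S1", "S7"): 80,
    ("S2", "S1"): 80, ("S2", "S3"): 65, ("S2", "S5"): 70, ("S2", "S7"): 75,
    ("S3", "S1"): 70, ("S3", "S2"): 75, ("S3", "S5"): 75, ("S3", "S7"): 80,
    ("S5", "S1"): 65, ("S5", "S2"): 75, ("S5", "S3"): 65, ("S5", "S7"): 80,
    ("S7", "S1"): 60, ("S7", "S2"): 80, ("S7", "S3"): 60, ("S7", "S5"): 85,
}

def rule_score(target, pool):
    # Average confidence of rules src -> target over the other candidates in the pool.
    confs = [confidence.get((src, target), 0) for src in pool if src != target]
    return sum(confs) / len(confs)

scores = {s: rule_score(s, candidates) for s in candidates}
top3 = sorted(candidates, key=scores.get, reverse=True)[:3]
print(top3)   # ['S7', 'S2', 'S5'] under this simplified scoring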

Table A.1
The similarity values of query terms to the judgment terms u1, u2, ..., u10 (only the cells with a similarity value are listed).

t1: 0.5, 0.5, 0.4, 0.6
t2: 0.3, 0.2, 0.5, 0.4, 0.1
t3: 0.7, 0.2, 0.6, 0.5
t4: 0.5, 0.8, 0.7
t5: 0.2, 0.4, 0.5, 0.8, 0.1

Table A.2
Top 7 statute vectors and query vector (Similarity is the cosine similarity between Query and Si).

Statute  l1       l2       l3        l4       l5       l6       l7       l8       l9        l10      l11      l12      Similarity
S1       0.30103  0.60206  0.30103   0        0.30103  0        0        0.60206  0.30103   0        0.30103  0        0.605228
S2       0        0.30103  0.30103   0.30103  0        0.30103  0        0.30103  0.30103   0.30103  0        0.30103  0.771517
S3       0.30103  0        0         0.30103  0.30103  0        0.30103  0        0         0.30103  0.30103  0        0.62361
S4       0        0.30103  0         0.30103  0        0.60206  0        0.30103  0         0.30103  0        0.60206  0.377964
S5       0        0        0.425969  0        0.30103  0        0.90309  0        0.425969  0        0.30103  0        0.600538
S6       0.30103  0        0         0        0.30103  0        0        0        0         0        0.30103  0        0.251976
S7       0        0.60206  0.425969  0        0.30103  0        0        0.60206  0.425969  0        0.30103  0        0.679921
Query    0        0.30103  0.60206   0.60206  0.30103  0        0.30103  0.30103  0.60206   0.60206  0.30103  0

Table A.3
Confidence matrix of si → sj. All values are shown in %.

                 Consequent sj (Top 5 statutes)
Antecedent si    S1    S2    S3    S5    S7
S1               –     80    70    70    80
S2               80    –     65    70    75
S3               70    75    –     75    80
S4               60    80    70    75    80
S5               65    75    65    –     80
S6               75    80    70    90    75
S7               60    80    60    85    –

