Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009-2010.
![Page 1: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/1.jpg)
Learning to Classify Documents
Edwin Zhang
Computer Systems Lab 2009-2010
![Page 2: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/2.jpg)
Introduction
Classifying documents
Will use a Bayesian method and calculate conditional probability
Use a set of training documents
Choose a set of features for each category
Coding in Java
![Page 3: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/3.jpg)
Background
Naïve Bayes Classifier/Bayesian Method
Computes the conditional probability p(T|D) of every topic T for a given document D
Assigns the document D to the topic with the largest conditional probability
http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
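The decision rule on this slide can be sketched in Java roughly as follows. This is a minimal multinomial naive Bayes sketch under stated assumptions, not the project's actual code: the class name `NaiveBayesSketch`, its method names, and the add-one smoothing are all hypothetical choices made for illustration. It uses p(T|D) ∝ p(T) · ∏ p(term|T) and compares logarithms so that products of small probabilities do not underflow.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a multinomial naive Bayes classifier; not the project's code.
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerTopic = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> totalTerms = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Learning: record the terms of one training document under its topic.
    void train(String topic, String[] terms) {
        totalDocs++;
        docsPerTopic.merge(topic, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(topic, k -> new HashMap<>());
        for (String term : terms) {
            counts.merge(term, 1, Integer::sum);
            totalTerms.merge(topic, 1, Integer::sum);
            vocabulary.add(term);
        }
    }

    // log p(T) + sum of log p(term | T), with add-one smoothing (an assumption).
    double logScore(String topic, String[] terms) {
        double score = Math.log(docsPerTopic.get(topic) / (double) totalDocs);
        Map<String, Integer> counts = termCounts.get(topic);
        double denominator = totalTerms.getOrDefault(topic, 0) + vocabulary.size();
        for (String term : terms) {
            score += Math.log((counts.getOrDefault(term, 0) + 1) / denominator);
        }
        return score;
    }

    // Prediction: return the topic with the largest score, or null if untrained.
    String classify(String[] terms) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String topic : docsPerTopic.keySet()) {
            double score = logScore(topic, terms);
            if (score > bestScore) {
                bestScore = score;
                best = topic;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("tennis", new String[] {"serve", "racket", "match"});
        nb.train("other", new String[] {"election", "vote", "match"});
        System.out.println(nb.classify(new String[] {"racket", "serve"}));
    }
}
```

The two toy training documents are invented for the example; the topic names "tennis" and "other" follow the categories mentioned later in the slides.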
![Page 4: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/4.jpg)
Background
Program has two steps:
Learning
Prediction
![Page 5: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/5.jpg)
Learning
Uses the training documents
Computes conditional probability
Feature selection based on how often terms appear in certain documents
http://www.dot.state.mn.us/consult/images/j0341469.jpg
![Page 6: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/6.jpg)
Prediction
Predicting what an unknown document is talking about, based on the learning step
http://www.deafsports.co.nz/WebImages/documents.jpg
![Page 7: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/7.jpg)
Development
Created Category, Document, Terms classes
– Category class deals with the categories
– Document class deals with the documents
– Terms class deals with terms that appear in each document
![Page 8: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/8.jpg)
Category
Each category contains an array of documents
My categories started out with tennis and other
Added more categories as my program started working
![Page 9: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/9.jpg)
Document Class
Each document contains an array of terms.
The documents were my training documents
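The containment described on the Category and Document slides (a category holds an array of documents, and a document holds an array of terms) might look something like the sketch below. The class names `CategorySketch` and `DocumentSketch`, their fields, and the `countTerm` helper are hypothetical and only illustrate the structure; the project's real classes may differ.

```java
// Hypothetical sketch of the containment described on the Category and Document slides.
// Class, field, and method names are assumptions, not the project's actual code.
class CategorySketch {
    final String name;
    final DocumentSketch[] documents; // each category contains an array of documents

    CategorySketch(String name, DocumentSketch[] documents) {
        this.name = name;
        this.documents = documents;
    }

    // Counts how many times a term appears across this category's documents.
    int countTerm(String term) {
        int count = 0;
        for (DocumentSketch document : documents) {
            for (String t : document.terms) {
                if (t.equals(term)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        DocumentSketch d1 = new DocumentSketch(new String[] {"serve", "racket", "serve"});
        DocumentSketch d2 = new DocumentSketch(new String[] {"match", "serve"});
        CategorySketch tennis = new CategorySketch("tennis", new DocumentSketch[] {d1, d2});
        System.out.println(tennis.countTerm("serve")); // 3
    }
}

class DocumentSketch {
    final String[] terms; // each document contains an array of terms

    DocumentSketch(String[] terms) {
        this.terms = terms;
    }
}
```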
![Page 10: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/10.jpg)
Terms Class
Terms class dealt with all the terms that appeared in the training documents
For each term, an array of counts of the number of times the term appears in documents
– Counts for each category
Also, each term is assigned a score
– Score = (number of times in category A + 1) / (number of times in category B + 1); the +1 avoids dividing by 0
– Method to calculate the score varied as my program developed
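Reading the score on this slide as (count in category A + 1) / (count in category B + 1), it can be sketched as below. That reading is an interpretation of the slide, and the class name `TermScoreSketch` and the parameter names are hypothetical. The slides also note that the scoring method changed over time, so this shows only one possible version.

```java
// Hypothetical sketch of the per-term score; names are illustrative, not the project's code.
class TermScoreSketch {
    // countInA: occurrences of the term in category A's training documents.
    // countInB: occurrences of the term in category B's training documents.
    static double score(int countInA, int countInB) {
        // Adding 1 to both counts keeps the denominator nonzero.
        return (countInA + 1.0) / (countInB + 1.0);
    }

    public static void main(String[] args) {
        System.out.println(score(9, 0)); // term seen only in A: 10.0
        System.out.println(score(0, 9)); // term seen only in B: 0.1
        System.out.println(score(4, 4)); // term equally common in both: 1.0
    }
}
```

A score well above 1 suggests the term is characteristic of category A, and a score near 1 suggests it does not distinguish the two categories.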
![Page 11: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/11.jpg)
Development (continued)
Creates an array of categories
Reads in all my training documents
Stores all the terms that appear in an array of Terms
Sorts the array of terms based on the score for each category
Chooses the top 25 terms from the sorted array for each category
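The sort-and-select step above could be sketched as follows. This is only an illustration under assumptions: the slides store terms in an array of Terms objects, whereas this sketch uses a hypothetical map from term to score, and the name `TopTermsSketch` and its `topTerms` method are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of choosing the highest-scoring terms for one category.
class TopTermsSketch {
    // Returns up to `limit` terms with the highest scores, best first.
    static List<String> topTerms(Map<String, Double> scores, int limit) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        // Sort by score in descending order.
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("racket", 10.0);
        scores.put("match", 1.0);
        scores.put("vote", 0.1);
        // The slides use the top 25; a limit of 2 keeps the toy example short.
        System.out.println(topTerms(scores, 2)); // [racket, match]
    }
}
```

Calling `topTerms(scores, 25)` would correspond to the slide's top-25 selection; when fewer than 25 terms exist, the sketch simply returns all of them.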
![Page 12: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/12.jpg)
Development (continued)
What I still need to do:
– Test my program's learning and write the prediction part
– Once my program works for two categories, add more categories
http://www.filibeto.org/sun/lib/nonsun/oracle/11.1.0.6.0/B28359_01/text.111/b28303/img/ccapp018.gif
![Page 13: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/13.jpg)
Expected Results
The more training documents, the better the results will likely be
In addition, different ways of calculating the score will likely produce different results
May play around with that
![Page 14: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/14.jpg)
Discussion
Once my program starts running and working correctly, I will discuss the results
I have finished the Learning part of the program, but now I need to do the Prediction part
![Page 15: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/15.jpg)
Works Cited
http://www.nltk.org/book
My dad
Chai, Kian Ming Adam, Hai Leong Chieu, and Hwee Tou Ng. ACM Portal. Association for Computing Machinery, 2002. Web. 14 Jan. 2010. <http://portal.acm.org/citation.cfm?id=564376.564395&coll=Portal&dl=ACM&CFID=70884224&CFTOKEN=94712991>.
![Page 16: Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009- 2010.](https://reader035.fdocuments.us/reader035/viewer/2022070412/5697bf7a1a28abf838c83131/html5/thumbnails/16.jpg)
Works Cited (continued)
Eyheramendy, Susana, and David Madigan. "A Flexible Bayesian Generalized Linear Model for Dichotomous Response Data with an Application to Text Categorization." Lecture Notes-Monograph Series 54 (2007): 76-91. JSTOR. Web. 25 Oct. 2009. <http://www.jstor.org/stable/20461460>.
Lavine, Michael, and Mike West. "A Bayesian Method for Classification and Discrimination." Canadian Journal of Statistics 20.4 (1992): 451-461. JSTOR. Web. 14 Jan. 2010. <http://www.jstor.org/>.