Project 2
CS652
Presented by: REEMA AL-KAMHA
Results
• VSM model:
  – The training set contains 18 documents (10 positive, 8 negative).
  – Ontology vector (1, 0.91, 1, 0.85, 0.90, 1.16, 0.33) corresponds to (Type, GoldAlloy, Price, DiamondWeight, MetalKind, Gem, GemShape)
  – Threshold = 0.7
  – The testing set contains 20 documents (10 positive, 10 negative)
    • Recall = 1, Precision = 0.8
  – The instructor's testing set contains 24 documents (1 positive, 23 negative)
    • Recall = 1, Precision = 1
Results
• NB model:
  – The training set contains 18 documents (10 positive, 8 negative).
  – The testing set contains 20 documents (10 positive, 10 negative)
    • Recall = 1, Precision = 0.7
  – The instructor's testing set contains 24 documents (1 positive, 23 negative)
    • Recall = 1, Precision = 0.5
Comments
• VSM model:
  – the average for each attribute = the number of occurrences of the attribute / the number of records
• NB model:
  – for the vocabulary document, remove all stop-words
• In the results I always have Recall = 1, which means the process does not discard any relevant document.
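The classification step described above can be sketched as follows. This is an illustrative Python reconstruction, not the project's Java code: the ontology vector holds the per-record average occurrence count of each attribute in the positive training set, and a document counts as relevant when the cosine similarity of its attribute-count vector to the ontology vector reaches the threshold.

```python
import math

def cosine(u, v):
    """Cosine similarity between two attribute-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0  # a zero vector is undefined in VSM; treat as dissimilar
    return dot / (nu * nv)

# Ontology vector from the slide: (Type, GoldAlloy, Price, DiamondWeight,
# MetalKind, Gem, GemShape)
ONTOLOGY = [1, 0.91, 1, 0.85, 0.90, 1.16, 0.33]

def is_relevant(doc_counts, threshold=0.7):
    """A document is relevant when its similarity meets the threshold."""
    return cosine(doc_counts, ONTOLOGY) >= threshold
```

Note the zero-vector guard: cosine similarity is undefined for a document matching no attributes (a point raised again later in these slides).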
Documents Classification
• Two methods have been implemented in Java:
  – VSM – Vector Space Model
  – NB – Naïve Bayes
• Applying VSM to my domain (Books) was not without problems, basically because of the meaning of the title and the author. For example, when trying to apply VSM to cars, some things need to be figured out, such as: do we consider the model of the car as the title and the make as the author? Of course, such an assumption caused some trouble, since some irrelevant documents became relevant.
Muhammed
So the approach was to ignore the title and the author and use the other attributes to judge whether a document is relevant. You can see from the table that the Cars document comes close to attaining the threshold.
Document         Title   Auth.   Price   Year   ISBN   Cosine
PC               0       0       7       23     0      0.551
Books            0       0       15      15     15     0.779
Digital Camera   0       0       12      10     0      0.453
Cars             0       0       38      29     0      0.748
Recall = 100%, Precision = 100%, Threshold = 76%.
Note: when we take the title and the author into consideration, the threshold becomes 0.999.
[Chart: the threshold for the Books domain — threshold range (y-axis: 0.755–0.77) vs. number of documents (x-axis: 1–5)]
• Other documents' similarity:
  – Drug = 0.435
  – Real_estate = 0.599
  – Computer = 0.423
Naïve Bayes

Document         Result
Books1*          Relevant
Books2*          Relevant
Books3           Relevant
Cars             Irrelevant
Drugs            Irrelevant
PC               Irrelevant
Digital Camera   Irrelevant
Real Estate      Irrelevant
Jewelry          Irrelevant
Computer         Irrelevant

Precision: 100%; Recall: 100%
* From the same website; the remaining documents are from the provided test cases.
Conclusion
• Both of the implemented methods are efficient.
• VSM is easier to implement and faster.
• Much time was spent because I misunderstood the NB algorithm – this was my problem.
• When amplifying key attributes that are almost unique to a domain, 100% precision and recall are very achievable.
• NB is not very sensitive to the boundary values.
Tim Chartrand Project 2 Results
• Application Domain:
  – Software (Shareware and Freeware)
• Size of training set:
  – Positive: 10
  – Negative: 10
VSM Results

            Size   Precision   Recall   F-Measure
Test +      10     100%        100%     100%
Test -      10     100%        100%     100%
TA Test +   1      100%        100%     100%
TA Test -   22     100%        100%     100%
Total       43     100%        100%     100%
VSM Improvements
• Normalize positive training example results to find the per-record expected values
• Add a weight to each attribute:
  – Epos(i) = expected value for attribute i in positive examples
  – Eneg(i) = expected value for attribute i in negative examples
  – Diff(i) = Epos(i) − Eneg(i)
  – Weight(i) = Diff(i) / max{j=1..n} Diff(j)
  – Ontology(i) = Epos(i) × Weight(i)
• Weighting results:
  – Average difference improved from 0.587 to 0.714
  – Separation improved from 0.280 to 0.422
  – Price was given a weight of 0 – not considered in document classification
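The weighting scheme above can be sketched in a few lines. This is an illustrative reconstruction (function and variable names are mine, not from the project):

```python
def weighted_ontology(e_pos, e_neg):
    """e_pos / e_neg: per-attribute expected values from the positive
    and negative training examples."""
    diff = [p - n for p, n in zip(e_pos, e_neg)]    # Diff(i) = Epos(i) - Eneg(i)
    max_diff = max(diff)
    weight = [d / max_diff for d in diff]           # Weight(i) = Diff(i) / max Diff(j)
    return [p * w for p, w in zip(e_pos, weight)]   # Ontology(i) = Epos(i) * Weight(i)
```

An attribute that occurs equally often in positive and negative examples gets Diff(i) = 0 and hence weight 0, which is how Price dropped out of classification above.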
Bayes Results & Improvements
• Improvements – reduce the vocabulary to the "best" words:
  – Eliminate stopwords
  – Stem common prefixes and suffixes
  – Ignore case
  – Eliminate numbers
  – Remove non-alphabetic characters before and after a word
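A minimal sketch of these vocabulary-reduction steps (the stopword list and suffix list here are illustrative stand-ins, and the suffix stripping is a crude placeholder rather than a real stemmer):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "at"}
SUFFIXES = ("ing", "ed", "es", "s")  # crude suffix stripping, not a real stemmer

def reduce_vocabulary(text):
    words = []
    for token in text.split():
        token = token.lower()                            # ignore case
        token = re.sub(r"^[^a-z]+|[^a-z]+$", "", token)  # strip surrounding non-alphabetics
        if not token or token in STOPWORDS:              # eliminate stopwords
            continue
        if any(ch.isdigit() for ch in token):            # eliminate numbers
            continue
        for suf in SUFFIXES:                             # stem common suffixes
            if token.endswith(suf) and len(token) > len(suf) + 2:
                token = token[: -len(suf)]
                break
        words.append(token)
    return words
```

For example, "The Cars, priced 1200!" reduces to the stems of "cars" and "priced", with the stopword and the number discarded.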
Bayes Results

            Size   Precision   Recall   F-Measure
Test +      10     100%        100%     100%
Test -      10     100%        100%     100%
TA Test +   1      100%        100%     100%
TA Test -   22     100%***     100%     100%
Total       43     100%        100%     100%
*** Somewhat artificial result. I started out at about 10% Precision and added negative training examples until I correctly classified all test examples.
VSM Vs. Bayes
• End results were the same, but…
  – VSM performed better using only the original dataset
  – Bayes seems to need more training data (mainly negative)
• Major advantage of VSM – clustering:
  – Using the ontology as a vector allowed effective clustering of similar data items (e.g., dates, prices, etc.)
  – Reduced dimensionality from about 1500 to 8
Text Classification
Helen Chen
CS652 Project 2
May 31, 2002
Documents and Methods
• Application: movies
• Documents:
  – Training Set: 5 positive docs; 24 negative docs
  – My Test Set: 5 positive docs; 14 negative docs
• Methods: VSM model and NB
Vector Space Model and Naïve Bayes
• VSM: threshold is 0.65

              My Test Set   TA Test Set
  Precision   100%          100%
  Recall      100%          100%

• NB

              My Test Set   TA Test Set
  Precision   100%          100%
  Recall      100%          100%

Results tested on my own testing set and the instructor-provided testing set (1 positive doc, 23 negative docs) for the VSM model (top) and NB (bottom).
Comments on VSM
• Weighting is critical to performance
  – Assign weights according to positive examples
  – Adjust weights according to negative examples

Weights assigned to each attribute:

Attribute   Title   MPAA Rating   Length   Release Date   Director   Genre
Weight      1       1             1        0.7            1          2.4
Comments on NB
• The choice of irrelevant documents in the training set is critical to the performance

Results for the "clustered" training set:

              My Test Set   TA Test Set
  Precision   100%          100%
  Recall      100%          100%

Results for the evenly distributed training set:

              My Test Set   TA Test Set
  Precision   100%          100%
  Recall      60%           0%
Yihong's Project 2
• Target topic: Apartment Rental
• Training Sets – 5 positive, 10 negative
• Testing Sets
  – Self sets: 5 positive, 9 negative
  – TA sets: 1 positive, 23 negative
VSM Results
• 100% Precision and Recall for both self-collected sets and TA-collected sets
• Threshold value: 0.868
• Most similar application
  – Real Estate, range: 0.792–0.846
  – Compare with AptRental, range: 0.891–0.937
• Weighting attributes
  – Precision-weighted
  – Recall-weighted
  – F-measure-weighted
Naïve Bayes Results
• 100% Precision and Recall for both self-collected sets and TA-collected sets
• Summation instead of product
  – to avoid the problem of underflow
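The summation-instead-of-product trick can be sketched as follows: summing log-probabilities is numerically safe where multiplying hundreds of tiny probabilities would underflow to zero. This is an illustrative reconstruction of a standard multinomial Naïve Bayes with Laplace smoothing, not the project's code:

```python
import math
from collections import Counter

def train(docs_by_class):
    """docs_by_class: {label: [token lists]}. Returns, per label, a log
    prior and Laplace-smoothed per-token log probabilities."""
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
    total = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(t for d in docs for t in d)
        n = sum(counts.values())
        model[label] = (
            math.log(len(docs) / total),  # log prior
            {t: math.log((counts[t] + 1) / (n + len(vocab))) for t in vocab},
        )
    return model

def classify(model, doc):
    def score(entry):
        log_prior, log_probs = entry
        # sum of logs replaces the product of raw probabilities
        return log_prior + sum(log_probs[t] for t in doc if t in log_probs)
    return max(model, key=lambda label: score(model[label]))
```

Tokens outside the training vocabulary are simply skipped, which sidesteps zero probabilities for unseen words.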
More Comments
• The machine cannot know what is unknown
  – training examples must be representative
• The estimate of the prior probability of target values is very important
  – a 50% estimate against the 4.2% "real" distribution is undesirable: precision is 25%
  – with a 33% estimate, 100% is achieved; for over 50% of irrelevant cases, pos − neg < 3
• Cluster special attributes, like phone number, price, etc. (similar thinking to our ontology)
• Distributional clustering
  – should work fine because of the low noise level of semi-structured documents
David Marble
CS 652 Spring 2002
Project 2 – VSM/Bayes
Results (My Test Data)

        RECALL        PRECISION
VSM     8/10 (80%)    10/10 (100%)
Bayes   9/10 (90%)    10/10 (100%)

VSM failed on: classified ads, and car ads with a lot of info.
Bayes failed on: missed one restaurant page! That page had no food description and city names from outside my training set. FoodType was the key. (Not too many extraneous documents have the words "mexican, fish, BBQ, chinese," etc. These words show up on average just over 2 times per record in the positive training documents.)

Results (BYU Data)

        RECALL         PRECISION
VSM     20/24 (83%)    24/24 (100%)
Bayes   24/24 (100%)   24/24 (100%)

VSM failed on: Cars, Apartments, Shopping, and Real Estate. Lots of phone numbers, addresses, cities, and states; a name is a given (how can you distinguish what a restaurant name is?).
Bayes failed on: nothing. Once again, FoodType was the key. Luckily, the one applicable document had its food type listed.
Comments
• Training data contained State & Zip only half the time.
• Names of restaurants could not be a specific term, therefore just about every record had a “restaurant name.”
• Mainly did well with Naïve Bayes because of FoodType extraction – average of over 2 per record in training data and covered most of the possible food terms.
VSM and Bayes search results
Lars Olson
My Test Data
• 5 positive, 6 negative (including obituaries)
• VSM:
  – Using 83% threshold:
    • Precision: 4/5 = 80%
    • Recall: 4/5 = 80%
  – Using 80% threshold (accepts one training doc incorrectly):
    • Precision: 5/6 = 83.3%
    • Recall: 5/5 = 100%
• Bayes:
  – Precision: 5/5 = 100%
  – Recall: 5/5 = 100%
TA Test Data
• 1 positive, 23 negative (including obituaries)
• VSM:
  – Precision: 1/2 = 50%
  – Recall: 1/1 = 100%
• Bayes:
  – Precision: 1/1 = 100%
  – Recall: 1/1 = 100%
Comments
• Obituaries vs. genealogy data?
  – Rejected by Bayes, but obituary examples in the training set could affect that
  – Changes VSM to 100% precision and recall for both test sets at the 80% threshold (although one training doc is still accepted incorrectly)
• Incomplete lexicons
• High variance (Gender: 0.7% to 100%; Place: 0% to 84.3% in training documents)
• Zero vector undefined in VSM
Craig Parker
My Results
• VSM
  – cut-off value 0.85
  – 100% correct
• Bayesian
  – Classified everything as a non-drug
DEG Results
• VSM
  – 100% correct using the predetermined cutoff value of 0.85 (I think)
• Bayesian
  – Identified everything as negative (although the margin was smaller on drugs than on non-drugs)
Comments
• VSM worked very well for drugs.
  – Would have been even better with a cleaner dictionary of drug names.
  – Dose and Form were the most important distinguishers.
• Something is wrong with my Bayesian calculations.
Project 2 - Radio Controlled Cars
Jeff Roth
Results - My Tests
Document           VSM Classification   Naïve Bayes Classification   VSM AND NB
RC Universe        Yes                  Yes                          Yes
RC Web Board       Yes                  Yes                          Yes
RC-X               Yes                  Yes                          Yes
RCMT               Yes                  Yes                          Yes
RC Old             Yes                  Yes                          Yes
Ebay RC            Yes                  Yes                          Yes
Stormer Racing     Yes                  Yes                          Yes
RC Racing          Yes                  Yes                          Yes
Jobs               No                   No                           No
Obituaries         No                   No                           No
Real Estate        Yes                  No                           No
Books              No                   Yes                          No
Campgrounds        No                   No                           No
Digital Cameras    Yes                  Yes                          Yes
Cell Phone Plan    No                   Yes                          No
Drugs              No                   Yes                          No
RC Boats           No                   Yes                          No
RC Planes          No                   Yes                          No
Recall/Precision   89%                  67%                          94%
Document            VSM Classification   Naïve Bayes Classification   VSM AND NB
Aptrental           No                   No                           No
Book                No                   Yes                          No
cameraaccessories   No                   Yes                          No
Campground          No                   No                           No
Car                 No                   No                           No
Cd                  No                   Yes                          No
Cellphoneplan       No                   No                           No
Digitalcamera       Yes                  Yes                          Yes
Drug                No                   Yes                          No
Drugfreeschedule    No                   Yes                          No
Ebay.car            No                   No                           No
Gems                No                   Yes                          No
Genealogy           No                   No                           No
Hardware            No                   Yes                          No
Jewelry             No                   No                           No
Listbooks           Yes                  No                           No
Movie               No                   Yes                          No
obituary            No                   No                           No
phonemanufactures   No                   No                           No
rcc                 Yes                  Yes                          Yes
Realestate          No                   No                           No
Restaurant          No                   No                           No
Shopping            No                   No                           No
Software            No                   No                           No
Recall/Precision    92%                  63%                          96%
Comments
• Digital Camera was always positive; it even outscored RC Car ads on VSM – lots of matches on battery and charger
• Both algorithms had trouble with very unrelated documents – docs where almost no term matches were found
• Naïve Bayes had the most trouble when the test set wasn't similar to RC Cars or any of the documents used in the training set
• Combining VSM with NB using a logical AND was very successful
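The logical-AND combination described above is trivial to express. A minimal sketch (function names are placeholders, not the project's code), taking per-document boolean decisions from each classifier:

```python
def combine_and(vsm_relevant, nb_relevant):
    """Each argument is a list of booleans, one decision per test document.
    A document is accepted only when BOTH classifiers accept it."""
    return [v and n for v, n in zip(vsm_relevant, nb_relevant)]
```

The AND trades recall for precision: any document mislabeled positive by only one of the two classifiers is rejected, which is why the combined column above removes the false positives that VSM and NB each make alone.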
VSM results

            Test1 (T.VSM)   Test2 (T.VSM)   Test1 (Ont.VSM)   Test2 (Ont.VSM)
Recall      80%             100%            100%              100%
Precision   100%            100%            100%              100%
• Weight(i) = n+(i)/N+ − n−(i)/N−
[Chart: Sim distribution across files]
Weight(i) = n+(i)/N+
Threshold = avg(sim(+)) − avg(sim(−)) ≈ 0.61
[Chart: Sim distribution across files]
Traditional VSM vs. Onto. VSM
• Considers not only attributes, but also values
• Achieves keyword clustering
• Need to find a way to automatically and efficiently define the query words
[Chart: Sim distribution (Traditional VSM) across files; series: Train(+), Train(−), Test1(+), Test1(−), Test2(+), Test2(−)]
[Chart: Sim distribution across files; series: Train(+), Train(−), Test1(+), Test1(−), Test2(+), Test2(−)]
[Chart: Sim distribution across files; series: Train(+), Train(−), Test1(+), Test1(−), Test2(+), Test2(−)]
Naïve Bayes Results

        Precision   Recall
Test1   100%        100%
Test2   100%        100%

        Training   Test1   Test2
(+)     10         5       1
(-)     30         5       23

Bayes:
• Requires a relatively large training set, especially for the (−) set
• Requires a good distribution of the training set

Improvements:
• Eliminate stopwords (obtained from: http://www.oac.cdlib.org/help/stopwords.html)
• Ignore case
Conclusion
• Both work fine
• Naïve Bayes: pickier about the training set, but does not depend on pre-defined keywords or the ontology
• VSM: application-dependent; performs better; provides a relevance ranking
Finding documents about campgrounds
Alan Wessman
Results for My Test Set
• VSM:
  – Precision: 100%
  – Recall: 100%
  – F-measure: 100%
  – Classification threshold value = 0.660
• Naïve Bayes:
  – Precision: 86%
  – Recall: 100%
  – F-measure: 92%
Results for Class Test Set
• VSM:
  – Precision: 20% (1/5)
  – Recall: 100% (1/1)
  – F-measure: 33%
• Naïve Bayes:
  – Precision: 20% (1/5)
  – Recall: 100% (1/1)
  – F-measure: 33%
Observations
• Calculating probabilities in NB: the product of many small probabilities underflows to zero
• NB: Accuracy affected by number and percentage of tokens found in vocabulary
• VSM: Accuracy strongly affected by how similarly the different documents “support” the ontology
• VSM: Choosing a higher threshold (0.730) would have given F = 75% for my test set and F = 66% for the class test set
Text Classification
CS652 Project #2
Yuanqiu (Joe) Zhou
Vector Space Model
• Query Vector (based on 34 records)
  – constructed from a document with 34 records
  – Brand (1.0), Model (1.0), CCDResolution (1.0), ImageResolution (0.65), OpticalZoom (1.0), DigitalZoom (0.88)
• Threshold 0.92
  – Obtained by computing the similarities to the query of two relevant documents (0.99, 0.98) and two similar documents (0.74, 0.83)
• Document Vectors
  – Self-collected
    • 5 positive (> 0.97) and 5 negative (< 0.89)
    • Recall = 100% and Precision = 100%
  – TA-provided
    • Positive (= 0.99), Negative (< 0.58)
    • Recall = 100% and Precision = 100%
Naïve Bayes Classifier
• Training Set
  – 20 positive
  – 28 negative (20 of them very similar)
• Testing Set
  – Self-collected
    • 10 positive
    • 15 negative (10 of them very similar)
    • |Ra| = 8, R = 10, A = 12
    • Recall = 80%, Precision = 66%
  – TA-provided
    • |Ra| = 1, R = 1, A = 1
    • Recall = 100%, Precision = 100%
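The recall/precision arithmetic above, restated in the slide's notation (R = number of relevant documents, A = number of accepted documents, |Ra| = accepted documents that are actually relevant):

```python
def recall_precision(ra, r, a):
    """Recall = |Ra|/R, Precision = |Ra|/A."""
    return ra / r, ra / a

# Self-collected set: |Ra| = 8, R = 10, A = 12
recall, precision = recall_precision(8, 10, 12)
# recall = 0.8 (80%); precision = 8/12, i.e. about 0.667 (the slide's 66%)
```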
Comments
• The VSM model yields high recall and precision if and only if the ontology demo can extract the desired values correctly
• The original Naïve Bayes Classifier has trouble classifying some pages in special cases and needs to be fine-tuned in some ways (stop words, positive word density, etc.)