USE RECIPE INGREDIENTS TO PREDICT THE CATEGORY OF CUISINE
Group 7 – MEI, Yan & HUANG, Chenyu
Outline
• Background & Challenge
• Preprocessing
  • Lemmatization
  • TF-IDF
• Classification
  • k-NN
  • Naive Bayes
  • Linear Support Vector Classification
  • Logistic Regression Classification
  • Random Forest
• Evaluation
• Conclusion
Background
• A problem from Kaggle: predict the category of cuisine from the recipe ingredients.
  • pasta -> Italian, kimchi -> Korean, curry -> Indian
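The Kaggle training data is a JSON list of recipes, each with an id, a cuisine label, and a list of ingredient strings. A minimal sketch (with made-up toy recipes) of loading that format and joining each ingredient list into one text "document" for the later steps:

```python
import json

# Toy records in the shape of the Kaggle train.json file:
# an id, one cuisine label, and a list of ingredient strings per recipe.
raw = json.dumps([
    {"id": 1, "cuisine": "italian", "ingredients": ["penne pasta", "tomatoes", "basil"]},
    {"id": 2, "cuisine": "korean", "ingredients": ["kimchi", "rice", "sesame oil"]},
])

recipes = json.loads(raw)

# Join each ingredient list into a single text "document";
# this is the input shape the TF-IDF vectorizer expects.
docs = [" ".join(r["ingredients"]) for r in recipes]
labels = [r["cuisine"] for r in recipes]
```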
Challenge
• Multi-class classification: a huge number of cuisine labels to train, quite different from a simple 'Yes'/'No' label.
• 6715 features
  • If we used a binary indicator for every ingredient in each recipe, the training data would be too large.
• Class imbalance
  • Italian and Indian recipes dominate the dataset, while a cuisine such as "cajun_creole" appears only rarely.
Lemmatization
• Characters that are hard to handle in the data set:
  • ™ and ® – deleted; they do not influence the result.
  • French characters (é, ù) – replaced with a similar English character, making sure each word remains unique among the features after the replacement.
• Plural forms
  • e.g. eggs vs. egg.
  • NLTK (Natural Language Toolkit) lemmatizes each word according to the dictionary in the toolkit.
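A minimal sketch of this cleanup using only the standard library: the accent replacement relies on Unicode decomposition rather than a hand-written mapping (an assumption, not necessarily how the slides did it), and the NLTK lemmatization step is shown only as a comment since it needs the WordNet corpus downloaded:

```python
import unicodedata

TRADEMARK_CHARS = {"\u2122", "\u00ae"}  # ™ and ®

def clean_token(token: str) -> str:
    # Delete trademark symbols: they do not influence the result.
    token = "".join(ch for ch in token if ch not in TRADEMARK_CHARS)
    # Replace accented characters (é, ù, ...) with the closest ASCII letter,
    # e.g. "crème" -> "creme", so spelling variants map to the same feature.
    token = unicodedata.normalize("NFKD", token)
    token = token.encode("ascii", "ignore").decode("ascii")
    return token.lower().strip()

# Plural forms ("eggs" vs "egg") can then be collapsed with NLTK:
#   from nltk.stem import WordNetLemmatizer
#   WordNetLemmatizer().lemmatize("eggs")  # needs nltk.download("wordnet") once
```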
TF-IDF
• The problem is similar to labeling a document according to its content.
• Term Frequency–Inverse Document Frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
• We use the raw frequency of a term: TF(t) is the number of times term t occurs in the content.
• After lemmatization and TF-IDF, we reduced the feature count from 6715 to 2774.
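With scikit-learn, the TF-IDF step is a single vectorizer call; a sketch on toy documents (the real pipeline runs on the joined ingredient lists):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "penne pasta tomato basil",
    "kimchi rice sesame oil",
    "curry rice onion",
]

# Each recipe becomes a sparse vector of TF-IDF weights; a term that occurs
# in many recipes (like "rice" here) gets a lower IDF weight than a rare term.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
```

Pruning overly rare or overly common terms (e.g. via the vectorizer's `min_df` / `max_df` parameters) is one way to shrink the feature space, in the spirit of the 6715-to-2774 reduction above.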
k-NN
• scikit-learn implements two different nearest-neighbor classifiers:
  • KNeighborsClassifier: learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user.
  • RadiusNeighborsClassifier: learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.
• We choose the first classifier and set k = 1.
• Its result serves as the baseline for all classifiers' performance.
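A sketch of the k = 1 setup on made-up toy recipes (the classifier and vectorizer calls are the standard scikit-learn ones):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = ["penne pasta tomato", "kimchi rice", "curry masala onion"]
train_labels = ["italian", "korean", "indian"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# k = 1: each query recipe simply takes the cuisine of its nearest neighbor.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, train_labels)

pred = knn.predict(vectorizer.transform(["pasta tomato basil"]))
```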
Naive Bayes
• The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). In practice, fractional counts such as tf-idf may also work.
• Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. Decoupling the class-conditional feature distributions means each distribution can be independently estimated as a one-dimensional distribution, which helps alleviate problems stemming from the curse of dimensionality.
• It performed better than expected: the attributes (ingredients) are relatively independent compared with word vectors in ordinary text.
Parameters
• Default: alpha = 1.
• We set alpha = 0.01, so the smoothing count N (N < 1) stays much smaller than n.
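The setting above is one constructor argument; a sketch on toy data (alpha is scikit-learn's additive smoothing parameter):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["penne pasta tomato", "kimchi rice cabbage", "curry masala rice"]
train_labels = ["italian", "korean", "indian"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# alpha is the additive smoothing parameter (default 1.0); a small value keeps
# the smoothing mass tiny relative to the observed tf-idf weights.
nb = MultinomialNB(alpha=0.01)
nb.fit(X_train, train_labels)

pred = nb.predict(vectorizer.transform(["pasta tomato"]))
```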
Linear Support Vector Classification
• The advantages of support vector machines are:
  • Effective in high-dimensional spaces.
  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
  • Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
• The disadvantages of support vector machines include:
  • If the number of features is much greater than the number of samples, the method is likely to perform poorly.
  • SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
Linear Support Vector Classification
• Multiclass support is handled according to a one-vs-rest (one-vs-all) scheme.
• LinearSVC uses a linear kernel; the Radial Basis Function (RBF) kernel is a common alternative for the general, kernelized SVC.
Parameters
• Defaults:
  • Penalty parameter C of the error term is 1.0.
  • dual = True.
• We set:
  • C = 0.8
  • dual = False
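Both settings are LinearSVC constructor arguments; a sketch on toy data (dual=False solves the primal problem, the usual choice when samples outnumber features, as they do in the full dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = ["penne pasta tomato", "kimchi rice cabbage", "curry masala onion"]
train_labels = ["italian", "korean", "indian"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# C=0.8 regularizes slightly more strongly than the default C=1.0;
# with three classes, LinearSVC trains one one-vs-rest classifier per cuisine.
svc = LinearSVC(C=0.8, dual=False)
svc.fit(X_train, train_labels)

pred = svc.predict(vectorizer.transform(["pasta tomato"]))
```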
Logistic Regression Classification
• Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt), or the log-linear classifier.
• In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
Parameters
• We use GridSearchCV to find the best parameters.
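A sketch of the grid search on toy data; the grid values and the toy recipes here are illustrative assumptions, not the ones from the slides:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

train_docs = [
    "penne pasta tomato", "pasta parmesan basil", "tomato basil olive",
    "kimchi rice cabbage", "kimchi sesame tofu", "gochujang rice cabbage",
    "curry masala onion", "curry lentil rice", "masala onion ginger",
]
train_labels = ["italian"] * 3 + ["korean"] * 3 + ["indian"] * 3

X_train = TfidfVectorizer().fit_transform(train_docs)

# Exhaustively try each parameter combination with cross-validation and keep
# the best-scoring one. C is the inverse regularization strength.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=3,
)
grid.fit(X_train, train_labels)
```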
Random Forest
• A random forest is a meta-estimator that fits a number of decision-tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
• A diverse set of classifiers is created by introducing randomness into the classifier construction. The prediction of the ensemble is the averaged prediction of the individual classifiers.
Parameters
• By default, the number of trees in the forest is 10.
• We set the number of trees to 100.
  • More trees cover more features.
  • The more trees the better, but the longer the computation takes.
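A sketch with the tree count raised to 100 on toy data (random_state is an added assumption, only there to make the sketch reproducible):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["penne pasta tomato", "kimchi rice cabbage", "curry masala onion"]
train_labels = ["italian", "korean", "indian"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# n_estimators=100 instead of the old default of 10: each tree considers a
# random subset of features, so more trees cover more of the feature space.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, train_labels)

pred = rf.predict(vectorizer.transform(["pasta tomato"]))
```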
Evaluation Setup
• Python 3.3 for Windows
• Two libraries:
  • NLTK (Natural Language Toolkit)
  • scikit-learn
• Evaluation metrics:
  • Accuracy
  • Time
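The two metrics can be collected with a small helper; this harness is our sketch, not taken from the slides (accuracy_score is scikit-learn's, the timing uses the standard library):

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def evaluate(clf, X_train, y_train, X_test, y_test):
    """Fit, predict, and return (accuracy, elapsed seconds) for one classifier."""
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    return accuracy_score(y_test, pred), time.perf_counter() - start

# Tiny smoke test of the harness on made-up recipes.
vec = TfidfVectorizer()
X_train = vec.fit_transform(["penne pasta tomato", "kimchi rice cabbage"])
X_test = vec.transform(["pasta tomato"])
acc, secs = evaluate(MultinomialNB(), X_train, ["italian", "korean"], X_test, ["italian"])
```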
Accuracy

[Bar chart: accuracy of each classifier with custom vs. default parameters; x-axis order LSVC, LR, Naive Bayes, KNN, Random Forest; y-axis 0.62–0.80. Extracted bar values: 0.78811, 0.78922, 0.73431, 0.7134, 0.75422, 0.69479, 0.77967, 0.68292, 0.70062.]
Time

• The time for 1-NN is longer than 5 hours.

| Classifier | Time w/ custom parameters | Time w/ default parameters |
| --- | --- | --- |
| LSVC | 19.53 | 12.05 |
| LR | 79.14 | 75.5 |
| Naive Bayes | 0.8 | 0.84 |
| Random Forest | 118.27 | 12.56 |
Conclusion
• The preprocessing step dramatically reduces execution time.
• Different parameters significantly affect the results.
• Considering both accuracy and time, Linear SVC is the best choice.