USE RECIPE INGREDIENTS TO PREDICT THE CATEGORY OF CUISINE Group 7 – MEI, Yan & HUANG, Chenyu.

Page 2:

Outline
• Background & Challenge
• Preprocessing
  • Lemmatization
  • TF-IDF
• Classification
  • k-NN
  • Naive Bayes
  • Linear Support Vector Classification
  • Logistic Regression Classification
  • Random Forest
• Evaluation
• Conclusion

Page 4:

Background
• A problem from Kaggle
• Predict the category of cuisine from the recipe ingredients
  • pasta -> Italian, kimchi -> Korean, curry -> Indian

Page 5:

Challenge
• Multi-class classification
• 6715 features
  • If we use a binary label for every ingredient in each recipe, the training data will be too large.
• Huge number of labels to train
  • Quite different from a ‘Yes’ or ‘No’ label.
• Class-imbalanced
  • Italian and Indian food dominate the recipes, while a cuisine such as “cajun_creole” is seldom seen.

Page 8:

Lemmatization
• Characters hard to handle in the data set
  • ™ and ®: deleted, since they do not influence the result.
  • French characters (é, ù): replaced by a similar English character, making sure the word is still unique among the features after replacement.
• Plural forms
  • eggs and egg.
  • NLTK (Natural Language Toolkit): lemmatizes each word according to the dictionary in the toolkit.
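The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the authors' code: `clean_ingredient` and its `lemmatize` argument are hypothetical names, and the lemmatizer is passed in as a callable so that NLTK's `WordNetLemmatizer().lemmatize` (which needs the WordNet corpus to be downloaded) can be plugged in.

```python
import unicodedata


def clean_ingredient(text, lemmatize=None):
    """Normalize one ingredient string (hypothetical helper)."""
    # Delete the trademark and registered signs first.
    for symbol in ("\u2122", "\u00ae"):
        text = text.replace(symbol, "")
    # Split accented letters into base letter + combining mark,
    # then drop the marks (e.g. "é" becomes "e").
    decomposed = unicodedata.normalize("NFD", text)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = plain.lower().split()
    # Optionally reduce each word to its dictionary form, e.g. with
    # nltk.stem.WordNetLemmatizer().lemmatize (assumed, not required here).
    if lemmatize is not None:
        words = [lemmatize(word) for word in words]
    return " ".join(words)


cleaned = clean_ingredient("Cr\u00e8me Fra\u00eeche\u00ae")
```

Whether two different ingredients collapse into the same string after accent removal still has to be checked separately, as the slide notes.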

Page 9:

TF-IDF
• The problem is similar to labeling a document according to its content.
• Term Frequency–Inverse Document Frequency (TF-IDF)
  • a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
• Using the raw frequency of a term
  • TF(t) means the number of times that term t occurs in the content.
• After lemmatization and TF-IDF, the number of features is reduced from 6715 to 2774.
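As a rough illustration of the weighting, each recipe can be treated as a document and each ingredient as a term. The sketch below uses the textbook form tf(t, d) · log(N / df(t)); this exact formula is an assumption, since the slides do not state which variant was used (scikit-learn, for instance, applies a smoothed IDF by default). The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(recipes):
    """Return one {ingredient: weight} dict per recipe."""
    n_docs = len(recipes)
    # Document frequency: in how many recipes each ingredient appears.
    doc_freq = Counter()
    for recipe in recipes:
        doc_freq.update(set(recipe))
    weights = []
    for recipe in recipes:
        term_freq = Counter(recipe)  # raw frequency of each term
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


toy_recipes = [["egg", "salt"], ["kimchi", "salt"]]
toy_weights = tf_idf(toy_recipes)
```

In this toy input "salt" occurs in every recipe, so its weight is zero, while "egg" and "kimchi" each get a weight of log 2.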

Page 11:

k-NN
• scikit-learn implements two different nearest neighbors classifiers:
  • KNeighborsClassifier: implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user.
  • RadiusNeighborsClassifier: implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.
• We choose the first classifier and set k = 1.
• The result serves as the baseline for all classifiers’ performance.
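To make the k = 1 case concrete, here is a small brute-force sketch in plain Python. It is not scikit-learn's implementation; it only mirrors the idea, using Euclidean distance (the default metric of `KNeighborsClassifier`). The helper names `euclidean` and `predict_1nn` and the toy vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Distance between two sparse vectors stored as {feature: weight}."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def predict_1nn(train_vectors, train_labels, query):
    """Return the label of the single closest training vector."""
    nearest = min(
        range(len(train_vectors)),
        key=lambda i: euclidean(train_vectors[i], query),
    )
    return train_labels[nearest]


train_vectors = [
    {"pasta": 1.0, "basil": 1.0},
    {"kimchi": 1.0, "rice": 1.0},
]
train_labels = ["italian", "korean"]
predicted = predict_1nn(train_vectors, train_labels, {"pasta": 1.0})
```

Every query is compared against every training recipe, which gives some intuition for why nearest-neighbor prediction can be slow on a large training set.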

Page 12:

Naive Bayes
• The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). In practice, fractional counts such as tf-idf may also work.
• Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
• Performed better than expected: attributes are relatively independent compared with word vectors in text.

Page 13:

Parameters
• Default alpha = 1
• We set alpha = 0.01, because N (N < 1) is much smaller than n
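The symbols N and n on this slide are not defined in the transcript. They presumably refer to scikit-learn's smoothed estimate for multinomial Naive Bayes, reproduced here as an assumption:

```latex
\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha n}
```

Here N_{yi} is the summed (possibly fractional) count of feature i over the training samples of class y, N_{y} = \sum_i N_{yi}, and n is the number of features. Under that reading, a large alpha adds the same constant to every small tf-idf count and pulls the estimates toward a uniform distribution, so a smaller alpha keeps the differences between ingredients visible.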

Page 14:

Linear Support Vector Classification
• The advantages of support vector machines are:
  • Effective in high dimensional spaces.
  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
  • Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
• The disadvantages of support vector machines include:
  • If the number of features is much greater than the number of samples, the method is likely to give poor performance.
  • SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

Page 15:

Linear Support Vector Classification

Multiclass support is handled according to One-Vs-All scheme

Radial Basis Function kernel

Page 16:

Parameters
• Default parameters
  • Penalty parameter C of the error term is 1.0
  • dual = True
• We set
  • C = 0.8
  • dual = False
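With scikit-learn, these settings could be passed to `LinearSVC` roughly as shown below. The toy one-hot vectors and labels are placeholders for illustration only; the real input would be the TF-IDF features.

```python
from sklearn.svm import LinearSVC

# Toy "ingredient" vectors (illustrative data, not from the slides).
X = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
y = ["italian", "italian", "korean", "korean", "indian", "indian"]

# C = 0.8 and dual = False as on the slide; everything else is left at
# its default, so multiclass is handled one-vs-rest.
clf = LinearSVC(C=0.8, dual=False)
clf.fit(X, y)
prediction = clf.predict([[0, 1, 0]])[0]
```

The slides do not say why `dual` was switched off. The scikit-learn documentation recommends `dual=False` when there are more samples than features, which may have been the motivation.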

Page 17:

Logistic Regression Classification
• Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier.
• In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

Page 18:

Parameters
• We use GridSearchCV to find the best parameters.
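A grid search over logistic regression parameters might look like the sketch below. The slides do not list which parameters or values were searched, so the grid over `C`, the two-fold split and the toy data are all assumptions made for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data: four copies of three one-hot "recipes" (illustrative only).
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]] * 4
y = ["italian", "korean", "indian"] * 4

# Hypothetical grid; the values actually searched are not given.
param_grid = {"C": [0.1, 1.0, 10.0]}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=2)
search.fit(X, y)

best_c = search.best_params_["C"]
prediction = search.predict([[1, 0, 0]])[0]
```

After fitting, `search.best_params_` holds the best-scoring combination, and the search object itself predicts with a model refitted on all the data using those parameters.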

Page 19:

Random Forest
• A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
• A diverse set of classifiers is created by introducing randomness in the classifier construction. The prediction of the ensemble is given as the averaged prediction of the individual classifiers.

Page 20:

Parameters
• By default, the number of trees in the forest is 10
• We set the number of trees in the forest to 100
  • More trees will cover more features.
  • The larger the better, but also the longer it will take to compute.
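In scikit-learn the number of trees is the `n_estimators` parameter. The sketch below sets it explicitly on toy data; note that the default of 10 applies to older releases, while scikit-learn 0.22 and later default to 100, so spelling the value out keeps the setting independent of the installed version. The data and `random_state` are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data: four copies of three one-hot "recipes" (illustrative only).
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]] * 4
y = ["italian", "korean", "indian"] * 4

# 100 trees as on the slide; random_state only makes the run repeatable.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

n_trees = len(forest.estimators_)
prediction = forest.predict([[0, 0, 1]])[0]
```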

Page 22:

Evaluation Setup
• Python 3.3 for Windows
• Two libraries
  • NLTK (Natural Language Toolkit)
  • scikit-learn
• Evaluation metrics
  • Accuracy
  • Time
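Both metrics can be collected with a small harness like the one below. It is a sketch under assumptions: the slides do not say whether the reported time covers training, prediction or both (here it covers both), and `evaluate` and `MajorityClassifier` are hypothetical names. Any object with scikit-learn style `fit` and `predict` methods could be passed in.

```python
import time


def evaluate(classifier, X_train, y_train, X_test, y_test):
    """Return (accuracy, elapsed seconds for fit + predict)."""
    start = time.perf_counter()
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)
    elapsed = time.perf_counter() - start
    # Accuracy: fraction of test recipes whose cuisine is predicted correctly.
    correct = sum(p == t for p, t in zip(predictions, y_test))
    return correct / len(y_test), elapsed


class MajorityClassifier:
    """Toy stand-in that always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


accuracy, seconds = evaluate(
    MajorityClassifier(),
    [[0], [1], [2]], ["italian", "italian", "indian"],
    [[3], [4]], ["italian", "korean"],
)
```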

Page 23:

Accuracy

[Bar chart: accuracy (axis from 0.62 to 0.80) of LSVC, LR, NaiveBayes, KNN and RandomForest, with one series for accuracy with custom parameters and one for accuracy with default parameters. Data labels in extraction order: 0.78811, 0.78922, 0.73431, 0.7134, 0.75422, 0.69479, 0.77967, 0.68292, 0.70062.]

Page 24:

Time
• The time for 1NN is longer than 5 hours

[Bar chart: time (axis from 0 to 140) of LSVC, LR, NaiveBayes and RandomForest, with one series for time with custom parameters and one for time with default parameters. Data labels in extraction order: 19.53, 79.14, 0.8, 118.27, 12.05, 75.5, 0.84, 12.56.]

Page 26:

Conclusion
• The preprocessing step dramatically reduces the execution time.
• Different parameters significantly affect the result.
• Considering both accuracy and time, Linear SVC is the best choice.