USE RECIPE INGREDIENTS TO PREDICT THE CATEGORY OF CUISINE Group 7 – MEI, Yan & HUANG, Chenyu.

Outline
• Background & Challenge
• Preprocessing
  • Lemmatization
  • TF-IDF
• Classification
  • k-NN
  • Naive Bayes
  • Linear Support Vector Classification
  • Logistic Regression Classification
  • Random Forest
• Evaluation
• Conclusion

Background
• A problem from Kaggle
• Predict the category of cuisine from the recipe ingredients
  • pasta -> Italian, kimchi -> Korean, curry -> Indian

Challenge
• Multi-class classification
• 6715 features
  • If we use a binary label for every ingredient in each recipe, the training data will be too large.
• Huge number of labels to train
  • Quite different from a ‘Yes’ or ‘No’ label.
• Class imbalance
  • Italian and Indian food dominate the recipes, while a cuisine such as “cajun_creole” is seldom seen.
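
Class imbalance of this kind can be checked by counting labels before training. The snippet below is a minimal sketch on invented toy labels (not the Kaggle data); the variable names are illustrative only.

```python
from collections import Counter

# Toy cuisine labels, invented for illustration only (not the Kaggle data).
labels = ["italian"] * 6 + ["indian"] * 4 + ["korean"] * 2 + ["cajun_creole"]

counts = Counter(labels)
total = sum(counts.values())

# Share of each class, most frequent first.
shares = {cuisine: n / total for cuisine, n in counts.most_common()}
for cuisine, share in shares.items():
    print(f"{cuisine:<14}{share:.2f}")
```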

Lemmatization
• Characters that are hard to handle in the data set
  • ™ and ®: delete them; they do not influence the result.
  • French characters (é, ù): replace each with a similar English character, guaranteeing that the word is still unique among the features after replacing.
• Plural forms
  • eggs and egg.
  • NLTK (Natural Language Toolkit): lemmatize each word according to the dictionary in the toolkit.
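
The cleaning steps above might be sketched as follows. This is an assumption about how such a step could look, not the authors' code: `clean_ingredient` is a hypothetical helper, and the lemmatizer is passed in as a plain callable so the sketch runs without any corpus download. It does not check that words stay unique after accent removal.

```python
import re
import unicodedata

def clean_ingredient(text, lemmatize=None):
    """Normalize one ingredient string (hypothetical helper)."""
    # Drop the trademark symbols (TM and registered sign) first.
    text = re.sub("[\u2122\u00ae]", "", text)
    # Decompose accented letters, then drop the combining marks: é -> e.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = text.lower().split()
    # Optionally reduce each word to its base form (eggs -> egg).
    if lemmatize is not None:
        words = [lemmatize(word) for word in words]
    return " ".join(words)

print(clean_ingredient("Crème Fraîche®"))
```

With NLTK installed and the WordNet corpus fetched via `nltk.download("wordnet")`, `WordNetLemmatizer().lemmatize` can be passed as the `lemmatize` argument.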

TF-IDF
• The problem is similar to labeling a document according to its content.
• Term Frequency–Inverse Document Frequency (TF-IDF)
  • a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
• Using the raw frequency of a term
  • TF(t) is the number of times that term t occurs in the content.
• After lemmatization and TF-IDF, we reduce the number of features from 6715 to 2774
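
As an illustration of the TF-IDF step, the sketch below treats each recipe as a short document and uses scikit-learn's `TfidfVectorizer` with default settings (smoothed IDF, L2 normalization). The toy recipes are invented, and the vectorizer settings the authors used are not stated in the slides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy recipes, invented for illustration; each recipe is a list of ingredients.
recipes = [
    ["egg", "flour", "sugar"],
    ["egg", "soy sauce", "rice"],
    ["flour", "tomato", "basil"],
]

# Join the ingredients so each recipe becomes one "document".
docs = [" ".join(ingredients) for ingredients in recipes]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: recipes x terms

# The default tokenizer splits on whitespace, so "soy sauce" yields two terms.
vocab = vectorizer.vocabulary_
print(X.shape)
```

A term that occurs in fewer recipes (here "sugar") receives a larger weight than one that occurs in several (here "egg") within the same recipe.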

k-NN
• scikit-learn implements two different nearest neighbors classifiers:
  • KNeighborsClassifier: implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user.
  • RadiusNeighborsClassifier: implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.
• We choose the first classifier and set k = 1.
• The result should be taken as the baseline for all classifiers’ performance.
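
A 1-NN baseline of this kind could look like the sketch below. The toy recipes and labels are invented; only `KNeighborsClassifier` with `n_neighbors=1` comes from the slides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Toy training data, invented for illustration.
train_docs = [
    "pasta tomato basil", "pasta garlic olive",
    "kimchi rice sesame", "kimchi tofu scallion",
    "curry cumin rice", "curry lentil turmeric",
]
train_labels = ["italian", "italian", "korean", "korean", "indian", "indian"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# k = 1: each query takes the label of its single nearest training recipe.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, train_labels)

query = vectorizer.transform(["kimchi sesame garlic"])
prediction = knn.predict(query)[0]
print(prediction)
```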

Naive Bayes
• The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). In practice, fractional counts such as tf-idf may also work.
• Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
• Performed better than expected: the attributes are relatively independent compared with word vectors in text.

Parameters
• Default alpha = 1
• We set alpha = 0.01 because N (N < 1) is much smaller than n
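
The smoothing setting above corresponds to the `alpha` argument of scikit-learn's `MultinomialNB`. The sketch below applies it to invented toy recipes with TF-IDF features; the data is an assumption for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration.
train_docs = [
    "pasta tomato basil", "pasta garlic olive",
    "kimchi rice sesame", "kimchi tofu scallion",
    "curry cumin rice", "curry lentil turmeric",
]
train_labels = ["italian", "italian", "korean", "korean", "indian", "indian"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# A small alpha keeps the smoothing term from swamping the fractional
# TF-IDF weights, which are all below 1.
nb = MultinomialNB(alpha=0.01)
nb.fit(X_train, train_labels)

query = vectorizer.transform(["curry cumin"])
prediction = nb.predict(query)[0]
probabilities = nb.predict_proba(query)[0]
print(prediction)
```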

Linear Support Vector Classification
• The advantages of support vector machines are:
  • Effective in high dimensional spaces.
  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
  • Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
• The disadvantages of support vector machines include:
  • If the number of features is much greater than the number of samples, the method is likely to give poor performance.
  • SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

Linear Support Vector Classification

Multiclass support is handled according to a One-Vs-All scheme

Radial Basis Function kernel

Parameters
• Default parameters
  • Penalty parameter C of the error term is 1.0
  • dual = True
• We set
  • C = 0.8
  • dual = False
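
In scikit-learn these settings map onto the `C` and `dual` arguments of `LinearSVC`. A minimal sketch on invented toy recipes follows; everything except the two parameter values is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration.
train_docs = [
    "pasta tomato basil", "pasta garlic olive",
    "kimchi rice sesame", "kimchi tofu scallion",
    "curry cumin rice", "curry lentil turmeric",
]
train_labels = ["italian", "italian", "korean", "korean", "indian", "indian"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# C = 0.8 regularizes slightly more than the default 1.0;
# dual=False solves the primal form of the optimization problem.
svc = LinearSVC(C=0.8, dual=False)
svc.fit(X_train, train_labels)

# One weight vector per cuisine (one-vs-rest).
print(svc.coef_.shape)
prediction = svc.predict(vectorizer.transform(["pasta basil olive"]))[0]
```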

Logistic Regression Classification
• Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier.
• In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

Parameters
• We use GridSearchCV to find the best parameters.
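
A parameter search with `GridSearchCV` could be set up as sketched below. The slides do not say which parameters or values were searched, so the grid over `C`, the 2-fold split and the toy recipes are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration.
docs = [
    "pasta tomato basil", "pasta garlic olive",
    "kimchi rice sesame", "kimchi tofu scallion",
    "curry cumin rice", "curry lentil turmeric",
]
labels = ["italian", "italian", "korean", "korean", "indian", "indian"]

# The pipeline refits the vectorizer inside every cross-validation fold.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Hypothetical grid: only the regularization strength C is searched.
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds the pipeline refit on all the data with the best parameters found.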

Random Forest
• A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
• A diverse set of classifiers is created by introducing randomness in the classifier construction. The prediction of the ensemble is given as the averaged prediction of the individual classifiers.

Parameters
• By default, the number of trees in the forest is 10
• We set the number of trees in the forest to 100
  • More trees will cover more features.
  • The larger the better, but also the longer it will take to compute.
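
The number of trees is the `n_estimators` argument of `RandomForestClassifier`. The sketch below sets it explicitly, since the default differs between scikit-learn versions; the toy recipes and the `random_state` value are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training data, invented for illustration.
train_docs = [
    "pasta tomato basil", "pasta garlic olive",
    "kimchi rice sesame", "kimchi tofu scallion",
    "curry cumin rice", "curry lentil turmeric",
]
train_labels = ["italian", "italian", "korean", "korean", "indian", "indian"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# 100 trees, as in the slides; random_state only makes the sketch repeatable.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, train_labels)

prediction = forest.predict(vectorizer.transform(["pasta basil olive"]))[0]
print(len(forest.estimators_), prediction)
```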

Evaluation Setup
• Python 3.3 for Windows
• Two libraries
  • NLTK (Natural Language Toolkit)
  • Scikit-learn
• Evaluation metrics
  • Accuracy
  • Time
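
Both metrics can be collected with a small helper like the one sketched below. The slides do not describe how accuracy and time were measured, so the held-out test set, the `evaluate` helper and the toy data are assumptions.

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def evaluate(model, X_train, y_train, X_test, y_test):
    """Return (accuracy, seconds) for fitting and predicting (hypothetical helper)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    elapsed = time.perf_counter() - start
    return accuracy_score(y_test, predicted), elapsed

# Toy data, invented for illustration.
train_docs = [
    "pasta tomato basil", "pasta garlic olive",
    "kimchi rice sesame", "kimchi tofu scallion",
    "curry cumin rice", "curry lentil turmeric",
]
train_labels = ["italian", "italian", "korean", "korean", "indian", "indian"]
test_docs = ["pasta basil", "kimchi tofu", "curry lentil"]
test_labels = ["italian", "korean", "indian"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

accuracy, seconds = evaluate(MultinomialNB(alpha=0.01), X_train, train_labels, X_test, test_labels)
print(f"accuracy={accuracy:.3f} time={seconds:.4f}s")
```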

Accuracy

[Bar chart: accuracy (axis from 0.62 to 0.8) for LSVC, LR, NaiveBayes, KNN and RandomForest, comparing accuracy w/ custom parameters and accuracy w/ default parameters. Bar labels in extracted order: 0.78811, 0.78922, 0.73431, 0.7134, 0.75422, 0.69479, 0.77967, 0.68292, 0.70062.]

Time

• The time for 1NN is longer than 5 hours

[Bar chart: time (axis from 0 to 140) for LSVC, LR, NaiveBayes and RandomForest, comparing time w/ custom parameters and time w/ default parameters. Bar labels in extracted order: 19.53, 79.14, 0.8, 118.27, 12.05, 75.5, 0.84, 12.56.]

Conclusion
• The preprocessing step dramatically reduces the execution time.
• Different parameters significantly affect the results.
• Considering both accuracy and time, Linear SVC is the best choice.
