My First Attempt on Kaggle - Higgs Machine Learning Challenge: 755th and Proud!
dhiana-deva -
Higgs Challenge
My first attempt at Kaggle: 755th and proud!
@dhianadeva
Err… Kaggle?!
Platform for data science competitions
Machine Learning, Big Data, Statistics, Data mining ...
Community for data scientists
Users, leaderboard, forums …
Sponsors!
$$$ponsored competitions!
We don’t need no PhD!
Yes, we can!
My guilty pleasure:
Student license of MATLAB <3
Open source alternatives:
Python + Scikit + Numpy + …
R + randomForest + e1071 + caret + …
Octave!?
Higgs Challenge
Datasets
Training (labeled):
250k events
30 features
Event id, weight and class (s/b)
Test (unlabeled):
550k events
18% public
82% private
training.csv
EventId , DER_mass_MMC , … , Weight , Class
100000 , 138.47 , … , 0.00265331133733 , s
100001 , 160.937 , … , 2.23358448717 , b
100002 , -999.0 , … , 2.34738894364 , b
100003 , 143.905 , … , 5.44637821192 , b
…
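The -999.0 in the DER_mass_MMC column is the challenge's marker for a value that could not be computed for that event, so it should not be fed to a model as a real measurement. A minimal Python sketch of one way to handle it, loosely in the spirit of the fixunknowns step listed later in the deck: replace the marker with the mean of the known values and keep a flag saying which values were known. The function name `fill_unknowns` is hypothetical, and this is an illustration, not the code used in the competition.

```python
MISSING = -999.0  # marker the challenge data uses for undefined values


def fill_unknowns(column, missing=MISSING):
    """Replace marker values with the mean of the known values.

    Returns (filled_column, known_flags), where known_flags[i] is 1.0 if
    the original value was known and 0.0 if it was the marker.
    """
    known = [v for v in column if v != missing]
    # Fall back to 0.0 if every value in the column is unknown.
    mean = sum(known) / len(known) if known else 0.0
    filled = [v if v != missing else mean for v in column]
    flags = [1.0 if v != missing else 0.0 for v in column]
    return filled, flags


# DER_mass_MMC values of the four sample training rows
mass = [138.47, 160.937, -999.0, 143.905]
filled, flags = fill_unknowns(mass)
```

The flag column can be given to the model as an extra input, so it can still learn from the fact that a value was missing.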
test.csv
EventId , DER_mass_MMC , … , PRI_jet_all_pt
350000 , -999 , … , -0.0
350001 , 106.398 , … , 47.575
350002 , 117.794 , … , 0.0
350003 , 135.861 , … , 0.0
…
submission.csv
EventId , RankOrder , Class
350000 , 262328 , b
350001 , 201479 , b
350002 , 212810 , b
350003 , 134945 , b
…
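A submission file with these three columns can be produced from any per-event score. The Python sketch below is only an illustration: it assumes RankOrder is a permutation of 1..N in which a larger rank means a more signal-like event, and that events scoring above a chosen threshold are labeled s. The function names and the scores are hypothetical.

```python
import csv
import io


def build_submission(event_ids, scores, threshold):
    """Return rows (EventId, RankOrder, Class) for a list of scores.

    Assumption: RankOrder is a permutation of 1..N where a larger rank
    means a more signal-like event; events scoring above `threshold`
    are labeled 's', the rest 'b'.
    """
    # Indices sorted from the lowest score to the highest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        rank[index] = position
    return [
        (event_ids[i], rank[i], "s" if scores[i] > threshold else "b")
        for i in range(len(scores))
    ]


def write_submission(rows, handle):
    """Write the header and the rows as CSV to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["EventId", "RankOrder", "Class"])
    writer.writerows(rows)


# Hypothetical classifier scores for four test events
rows = build_submission([350000, 350001, 350002, 350003],
                        [0.10, 0.92, 0.35, 0.81], threshold=0.8)
buffer = io.StringIO()
write_submission(rows, buffer)
```

Writing to an in-memory buffer keeps the sketch free of file paths; an open file handle works the same way.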
End-to-end
A little math...
(Approximate Median Significance)
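The formula itself did not survive in this transcript. As defined by the challenge, AMS = sqrt(2 * ((s + b + b_reg) * ln(1 + s / (b + b_reg)) - s)), where s is the summed weight of true signal events selected as signal, b is the summed weight of background events selected as signal, and b_reg = 10 is a regularization constant. A small Python sketch of that definition; the helper name `weighted_counts` is hypothetical.

```python
import math


def ams(s, b, b_reg=10.0):
    """Approximate Median Significance for weighted counts s and b."""
    return math.sqrt(
        2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s)
    )


def weighted_counts(labels, predictions, weights):
    """Sum the weights of true and false positives among predicted 's'."""
    s = sum(w for y, p, w in zip(labels, predictions, weights)
            if p == "s" and y == "s")
    b = sum(w for y, p, w in zip(labels, predictions, weights)
            if p == "s" and y == "b")
    return s, b
```

Because only events predicted as signal enter s and b, the score rewards a selection that is rich in signal, not overall accuracy.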
755th/1785 secrets
I entered in the last 8 days of the 127-day challenge and still overtook more than half of the competitors using:
MATLAB 2014b (student license)
Neural Network Toolbox
$20 of EC2 at Amazon Web Services
9 code files totaling 674 words
Neural netwhat?!
Neurons
[Diagram: inputs, output]
For now, a black box!
[Diagram: inputs, output]
It trains
[Diagram: inputs, output, target, error]
It runs
[Diagram: inputs, output]
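To make the "it trains, it runs" idea concrete, here is a toy Python sketch of a single neuron. Training compares the output with the target and uses the error to nudge the weights; running just computes the output from the inputs. This is an assumption-laden illustration: it uses plain gradient descent on one sigmoid neuron and a tiny OR-gate dataset, not the trainlm algorithm or the network used in the competition.

```python
import math


def run(weights, bias, inputs):
    """Forward pass: weighted sum of the inputs squashed by a sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, targets, epochs=2000, rate=0.5):
    """Adjust weights and bias so outputs move toward the targets."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in zip(samples, targets):
            # Error between what we want and what the neuron says.
            error = target - run(weights, bias, inputs)
            # Move each weight in the direction that shrinks the error.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Toy data: the logical OR of two inputs
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0.0, 1.0, 1.0, 1.0]
weights, bias = train(samples, targets)
outputs = [run(weights, bias, x) for x in samples]
```

A real network stacks many such neurons in layers, but the loop is the same: run, measure the error, adjust.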
Moonlighting!
1. nprtool
2. fixunknowns
3. trainlm
4. processpca
5. 0.8 threshold
6. ams threshold pick
7. hidden neurons pick
8. 0.25*targets + amsweights
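Steps 5 and 6 suggest moving from a fixed 0.8 cut on the network output to a cut chosen by AMS. A minimal Python sketch of that idea, assuming a labeled and weighted validation set and a list of candidate thresholds; the function name `pick_threshold` and all the numbers are hypothetical, and this is not the original MATLAB code.

```python
import math


def ams(s, b, b_reg=10.0):
    """Approximate Median Significance with the challenge's b_reg = 10."""
    return math.sqrt(
        2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s)
    )


def pick_threshold(scores, labels, weights, candidates):
    """Return (threshold, ams) for the candidate with the highest AMS."""
    best_threshold, best_ams = None, -1.0
    for threshold in candidates:
        # Weighted true positives among events selected as signal.
        s = sum(w for p, y, w in zip(scores, labels, weights)
                if p > threshold and y == "s")
        # Weighted false positives among events selected as signal.
        b = sum(w for p, y, w in zip(scores, labels, weights)
                if p > threshold and y == "b")
        value = ams(s, b)
        if value > best_ams:
            best_threshold, best_ams = threshold, value
    return best_threshold, best_ams


# Hypothetical validation scores, true classes and event weights
scores = [0.95, 0.90, 0.85, 0.60, 0.40, 0.20]
labels = ["s", "s", "b", "s", "b", "b"]
weights = [1.0, 1.0, 2.0, 1.0, 2.0, 2.0]
best_threshold, best_ams = pick_threshold(
    scores, labels, weights, candidates=[0.1, 0.3, 0.5, 0.7, 0.8, 0.88])
```

Picking the cut on held-out data instead of the training set helps avoid choosing a threshold that only looks good by chance.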
8 days!
Some stats...
Day 1 through Day 8 (one slide per day)
Oops!
(Weighted errors using AMS, regularization, mapstd, … nothing worked!)
Lessons learned
+ Optimize self-learning by doing things from scratch (or from a default baseline)
+ Kaggle is way more fun than studying with traditional datasets (iris, cancer, thyroid...)
+ Data science needs good engineering practices!
+ The competition fact sheet was a great way of assessing what I know I know, what I know I don’t know…
Let’s hack?!
Re-considering PCA
PCD?
Dimensionality Reduction
Stop on best AMS (hack nn toolbox!)
Ensemble
Auto-encoder
MATLAB unit tests
MATLAB continuous integration
Thanks! ;)