Simpler Machine Learning with SKLL 1.0
Dan Blanchard, Educational Testing Service
PyData NYC 2014
Survived or Perished?
• first class, female, 1 sibling, 35 years old
• third class, female, 2 siblings, 18 years old
• second class, male, 0 siblings, 50 years old
Can we predict survival from data?
SciKit-Learn Laboratory
SKLL
It's where the learning happens
Learning to Predict Survival

$ ./make_titanic_example_data.py
Loading train.csv... done
Writing titanic/train/socioeconomic.csv... done
Writing titanic/train/family.csv... done
Writing titanic/train/vitals.csv... done
Writing titanic/train/misc.csv... done
Writing titanic/train+dev/socioeconomic.csv... done
Writing titanic/train+dev/family.csv... done
Writing titanic/train+dev/vitals.csv... done
Writing titanic/train+dev/misc.csv... done
Writing titanic/dev/socioeconomic.csv... done
Writing titanic/dev/family.csv... done
Writing titanic/dev/vitals.csv... done
Writing titanic/dev/misc.csv... done
Loading test.csv... done
Writing titanic/test/socioeconomic.csv... done
Writing titanic/test/family.csv... done
Writing titanic/test/vitals.csv... done
Writing titanic/test/misc.csv... done
1. Split up given training set: train (80%) and dev (20%)
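The split itself is straightforward; here is a minimal stdlib-only sketch of an 80/20 shuffle-and-slice split (the record fields are placeholders, not the real Titanic columns, and the helper name is made up for illustration):

```python
import random

def split_train_dev(rows, dev_fraction=0.2, seed=0):
    """Shuffle rows, then slice off the last 20% as the dev set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_dev = int(len(rows) * dev_fraction)
    return rows[n_dev:], rows[:n_dev]

# 891 rows, the size of the Kaggle Titanic training set
rows = [{"PassengerId": i} for i in range(891)]
train, dev = split_train_dev(rows)
print(len(train), len(dev))  # 713 178
```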
Learning to Predict Survival
2. Pick classifiers to try:
1. Decision Tree
2. Naive Bayes
3. Random Forest
4. Support Vector Machine (SVM)
Learning to Predict Survival
3. Create configuration file for SKLL

[General]
experiment_name = Titanic_Evaluate_Untuned
task = evaluate

[Input]
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Output]
results = output
models = output

• train_directory: directory with feature files for training learner
• test_directory: directory with feature files for evaluating performance
• family.csv: # of siblings, spouses, parents, children
• misc.csv: departure port
• socioeconomic.csv: fare & passenger class
• vitals.csv: sex & age
• results: directory to store evaluation results
• models: directory to store trained models
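For each feature set, SKLL combines the listed feature files by example ID before training, so each passenger ends up with one merged feature dictionary. A rough stdlib-only sketch of that merge idea (the helper name and toy data are hypothetical, not SKLL's internal code):

```python
def merge_feature_files(feature_dicts):
    """Merge several {example_id: {feature: value}} mappings into one."""
    merged = {}
    for feats in feature_dicts:
        for example_id, fd in feats.items():
            merged.setdefault(example_id, {}).update(fd)
    return merged

family = {"1": {"SibSp": 1}, "2": {"SibSp": 0}}
vitals = {"1": {"Sex": "female", "Age": 35}, "2": {"Sex": "male", "Age": 50}}
merged = merge_feature_files([family, vitals])
print(merged["1"])  # {'SibSp': 1, 'Sex': 'female', 'Age': 35}
```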
Learning to Predict Survival
4. Run the configuration file with run_experiment

$ run_experiment evaluate.cfg
Loading train/family.csv... done
Loading train/misc.csv... done
Loading train/socioeconomic.csv... done
Loading train/vitals.csv... done
Loading dev/family.csv... done
Loading dev/misc.csv... done
Loading dev/socioeconomic.csv... done
Loading dev/vitals.csv... done
...
Learning to Predict Survival
5. Examine results

Experiment Name: Titanic_Evaluate_Untuned
SKLL Version: 1.0.0
Training Set: train (712)
Test Set: dev (179)
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
Learner: RandomForestClassifier
Scikit-learn Version: 0.15.2
Total Time: 0:00:02.065403

+-------+------+------+-----------+--------+-----------+
|       | 0.0  | 1.0  | Precision | Recall | F-measure |
+-------+------+------+-----------+--------+-----------+
| 0.000 | [96] | 19   | 0.865     | 0.835  | 0.850     |
+-------+------+------+-----------+--------+-----------+
| 1.000 | 15   | [49] | 0.721     | 0.766  | 0.742     |
+-------+------+------+-----------+--------+-----------+
(row = reference; column = predicted)
Accuracy = 0.8100558659217877
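The per-class metrics in that report follow directly from the confusion matrix counts, so recomputing them by hand is a quick sanity check:

```python
# Counts from the confusion matrix: (reference, predicted) -> count
cm = {("0", "0"): 96, ("0", "1"): 19,
      ("1", "0"): 15, ("1", "1"): 49}

total = sum(cm.values())  # 179 dev examples
accuracy = (cm[("0", "0")] + cm[("1", "1")]) / total

# Metrics for the "survived" class (label 1)
precision = cm[("1", "1")] / (cm[("0", "1")] + cm[("1", "1")])
recall = cm[("1", "1")] / (cm[("1", "0")] + cm[("1", "1")])
f_measure = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))  # 0.8101
print(round(precision, 3), round(recall, 3), round(f_measure, 3))  # 0.721 0.766 0.742
```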
Aggregate Evaluation Results
Dev. Accuracy Learner
0.8101 RandomForestClassifier
0.7989 DecisionTreeClassifier
0.7709 SVC
0.7095 MultinomialNB
Tuning the learner
Can we do better than default hyperparameters?

[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
Tuned Evaluation Results
Untuned Accuracy Tuned Accuracy Learner
0.8101 0.8380 RandomForestClassifier
0.7989 0.7989 DecisionTreeClassifier
0.7709 0.8156 SVC
0.7095 0.7095 MultinomialNB
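With grid_search = true, every combination in a learner's parameter grid is scored by cross-validation and the best one wins. A toy stdlib-only sketch of the exhaustive search itself (the grid values and the scoring function are invented for illustration; SKLL delegates the real work to scikit-learn's grid search):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in the grid; return the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective standing in for cross-validated accuracy
grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}
best, score = grid_search(grid, lambda p: -abs(p["C"] - 1) - abs(p["gamma"] - 0.1))
print(best)  # {'C': 1, 'gamma': 0.1}
```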
Using All Available Data
Use training and dev to generate predictions on test

[General]
experiment_name = Titanic_Predict
task = predict

[Input]
train_directory = train+dev
test_directory = test
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
Test Set Accuracy

Learner                   Train only           Train + Dev
                          Untuned   Tuned      Untuned   Tuned
RandomForestClassifier    0.727     0.756      0.746     0.780
DecisionTreeClassifier    0.703     0.742      0.670     0.742
SVC                       0.608     0.679      0.612     0.679
MultinomialNB             0.627     0.627      0.622     0.622
Advanced SKLL Features
• Read & write .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, and .tsv data
• Parameter grids for all supported scikit-learn learners
• Custom learners
• Parallelize experiments on DRMAA clusters via GridMap
• Ablation experiments
• Collapse/rename classes from config file
• Feature scaling
• Rescale predictions to be closer to observed data
• Command-line tools for joining, filtering, and converting feature files
• Python API
Currently Supported Learners

Classifiers                      Regressors
Linear Support Vector Machine    Elastic Net
Logistic Regression              Lasso
Multinomial Naive Bayes          Linear

Available as both classifiers and regressors:
AdaBoost
Decision Tree
Gradient Boosting
K-Nearest Neighbors
Random Forest
Stochastic Gradient Descent
Support Vector Machine
Contributors
• Nitin Madnani
• Mike Heilman
• Nils Murrugarra Llerena
• Aoife Cahill
• Diane Napolitano
• Keelan Evanini
• Ben Leong
References
• Dataset: kaggle.com/c/titanic-gettingStarted
• SKLL GitHub: github.com/EducationalTestingService/skll
• SKLL Docs: skll.readthedocs.org
• Titanic configs and data splitting script in examples dir on GitHub
@dsblanch
dan-blanchard
Bonus Slides
SKLL API

from skll import Learner, Reader

# Load training examples
train_examples = Reader.for_path('myexamples.megam').read()

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

# Load test examples and evaluate
test_examples = Reader.for_path('test.tsv').read()
conf_matrix, accuracy, prf_dict, model_params, obj_score = learner.evaluate(test_examples)

# Generate predictions from trained model
predictions = learner.predict(test_examples)

# Perform 10-fold cross-validation with a radial SVM
learner = Learner('SVC')
fold_result_list, grid_search_scores = learner.cross_validate(train_examples)

evaluate() returns:
• conf_matrix: confusion matrix
• accuracy: overall accuracy on test set
• prf_dict: precision, recall, f-score for each class
• model_params: tuned model parameters
• obj_score: objective function score on test set

cross_validate() returns:
• fold_result_list: per-fold evaluation results
• grid_search_scores: per-fold training set objective scores
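Cross-validation splits the training data into k folds, training on k-1 folds and evaluating on the held-out fold each time. A stdlib-only sketch of generating 10 folds of indices (this illustrates the idea, not SKLL's internal implementation):

```python
def kfold_indices(n, k=10):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(kfold_indices(20, k=10))
print(len(folds), folds[0][1])  # 10 [0, 1]
```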
SKLL API

import numpy as np
from os.path import join
from skll import FeatureSet, NDJWriter, Writer

# Create some training examples
labels = []
ids = []
features = []
for i in range(num_train_examples):
    y = "dog" if i % 2 == 0 else "cat"
    labels.append(y)
    ids.append("{}{}".format(y, i))
    features.append({"f1": np.random.randint(1, 4),
                     "f2": np.random.randint(1, 4)})
feat_set = FeatureSet('training', ids, labels=labels, features=features)

# Write them to a file
train_path = join(_my_dir, 'train', 'test_summary.jsonlines')
Writer.for_path(train_path, feat_set).write()
# Or equivalently:
NDJWriter(train_path, feat_set).write()
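The .jsonlines / .ndj format written above is simply one JSON object per line, so it is easy to inspect or generate without SKLL at all. A stdlib-only sketch (the field names here are illustrative; SKLL's writers control the actual schema):

```python
import io
import json

examples = [{"id": "dog0", "y": "dog", "x": {"f1": 2, "f2": 3}},
            {"id": "cat1", "y": "cat", "x": {"f1": 1, "f2": 2}}]

# Write: one JSON object per line
buf = io.StringIO()
for example in examples:
    buf.write(json.dumps(example) + "\n")

# Read it back line by line
buf.seek(0)
loaded = [json.loads(line) for line in buf]
print(loaded[1]["y"])  # cat
```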