Generalized and Heuristic-Free Feature Construction for Improved Accuracy Wei Fan ‡, Erheng Zhong...

  • date post

    22-Dec-2015

Page 1

Generalized and Heuristic-Free Feature Construction

for Improved Accuracy

Wei Fan‡, Erheng Zhong†, Jing Peng*, Olivier Verscheure‡, Kun Zhang§, Jiangtao Ren†,

Rong Yan‡ and Qiang Yang¶

‡ IBM T. J. Watson Research Center
† Sun Yat-Sen University
* Montclair State University
§ Xavier University of Louisiana
Facebook, Inc
¶ Hong Kong University of Science and Technology

• Construction works when the original pool is not good enough (feature selection won’t work)
• Too many choices to construct
• Evaluate on local space, not always on all the data points
• Better automated

Page 2

Feature Construction -- Example

F3: a new feature constructed from F1 and F2

XOR-like problem. Not linearly separable: use both features to construct a “cross” model

Linearly separable: one feature, F3, is enough
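
The XOR-like example can be illustrated with a small sketch. It assumes that F3 is the product of F1 and F2 (the operator is an assumption made here for illustration), and the helper names `make_xor_like` and `best_threshold_accuracy` are hypothetical. On such data a threshold on F1 or F2 alone does little better than chance, while a threshold on F3 separates the classes.

```python
import numpy as np


def make_xor_like(n=200, seed=0):
    """Points in [-1, 1]^2; class 1 when F1 and F2 share a sign (XOR-like)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)
    return X, y


def best_threshold_accuracy(feature, y):
    """Best training accuracy of a one-feature threshold rule, either polarity."""
    best = 0.0
    for t in np.unique(feature):
        pred = (feature >= t).astype(int)
        acc = float(np.mean(pred == y))
        best = max(best, acc, 1.0 - acc)
    return best


X, y = make_xor_like()
f1, f2 = X[:, 0], X[:, 1]
f3 = f1 * f2  # constructed feature (assumed operator: product)

acc_f1 = best_threshold_accuracy(f1, y)
acc_f2 = best_threshold_accuracy(f2, y)
acc_f3 = best_threshold_accuracy(f3, y)
print(acc_f1, acc_f2, acc_f3)
```

A single threshold on F3 is a linear decision rule in the constructed space, which is the sense in which one feature is enough.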

Page 4

Main Challenges

To address these, we have 3 main steps

1. Too many ways to construct new features: xy, x-y, x/y, etc. With 4 binary operators and 1000 original features, there are up to 4 × 10^6 constructed features
• Divide and Conquer

2. Features can be insignificant on the whole data set yet highly discriminative in a local region: F2 is not very useful unless considered with F1
• Local Feature Construction and Evaluation

3. Automated – not based on domain knowledge
• Automatically adjusted weighting rules
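
The size of the search space can be checked with simple arithmetic. The sketch below assumes the count is an upper bound in which every binary operator is applied to every ordered pair of original features; `count_candidates` is a hypothetical helper name.

```python
def count_candidates(n_features, n_binary_ops):
    """Upper bound: each operator applied to each ordered pair of features."""
    return n_binary_ops * n_features * n_features


print(count_candidates(1000, 4))  # 4000000
```

Applying the operators again to the constructed features would grow the pool quadratically once more, which is why exhaustive enumeration quickly becomes impractical.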

Page 5

Divide-Conquer

Local Feature Construction and Evaluation

Stopping Criteria:
1. The number of instances in the node is smaller than a threshold
2. The node only contains examples from one class

Constructed Features (original + new)
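
The two stopping criteria can be sketched as a single predicate. The function name `should_stop` and the default threshold of 10 are assumptions for illustration; the slides do not state a threshold value.

```python
def should_stop(labels, min_instances=10):
    """Return True when a node should not be partitioned further.

    Criterion 1: the node holds fewer instances than the threshold.
    Criterion 2: the node only contains examples from one class.
    """
    if len(labels) < min_instances:
        return True
    return len(set(labels)) <= 1
```

For instance, `should_stop([1] * 50)` is true because the node is pure, and `should_stop([0, 1])` is true because the node is too small.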

Page 6

Every node …

1. Random subset of original features

2. “Weighted random” subset of operators

[Figure: per-node procedure with steps (1) to (4), labelled “F”, “Weighted” and “Weighting Rule”]
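
The two sampling steps at a node can be sketched as follows. This is only an illustration under stated assumptions: the function name `sample_node_candidates`, the subset sizes, and the operator names are hypothetical, and sampling without replacement is one possible reading of "subset".

```python
import numpy as np


def sample_node_candidates(n_features, op_weights, n_feat_sample, n_op_sample, seed=None):
    """Draw a uniform random feature subset and a weighted random operator subset."""
    rng = np.random.default_rng(seed)

    # 1. Random subset of original features (uniform, without replacement).
    feat_idx = rng.choice(n_features, size=n_feat_sample, replace=False)

    # 2. "Weighted random" subset of operators: probability proportional to weight.
    ops = list(op_weights)
    w = np.array([op_weights[o] for o in ops], dtype=float)
    op_idx = rng.choice(len(ops), size=n_op_sample, replace=False, p=w / w.sum())

    return sorted(feat_idx.tolist()), [ops[i] for i in op_idx]


feats, ops = sample_node_candidates(
    n_features=10,
    op_weights={"add": 1.0, "sub": 1.0, "mul": 5.0, "div": 1.0},
    n_feat_sample=4,
    n_op_sample=2,
    seed=0,
)
print(feats, ops)
```

Operators with larger weights are picked more often, but low-weight operators keep a nonzero chance of being tried.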

Page 7

Weighting Rule

• The weight of an operator is proportional to the info-gain of the features constructed by that operator (the sum of its past info gain).
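
A minimal sketch of such a weighting rule is given below. It assumes binary splits, entropy-based information gain, and weights obtained by normalizing each operator's accumulated gain; the exact formula used by the authors is in the paper, and all function names here are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(labels, mask):
    """Information gain of a binary split; mask[i] is True for the left child."""
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)


def update_gain_sums(gain_sums, op, gain):
    """Add the gain of a feature built with `op` to that operator's running sum."""
    gain_sums[op] = gain_sums.get(op, 0.0) + gain


def normalized_weights(gain_sums):
    """Weights proportional to accumulated gain; uniform if nothing was gained yet."""
    total = sum(gain_sums.values())
    if total == 0:
        return {op: 1.0 / len(gain_sums) for op in gain_sums}
    return {op: g / total for op, g in gain_sums.items()}
```

For example, after recording a total gain of 3.0 for a multiplication operator and 1.0 for an addition operator, `normalized_weights` returns 0.75 and 0.25 respectively.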

Page 8

Properties

• Number of features is bounded.

• Highly weighted operator is expected to perform better in its two child nodes (see paper)

• FCTree’s error is bounded.

– also explains why the features are of high quality

Page 9

Experiment – Data Set

• UCI repository (Balanced)

• Caltech-256 database: An image database of 256 object categories. Each category is processed via a 177-dimensional color correlogram (Balanced)

• Landmine collection: Collected via remote sensing techniques (Skewed)

• Nuclear Ban data source: A nuclear explosion detection problem used in the ICDM’08 contest (Skewed)

Page 10

Experiment -- Baseline methods

• Original Features

• TFC:

– enumerates all possible features generated by operators

• NB, SVM and C45

• Operators

• FCTree:
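
Exhaustive enumeration of the kind attributed to TFC can be sketched as below. The operator set (add, sub, mul, div) is an assumption, since the slide does not list the operators; `safe_div` and `enumerate_features` are hypothetical names, and division by zero is mapped to 0 purely to keep the sketch runnable.

```python
import itertools

import numpy as np


def safe_div(a, b):
    """Element-wise a / b, with 0.0 wherever b is zero."""
    return np.divide(a, b, out=np.zeros_like(a), where=(b != 0))


# Assumed operator set, for illustration only.
OPERATORS = {"add": np.add, "sub": np.subtract, "mul": np.multiply, "div": safe_div}


def enumerate_features(X):
    """Apply every operator to every ordered pair of distinct columns of X."""
    X = np.asarray(X, dtype=float)
    names, cols = [], []
    for i, j in itertools.permutations(range(X.shape[1]), 2):
        for name, op in OPERATORS.items():
            names.append(f"{name}(F{i + 1},F{j + 1})")
            cols.append(op(X[:, i], X[:, j]))
    return names, np.column_stack(cols)


names, F = enumerate_features(np.arange(15.0).reshape(5, 3))
print(len(names), F.shape)
```

With d original features and k operators this produces k·d·(d−1) columns, which illustrates why enumeration does not scale to large feature pools.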

Page 11

Performance -- Balanced Data

Best in 23 out of 33 comparisons

Page 12

Performance -- Skewed Data

Best in 25 out of 33 comparisons

Page 13

Scalability Analysis

Page 14

Strength of Weighting Rule

Page 15

[Figure: panels “Original” and “FCTree”; 177-dimensional color correlogram]

Page 16

Conclusion

• Key points
– Divide-conquer to avoid exhaustive enumeration;
– Local feature construction and subspace evaluation
– Weighting-rule-based search: domain-knowledge free, with provable performance.

• Code and data available from the authors
