Classifying Spend Descriptions with off-the-shelf Learning Components
Saikat Mukherjee, Dmitriy Fradkin, Michael Roth
Integrated Data Systems Department, Siemens Corporate Research
Princeton, NJ
2 © Siemens Corporate Research
Spend Analytics
• Organizations are engaged in direct and indirect procurement or spend activities
• Indirect spend is diffuse and cuts across the spectrum of goods and services
- Proliferation of suppliers for similar goods and services
- No purchasing synergies between units
- Failure to stratify procurement
- Inability to reach bargaining deals with suppliers
• Hence, important for organizations to integrate spend activities
• Integrating spend transactions involves associating each transaction to a hierarchy of commodity codes such as ESN, UNSPSC
• Manually associating transactions to commodity codes is not scalable
- Large number of transactions
- Large number of codes in any commodity scheme
• Focus: Automated techniques for spend classification to commodity codes
ESN Commodity Scheme
• ESN is a 3 level hierarchy with increasing specialization down the hierarchy
• Each code represents a particular class of product or service
• In all, 2185 classes across 3 levels
M - Electrical Products
MB – Power Supplies
MBL – UPS Systems (example description: “10 KV UPS Battery”)
Challenge: Automatically classifying transaction descriptions to ESN Code
Challenges in Spend Classification
• Hierarchical text categorization – commodity codes are hierarchical systems (e.g. ESN)
• Sparse text description – most transaction descriptions have less than 5 words
• Erroneous features in description:
- Spelling errors
- Merge errors – different words are joined together into a single word (e.g. “MicrosoftOffice” instead of “Microsoft Office”)
- Split errors – a single word is broken up into multiple words (e.g. “lap top” instead of “laptop”)
• Descriptions in multiple languages – makes it difficult to apply linguistic techniques
• Large volume of transactions:
- could easily be around 0.5 million transactions per month
- makes classifier training computationally challenging
• Periodic retraining of classifiers:
- commodity coding structures undergo revisions
- new samples, with different descriptions, are continually added
Classifier: Feature Representation
• Term vector representation of samples
• 2 representation schemes:
- Binary: a term has weight 1 if it occurs in the sample, else weight 0
- Weighted: a variation of TF-IDF using the hierarchical structure of commodity schemes
weight(f, n) = weight of feature f at node n = (N_fn / N_f) × (1 + log(N_n / N_fn))

N_fn = number of leaf nodes in the sub-tree rooted at n which have at least one positive sample with feature f
N_f = total number of leaf nodes in the entire ESN tree which have at least one positive sample with feature f
N_n = total number of leaf nodes in the sub-trees rooted at n and its sibling nodes

N_fn / N_f → relative importance of the sub-tree at n compared to the rest of the ESN tree
1 + log(N_n / N_fn) → penalizes features which occur in many of the leaf nodes under n (IDF-like within the sibling group)
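The weighting scheme above follows directly from its definition; the following is a minimal, illustrative Python sketch (function and argument names are mine, not the authors' code):

```python
import math

def hierarchical_weight(N_fn, N_f, N_n):
    """Hierarchical TF-IDF variant from the slide.

    N_fn : leaf nodes under n with >=1 positive sample containing feature f
    N_f  : leaf nodes in the whole ESN tree with >=1 such positive sample
    N_n  : leaf nodes under n and its sibling nodes
    """
    if N_fn == 0:
        return 0.0  # feature never occurs under n
    return (N_fn / N_f) * (1.0 + math.log(N_n / N_fn))

# Toy example: the feature occurs in 3 of 10 leaves tree-wide, all 3 of
# them under node n, whose sibling group spans 6 leaves.
w = hierarchical_weight(N_fn=3, N_f=10, N_n=6)
```

The log factor shrinks toward 1 as the feature spreads across more leaves under n, mirroring the IDF intuition within the sibling group.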
Classifier: Methods
(A) Support Vector Machine classifiers (LIBSVM implementation)
- binary classifier at each node
- multiclass approach not feasible due to memory and time constraints
- positive samples = all samples in the sub-tree
- negative samples = all samples in sibling sub-trees
- C value = 1 (default)

• SVM: linear support vector machines with binary features
• SVM-W: linear support vector machines with weighted features
• SVM-WB: linear support vector machines with weighted features and weight balancing

(B) Logistic Regression classifiers
- binary classifier as well as multi-class classifier
- default parameter settings
• BBR-L1: Bayesian binary regression with Laplace prior (lasso regression)
• BBR-L2: Bayesian binary regression with Gaussian prior (ridge regression)
• BBR-L2W: Bayesian binary regression with Gaussian prior and weighted features
• BMR-L1: multi-class version of BBR-L1
• BMR-L2: multi-class version of BBR-L2
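Per-node binary classifiers like those above are typically combined by top-down routing: at each node the sample goes to the child whose classifier scores highest, until a leaf is reached. A hedged sketch with hypothetical stand-in scorers (the actual system trains LIBSVM/BBR models at each node):

```python
def classify_top_down(sample, node, children, scorers):
    """Route a sample down the hierarchy: at each internal node, pick the
    child whose binary (one-vs-siblings) classifier scores highest."""
    while children.get(node):
        node = max(children[node], key=lambda c: scorers[c](sample))
    return node

# Toy hierarchy mirroring the ESN example: M -> MB -> MBL.
# The scorers are illustrative stand-ins for per-node decision values.
children = {"ROOT": ["M", "Q"], "M": ["MB"], "MB": ["MBL"]}
scorers = {
    "M": lambda s: 1.0 if "ups" in s else -1.0,  # Electrical Products
    "Q": lambda s: 0.0,                          # some other first-level code
    "MB": lambda s: 0.5,                         # Power Supplies
    "MBL": lambda s: 0.5,                        # UPS Systems
}
leaf = classify_top_down("10 kv ups battery", "ROOT", children, scorers)
# leaf == "MBL"
```

Greedy routing keeps prediction cost proportional to tree depth rather than to the 2185 codes, which matters at 0.5 million transactions per month.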
Classifier: Data for Experiments
• ESN hierarchy:
- 2185 nodes over all 3 levels
- 17 first-level nodes, 192 second-level nodes, 1976 leaves
• Training set size = 85,953 samples from 418 leaf nodes
• Test set size = 42,742 samples from 380 leaf nodes
• Feature space size from training set = 69,429 terms
• Evaluation: Precision-Recall breakeven (PRB) points at different levels of the tree
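The Precision-Recall breakeven point is the precision at the rank cutoff equal to the number of positives, where precision and recall coincide. A small illustrative implementation (not the authors' evaluation code):

```python
def pr_breakeven(scored_labels):
    """PR breakeven: sort (score, label) pairs by score descending and take
    precision at rank k = number of positives; there precision == recall,
    since both divide the true positives in the top k by the same k."""
    ranked = sorted(scored_labels, key=lambda x: -x[0])
    n_pos = sum(y for _, y in ranked)
    top = ranked[:n_pos]
    return sum(y for _, y in top) / n_pos

# Toy example: 2 positives; the top-2 ranked items contain 1 of them.
beven = pr_breakeven([(0.9, 1), (0.8, 0), (0.7, 1), (0.2, 0)])
# -> 0.5
```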
PRB Results (select first level nodes)
[Bar chart: PRB (0–100) for first-level nodes N (12897), Q (12232), M (5178), O (4893), F (4519), J (2842), A (76); methods: BBR-L2, BBR-L1, SVM, BBR-L2W, SVM-W, SVM-WB]
BBR-L1 performs best among the different classification methods at the top level
PRB Results (select second level nodes)
[Bar chart: PRB (0–100) for second-level nodes NO (5685), NN (2915), FO (2655), OC (1801), FP (1711), OE (875), ON (428); methods: BBR-L2, BBR-L1, SVM, BBR-L2W, SVM-W, SVM-WB]
BBR-L1 and BBR-L2W are competitive among the different classification methods at the second level
Overall accuracy at different levels
[Bar chart: overall accuracy (0–100) at the Top, Second, and Leaf levels; methods: BBR-L2, BBR-L1, SVM-W, SVM-WB, BMR-L2, BMR-L1]
BMR-L1 turns out to be the best classifier in overall leaf level accuracy
Feature Correction
Correct typos, merge, and split errors
- Noisy channel model: P(O, C) = P(O|C) × P(C)
The intended sequence of characters C is generated with probability P(C) and, due to a noisy channel, converted into the observed sequence of characters O with probability P(O|C)
- Source model P(C): smoothed frequency counting of 5-gram character sequences; the CMU-Cambridge Language Modeling toolkit was used to create the 5-gram models
- Channel model P(O|C): semi-automatically created training set
- Typos: wi and wj are paired if both start with the same character and their normalized edit distance is less than 0.8
- Split errors: a bi-gram (wi, wj) and a term wk such that wk is the concatenation of wi and wj and occurs 10 times more often than the bi-gram
- Merge errors: split a word at all character positions and check if the resulting bi-gram occurs more frequently than the original word
- These candidate training examples were manually verified: 717 unique training pairs, 159 unique test cases
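The split-error heuristic above (a concatenation occurring at least 10 times more often than its bi-gram) is straightforward to sketch; the following is an illustrative reconstruction, not the original tooling:

```python
from collections import Counter

def split_error_candidates(tokens, min_ratio=10):
    """Find bi-grams (wi, wj) whose concatenation wk occurs at least
    min_ratio times more often than the bi-gram itself -- likely split
    errors such as "lap top" for "laptop"."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    cands = []
    for (wi, wj), count in bigrams.items():
        wk = wi + wj
        if unigrams[wk] >= min_ratio * count:
            cands.append(((wi, wj), wk))
    return cands

# Toy corpus: "lap top" appears once, "laptop" 20 times.
corpus = ["lap", "top"] + ["laptop"] * 20
cands = split_error_candidates(corpus)
# -> [(("lap", "top"), "laptop")]
```

The mirror-image merge-error check would split each word at every character position and compare the frequency of the resulting bi-gram with that of the original word.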
Feature Correction: Results
[Bar chart: Precision and Recall (0–100) for the three correction settings 1C-T, 1C-S, 2C-T]
1C-T = 1-character corrections considering each word and bi-gram in the test sample separately
1C-S = 1-character corrections considering the whole test sample
2C-T = 2-character corrections considering each word and bi-gram in the test sample separately
Testing: given a test sample, generate all possible 1-character correction variations, score each variation using the source model P(C) over the 5-grams in the sample, and combine this score with the channel model P(O|C).
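Generating all 1-character correction variations, as in the testing step above, can be sketched with standard edit operations; inserting or deleting a space also yields merge/split candidates (illustrative code, not the original system):

```python
import string

def one_edit_variants(word, alphabet=string.ascii_lowercase + " "):
    """All 1-character corrections of `word`: deletions, substitutions,
    and insertions. Because the alphabet includes the space character,
    deleting a space repairs split errors ("lap top" -> "laptop") and
    inserting one repairs merge errors ("MicrosoftOffice")."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | subs | inserts) - {word}

variants = one_edit_variants("lap top")
# "laptop" (space deletion) is among the candidates
```

Each candidate would then be scored with the 5-gram source model P(C) and the channel model P(O|C); 2-character corrections (2C-T) simply apply this generator twice.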
System
- Java-based system for classifying spend descriptions to ESN codes
- Users can browse training samples and ESN descriptions
- Backend Derby database which stores the training data
- Ability to train hierarchical SVM and BMR classifiers through the system
- Test in batch mode by loading a database of samples
- Corrected results from users can be fed back to the system as training data
Related Work
There are many formulations for combining binary classifiers into a multiclass classifier: error-correcting codes [Dietterich & Bakiri ’95], one-vs-one and one-vs-all [Friedman ’96; Weston & Watkins ’98; Crammer & Singer ’00].
There are several approaches to hierarchical classification:
1. build a single “flat” classifier that maps an instance directly to a leaf; or
2. build a hierarchical classifier by constructing a model at every node; or
3. use a method developed specifically for hierarchical classification.
The 1st approach ignores the hierarchical structure of the data and usually leads to worse results than the 2nd [Dumais & Chen ‘00; Koller & Sahami ‘97].
Some recent work [Cai & Hofmann ’04; Rousu et al ’05] focused on the 3rd approach. Both involve new formulations of SVM that take the hierarchical structure of the classes into account. While the results are encouraging, usable implementations are currently not available.
[Brill & Moore ’00; Kolak & Resnik ‘05] have previously explored the noisy channel framework in computational linguistics.
Discussion
We have described how off-the-shelf learning tools can be applied to automated spend classification:
• experimental results with SVM and BMR classifiers;
• a noisy channel framework for feature correction.
Incremental algorithms are the only way to reliably handle frequent retraining and increasingly large datasets:
• for SVM: [Cauwenberghs & Poggio ’00; Laskov et al. ’06]
• for BBR: [Balakrishnan & Madigan ’08].
However, their accuracy tends to be lower than that of batch methods, and off-the-shelf implementations are not readily available.
Improvements in accuracy can be achieved by careful selection of classifier parameters, but such tuning is only practical if the classifiers are extremely fast and scalable.
Additional information, such as supplier names and purchase volumes, could be used when available, especially in standardized forms such as Dun & Bradstreet codes, which could then be mapped to product types and commodity codes.
Our feature correction techniques are currently language-agnostic; they could be improved if transactions were geographically localized and linguistic cues from the corresponding languages incorporated.