Classifying Spend Descriptions with off-the-shelf Learning Components
Saikat Mukherjee, Dmitriy Fradkin, Michael Roth
Integrated Data Systems Department, Siemens Corporate Research
Princeton, NJ
2 © Siemens Corporate Research
Spend Analytics
• Organizations are engaged in direct and indirect procurement or spend activities
• Indirect spend is diffuse and cuts across the spectrum of goods and services
- Proliferation of suppliers for similar goods and services
- No purchasing synergies between units
- Failure to stratify procurement
- Inability to reach bargaining deals with suppliers
• Hence, important for organizations to integrate spend activities
• Integrating spend transactions involves associating each transaction to a hierarchy of commodity codes such as ESN, UNSPSC
• Manually associating transactions to commodity codes is not scalable
- Large number of transactions
- Large number of codes in any commodity scheme
• Focus: Automated techniques for spend classification to commodity codes
ESN Commodity Scheme
• ESN is a 3 level hierarchy with increasing specialization down the hierarchy
• Each code represents a particular class of product or service
• In all, 2185 classes across 3 levels
M - Electrical Products
MB – Power Supplies
MBL – UPS Systems (example description: “10 KV UPS Battery”)
Challenge: Automatically classifying transaction descriptions to ESN Code
Challenges in Spend Classification
• Hierarchical text categorization – commodity codes are hierarchical systems (e.g. ESN)
• Sparse text description – most transaction descriptions have less than 5 words
• Erroneous features in description:
- Spelling errors
- Merge errors – different words are joined together into a single word (e.g. “MicrosoftOffice” instead of “Microsoft Office”)
- Split errors – a single word is broken up into multiple words (e.g. “lap top” instead of “laptop”)
• Descriptions in multiple languages – makes it difficult to apply linguistic techniques
• Large volume of transactions:
- could easily be around 0.5 million transactions per month
- makes classifier training computationally challenging
• Periodic retraining of classifiers:
- commodity coding structures undergo revisions
- new samples, with different descriptions, are continually added
Classifier: Feature Representation
• Term vector representation of samples
• 2 representation schemes:
- Binary: a term has weight 1 if it occurs in the sample, else weight 0
- Weighted: a variation of TF-IDF using the hierarchical structure of commodity schemes
weight(f, n) = weight of feature f at node n = (N_fn / N_f) × (1 + log(N_n / N_fn))

N_fn = number of leaf nodes in the sub-tree rooted at n which have at least one positive sample with feature f
N_f = total number of leaf nodes in the entire ESN tree which have at least one positive sample with feature f
N_n = total number of leaf nodes in the sub-trees rooted at n and its sibling nodes

N_fn / N_f → relative importance of the sub-tree at n compared to the rest of the ESN tree
1 + log(N_n / N_fn) → penalizes features which occur in many of the leaf nodes under n (IDF-like within the sibling group)
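The weighting scheme above follows directly from its definition; the following is a minimal, illustrative Python sketch (function and argument names are mine, not the authors' code):

```python
import math

def hierarchical_weight(N_fn, N_f, N_n):
    """Hierarchical TF-IDF variant from the slide.

    N_fn : leaf nodes under n with >=1 positive sample containing feature f
    N_f  : leaf nodes in the whole ESN tree with >=1 such positive sample
    N_n  : leaf nodes under n and its sibling nodes
    """
    if N_fn == 0:
        return 0.0  # feature never occurs under n
    return (N_fn / N_f) * (1.0 + math.log(N_n / N_fn))

# Toy example: the feature occurs in 3 of 10 leaves tree-wide, all 3 of
# them under node n, whose sibling group spans 6 leaves.
w = hierarchical_weight(N_fn=3, N_f=10, N_n=6)
```

The log factor shrinks toward 1 as the feature spreads across more leaves under n, mirroring the IDF intuition within the sibling group.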
Classifier: Methods
(A) Support Vector Machine classifiers (LIBSVM implementation)
- binary classifier at each node
- multiclass approach not feasible due to memory and time constraints
- positive samples = all samples in the sub-tree
- negative samples = all samples in sibling sub-trees
- C value = 1 (default)

• SVM: linear support vector machines with binary features
• SVM-W: linear support vector machines with weighted features
• SVM-WB: linear support vector machines with weighted features and weight balancing

(B) Logistic Regression classifiers
- binary classifier as well as multi-class classifier
- default parameter settings
• BBR-L1: Bayesian binary regression with Laplace prior (lasso regression)
• BBR-L2: Bayesian binary regression with Gaussian prior (ridge regression)
• BBR-L2W: Bayesian binary regression with Gaussian prior and weighted features
• BMR-L1: multi-class version of BBR-L1
• BMR-L2: multi-class version of BBR-L2
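Per-node binary classifiers like those above are typically combined by top-down routing: at each node the sample goes to the child whose classifier scores highest, until a leaf is reached. A hedged sketch with hypothetical stand-in scorers (the actual system trains LIBSVM/BBR models at each node):

```python
def classify_top_down(sample, node, children, scorers):
    """Route a sample down the hierarchy: at each internal node, pick the
    child whose binary (one-vs-siblings) classifier scores highest."""
    while children.get(node):
        node = max(children[node], key=lambda c: scorers[c](sample))
    return node

# Toy hierarchy mirroring the ESN example: M -> MB -> MBL.
# The scorers are illustrative stand-ins for per-node decision values.
children = {"ROOT": ["M", "Q"], "M": ["MB"], "MB": ["MBL"]}
scorers = {
    "M": lambda s: 1.0 if "ups" in s else -1.0,  # Electrical Products
    "Q": lambda s: 0.0,                          # some other first-level code
    "MB": lambda s: 0.5,                         # Power Supplies
    "MBL": lambda s: 0.5,                        # UPS Systems
}
leaf = classify_top_down("10 kv ups battery", "ROOT", children, scorers)
# leaf == "MBL"
```

Greedy routing keeps prediction cost proportional to tree depth rather than to the 2185 codes, which matters at 0.5 million transactions per month.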
Classifier: Data for Experiments
• ESN hierarchy:
- 2185 nodes over all 3 levels
- 17 first-level nodes, 192 second-level nodes, 1976 leaves
• Training set size = 85,953 samples from 418 leaf nodes
• Test set size = 42,742 samples from 380 leaf nodes
• Feature space size from training set = 69,429 terms
• Evaluation: Precision-Recall breakeven (PRB) points at different levels of the tree
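The Precision-Recall breakeven point is the precision at the rank cutoff equal to the number of positives, where precision and recall coincide. A small illustrative implementation (not the authors' evaluation code):

```python
def pr_breakeven(scored_labels):
    """PR breakeven: sort (score, label) pairs by score descending and take
    precision at rank k = number of positives; there precision == recall,
    since both divide the true positives in the top k by the same k."""
    ranked = sorted(scored_labels, key=lambda x: -x[0])
    n_pos = sum(y for _, y in ranked)
    top = ranked[:n_pos]
    return sum(y for _, y in top) / n_pos

# Toy example: 2 positives; the top-2 ranked items contain 1 of them.
beven = pr_breakeven([(0.9, 1), (0.8, 0), (0.7, 1), (0.2, 0)])
# -> 0.5
```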
PRB Results (select first level nodes)
[Bar chart: PRB (0–100) for first-level nodes N (12897), Q (12232), M (5178), O (4893), F (4519), J (2842), A (76); methods: BBR-L2, BBR-L1, SVM, BBR-L2W, SVM-W, SVM-WB]
BBR-L1 performs best among the different classification methods at the top level
PRB Results (select second level nodes)
[Bar chart: PRB (0–100) for second-level nodes NO (5685), NN (2915), FO (2655), OC (1801), FP (1711), OE (875), ON (428); methods: BBR-L2, BBR-L1, SVM, BBR-L2W, SVM-W, SVM-WB]
BBR-L1 and BBR-L2W are competitive among the different classification methods at the second level
Overall accuracy at different levels
[Bar chart: overall accuracy (0–100) at the Top, Second, and Leaf levels; methods: BBR-L2, BBR-L1, SVM-W, SVM-WB, BMR-L2, BMR-L1]
BMR-L1 turns out to be the best classifier in overall leaf level accuracy
Feature Correction
Correct typos, merge, and split errors
- Noisy channel model: P(O, C) = P(O|C) × P(C)
The intended sequence of characters C is generated with probability P(C) and, due to a noisy channel, converted into the observed sequence of characters O with probability P(O|C)
- Source model P(C): smoothed frequency counting of 5-gram character sequences; the CMU-Cambridge Language Modeling toolkit was used to create the 5-gram models
- Channel model P(O|C): semi-automatically created training set
- Typos: wi and wj are paired if both start with the same character and their normalized edit distance is less than 0.8
- Split errors: a bi-gram (wi, wj) and a term wk such that wk is the concatenation of wi and wj and occurs 10 times more often than the bi-gram
- Merge errors: split a word at all character positions and check if the resulting bi-gram occurs more frequently than the original word
- These candidate training examples were manually verified: 717 unique training pairs, 159 unique test cases
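The split-error heuristic above (a concatenation occurring at least 10 times more often than its bi-gram) is straightforward to sketch; the following is an illustrative reconstruction, not the original tooling:

```python
from collections import Counter

def split_error_candidates(tokens, min_ratio=10):
    """Find bi-grams (wi, wj) whose concatenation wk occurs at least
    min_ratio times more often than the bi-gram itself -- likely split
    errors such as "lap top" for "laptop"."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    cands = []
    for (wi, wj), count in bigrams.items():
        wk = wi + wj
        if unigrams[wk] >= min_ratio * count:
            cands.append(((wi, wj), wk))
    return cands

# Toy corpus: "lap top" appears once, "laptop" 20 times.
corpus = ["lap", "top"] + ["laptop"] * 20
cands = split_error_candidates(corpus)
# -> [(("lap", "top"), "laptop")]
```

The mirror-image merge-error check would split each word at every character position and compare the frequency of the resulting bi-gram with that of the original word.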
Feature Correction: Results
[Bar chart: Precision and Recall (0–100) for the three correction settings 1C-T, 1C-S, 2C-T]
1C-T = 1-character corrections considering each word and bi-gram in the test sample separately
1C-S = 1-character corrections considering the whole test sample
2C-T = 2-character corrections considering each word and bi-gram in the test sample separately
Testing: given a test sample, generate all possible 1-character correction variations, score each variation using the source model P(C) over the 5-grams in the sample, and combine this score with the channel model P(O|C).
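Generating all 1-character correction variations, as in the testing step above, can be sketched with standard edit operations; inserting or deleting a space also yields merge/split candidates (illustrative code, not the original system):

```python
import string

def one_edit_variants(word, alphabet=string.ascii_lowercase + " "):
    """All 1-character corrections of `word`: deletions, substitutions,
    and insertions. Because the alphabet includes the space character,
    deleting a space repairs split errors ("lap top" -> "laptop") and
    inserting one repairs merge errors ("MicrosoftOffice")."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | subs | inserts) - {word}

variants = one_edit_variants("lap top")
# "laptop" (space deletion) is among the candidates
```

Each candidate would then be scored with the 5-gram source model P(C) and the channel model P(O|C); 2-character corrections (2C-T) simply apply this generator twice.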
System
- Java-based system for classifying spend descriptions to ESN codes
- Users can browse training samples and ESN descriptions
- Backend Derby database which stores the training data
- Ability to train hierarchical SVM and BMR classifiers through the system
- Test in batch mode by loading a database of samples
- Corrected results from users can be fed back to the system as training data
Related Work
There are many formulations for combining binary classifiers into a multiclass classifier: error-correcting codes [Dietterich & Bakiri ’95], one-vs-one and one-vs-all [Friedman ’96; Weston & Watkins ’98; Crammer & Singer ’00].
There are several approaches to hierarchical classification:
1. build a single “flat” classifier that maps an instance directly to a leaf; or
2. build a hierarchical classifier by constructing a model at every node; or
3. use a method developed specifically for hierarchical classification.
The 1st approach ignores the hierarchical structure of the data and usually leads to worse results than the 2nd [Dumais & Chen ‘00; Koller & Sahami ‘97].
Some recent work [Cai & Hofmann ’04; Rousu et al ’05] focused on the 3rd approach. Both involve new formulations of SVM that take the hierarchical structure of the classes into account. While the results are encouraging, usable implementations are currently not available.
[Brill & Moore ’00; Kolak & Resnik ‘05] have previously explored the noisy channel framework in computational linguistics.
Discussion
We have described how off-the-shelf learning tools can be applied to automated spend classification:
• experimental results with SVM and BMR classifiers;
• a noisy channel framework for feature correction.
Incremental algorithms are the only way to reliably handle frequent retraining and increasingly large datasets:
• for SVM: [Cauwenberghs & Poggio ’00; Laskov et al. ’06]
• for BBR: [Balakrishnan & Madigan ’08].
However, their accuracy tends to be lower than that of batch methods, and off-the-shelf implementations are not readily available.
Improvements in accuracy can be achieved by careful selection of classifier parameters, but such tuning is only practical if the classifiers are extremely fast and scalable.
Additional information, such as supplier names and purchase volumes, could be used when available, especially in standardized forms such as Dun & Bradstreet codes, which could then be mapped to product types and commodity codes.
Our feature correction techniques are currently language-agnostic; they could be improved if transactions were geographically localized and linguistic cues from the corresponding languages incorporated.