Rapid pruning of search space through hierarchical matching

Chandra Mouleeswaran, Machine Learning Scientist, ThreatMetrix Inc.

5/2/13 1

My Background

•  Machine Learning Scientist at ThreatMetrix Inc.
•  Co-Chair, Developer Programs, IntelliFest.org, Oct 2013, San Diego, CA

Career Path
-  Siemens Corporate Research: Learning & Expert Systems
-  Technology division of Donaldson, Lufkin and Jenrette company (Pershing): Artificial Intelligence Group - Network Monitoring
-  Several startups: Classification, Web Crawling, Security, Financial Trading etc.

Outline

•  Task description
•  Approaches
•  Why search paradigm?
•  Hierarchical matching
•  Results
•  Acknowledgments

The Device Identification Task

•  Computationally, it’s a CLASSIFICATION problem:
   { a0, a1, a2, a3, ..., an } → { ci }
   ai = ( attribute | field | key ) value
   ci = ( label | signature | class | hash )

•  Returning devices should be correctly identified within certain tolerances

•  New classes may be created if a good match is not found in the repository of known devices

•  Devices age out, based on data retention policy
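The task statement above can be sketched in a few lines. This is a minimal illustration only: the similarity measure (fraction of equal attribute values), the `threshold` value, and the label naming scheme are assumptions made for the example, not details from the talk.

```python
from typing import Dict, Tuple

Device = Dict[str, str]  # attribute name -> attribute value


def match_fraction(a: Device, b: Device) -> float:
    """Fraction of attributes (over the union of keys) whose values are equal."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    same = sum(1 for k in keys if a.get(k) == b.get(k))
    return same / len(keys)


def identify(device: Device, repository: Dict[str, Device],
             threshold: float = 0.75) -> Tuple[str, bool]:
    """Return (label, is_new). The threshold is a hypothetical tolerance."""
    best_label, best_score = None, 0.0
    for label, known in repository.items():
        score = match_fraction(device, known)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is not None and best_score >= threshold:
        return best_label, False          # returning device
    new_label = f"c{len(repository)}"     # hypothetical naming scheme
    repository[new_label] = dict(device)  # no good match: create a new class
    return new_label, True
```

A linear scan like this is exactly what does not scale to hundreds of millions of devices, which is why the rest of the talk turns to search-based selection.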

Task Challenges

•  Extremely volatile attributes
•  There are no pivot attributes to divide and conquer the search space
•  Changing distributions
•  Emphasis on PRECISION
•  Stringent RESPONSE time

Engineering Challenges

•  Precision (accuracy) and latency (response time) are antagonistic constraints

•  Project management

                  Repository Size (millions)   Load (TPS)   Latency (ms)
Project start     28                           200          < 100
Present           280                          300          < 100
Change            10x                          1.5x         None

Approaches

•  Rules engine
•  Learning models
•  Vector space models

Need an enterprise grade solution!

Rules Engine

•  No experts
•  Number of rules?
•  Maintenance?

Not a viable approach!

Learning Models

•  Most machine learning methods deal predominantly with binary classification problems (e.g. fraud / not fraud) or a small number of target classes

•  Few exemplars for each class
•  Attribute values may be unbounded
•  Attributes may not follow a natural progression

Learning Models …

•  Unsupervised methods such as clustering would produce reasonable models, but not ones accurate enough for practical use. Any simplification process will compromise accuracy

•  Ability to explain is critical
•  Learning models tend to ignore domain knowledge

Challenge in providing an enterprise solution

Thoughts

•  No comparable application with such requirements

•  Build and deploy a classifier that explains itself easily, scales temporally and offers quick response

•  Use domain knowledge to guide verification
•  Improve the classifier through machine learning methods by analyzing performance in the field

Vector-Space Models

•  Similarity-based search makes the vector-space model a good choice for generating selections

•  Given the volatile nature of data, information retrieval (IR) systems can adapt easily

•  Good at neighborhood search

Sensitive to individual attribute changes!

Sources of Inspiration

•  Lucene/Solr features
•  Documentation from (erstwhile) Lucid Imagination
•  Ease with which Lucene/Solr could be installed and explored

Very short learning curve for novices!

Feature Selection

•  Primitive and derived attributes
•  Entropy
•  Distribution
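The slide lists entropy as a selection criterion without saying how it was computed. As a hedged illustration, assuming Shannon entropy over the observed values of one attribute, the following sketch shows the idea: an attribute with a single constant value scores 0 bits and cannot separate devices, while an attribute spread evenly over many values scores high.

```python
import math
from collections import Counter
from typing import Iterable


def shannon_entropy(values: Iterable[str]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Very high entropy is not automatically good: a volatile attribute that changes on every visit also has high entropy, so a measure like this would have to be read together with the attribute's distribution and stability.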

Domain

•  Devices come with structural information but not much grammar or semantics

•  Bag-of-words (single field) approach is fast but not precise

•  Using all fields is precise but response is slow

Now what?

Disjunction Max

•  Matrix of all possible combinations of user input query and document fields

•  Transforms into a Boolean query of DisjunctionMaxQueries, one for each row

•  DisjunctionMaxQuery uses the maximum score of its sub-clauses

•  No single term in user input dominates

This is needed!

Src: SearchHub and LucidWorks
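The scoring rule described above can be sketched as follows. This is a simplified model, not Lucene's actual implementation: real scores also depend on the similarity function and on normalization factors. The `tie` argument mirrors the tie-breaker used by DisMax, which the slide does not mention; the per-field scores are made-up inputs.

```python
from typing import Dict, List


def dismax_score(field_scores: List[float], tie: float = 0.0) -> float:
    """Score of one term across fields: the maximum, plus tie * (sum of the rest)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def query_score(term_field_scores: Dict[str, List[float]], tie: float = 0.0) -> float:
    """Boolean query over terms: add up one DisMax score per term (one matrix row)."""
    return sum(dismax_score(scores, tie) for scores in term_field_scores.values())
```

With `tie = 0.0` only the best field counts for each term, so a term that happens to appear in many fields cannot dominate the total.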

DisMax Experiments (index size = 60 million)

Scenario 1

mm = 2
Solr fields = { a1, a2, a3 }
Values = { phrase1, phrase2, phrase3 }
Must-Match Clauses
Latency: YES (35 ms)
Precision: NO (20% failure)

Scenario 2

mm = 50%
Solr fields = { a1 }
Values = { term1, term2, term3, ..., termn }
Should-Match Clauses
Latency: NO (> 2 seconds)
Precision: YES (> 98%)
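For readers unfamiliar with the parameters, the two scenarios could be expressed as Solr request parameters roughly as below. `defType`, `q`, `qf`, `mm` and `rows` are standard Solr query parameters; the helper name, the `rows` value and the way values are joined into `q` are assumptions for illustration, and the actual queries used in the experiments are not given in the talk.

```python
from typing import Dict, List, Union
from urllib.parse import urlencode


def edismax_params(values: List[str], fields: List[str], mm: str,
                   rows: int = 10) -> Dict[str, Union[str, int]]:
    """Build request parameters for an (e)DisMax search over the given fields."""
    return {
        "defType": "edismax",
        "q": " ".join(values),   # user input: phrases or terms
        "qf": " ".join(fields),  # query fields to match against
        "mm": mm,                # minimum-should-match: a count or a percentage
        "rows": rows,            # hypothetical number of candidates to return
    }


scenario_1 = edismax_params(["phrase1", "phrase2", "phrase3"], ["a1", "a2", "a3"], "2")
scenario_2 = edismax_params(["term1", "term2", "term3"], ["a1"], "50%")
query_string = urlencode(scenario_2)  # ready to append to a select handler URL
```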

Possible Workaround

•  Look-ahead: Customize Lucene/Solr to do a branch-and-bound search, bail out on some lower bound score

•  Minimize candidates for DisMax search
-  reduce total number of Solr instances to search
-  reduce total number of disjunctive terms

[ Empirical estimate: t_n = 2 * t_{n-1}, where t = time and n = number of disjunctive terms ]
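Unrolling the empirical estimate gives t_n = t_1 * 2^(n-1), so each disjunctive term removed halves the estimated time. The sketch below only restates that arithmetic; the base time `t1_ms` is a hypothetical input, and the recurrence is the speaker's empirical observation rather than a general property of Solr.

```python
def estimated_time(t1_ms: float, n_terms: int) -> float:
    """Unrolled recurrence t_n = 2 * t_(n-1): t_n = t_1 * 2 ** (n - 1)."""
    if n_terms < 1:
        raise ValueError("n_terms must be at least 1")
    return t1_ms * 2 ** (n_terms - 1)


def speedup_from_removing(k_terms: int) -> float:
    """Estimated speed-up factor from dropping k disjunctive terms."""
    return 2.0 ** k_terms
```

Under this estimate, going from 11 terms to 8 would cut the query time by a factor of 8, which is why reducing the term count is so attractive.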

Phrases over Terms

•  Used collocation (co-occurrence matrix) to determine most common phrases

•  Delete terms covered by phrases
•  Add stop words based on frequency analysis
•  Ensure precision is preserved through regression tests

Reduced the number of DisMax terms by 30%
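A minimal sketch of the phrase-extraction idea follows. It simplifies the co-occurrence matrix to counts of adjacent term pairs (bigrams), and `min_count` is a hypothetical cut-off; the talk does not describe the actual window size or thresholds.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def common_bigrams(docs: Iterable[List[str]], min_count: int = 2) -> List[Tuple[Pair, int]]:
    """Count adjacent term pairs across documents and keep the frequent ones."""
    counts: Counter = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return [(pair, c) for pair, c in counts.most_common() if c >= min_count]


def replace_with_phrases(tokens: List[str], phrases: Set[Pair]) -> List[str]:
    """Merge each known phrase into one query unit, dropping the terms it covers."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Each merged phrase replaces two disjunctive terms with one, which is how phrase extraction reduces the number of DisMax terms.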

Sources of Inspiration

•  Planning in a Hierarchy of Abstraction Spaces, Artificial Intelligence, Vol. 5, No. 2, pp. 115-135 (1974)

•  Search Reduction in Hierarchical Problem Solving, Proc. of the 9th IJCAI, AAAI Press, Menlo Park, CA (1991)

•  Exceptional Data Quality Using Intelligent Matching and Retrieval, AI Magazine, AAAI Press (Spring 2010)

Hierarchical Matching

[Diagram. Labels: Bag of words; Models; Phrases; Filters; DisMax; Query Formulator; Domain-specific patterns; CSV/JSON; Solr instances selector; To Solr Servers]

Verification

Conflict Resolution

•  Top n candidates are returned from each Solr instance

•  They are ranked by a custom verification module

•  Ties are broken using recency
•  Top candidate is persisted and returned along with custom score
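The conflict-resolution steps above amount to merging the per-instance candidate lists and picking the maximum under a two-part key. The sketch below assumes a hypothetical `Candidate` record with a precomputed `verification_score` and a `last_seen` timestamp; the custom verification module itself is not described in the talk.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Candidate:
    label: str
    verification_score: float  # assumed output of the custom verification module
    last_seen: float           # epoch seconds; larger means more recent


def pick_best(per_instance: Iterable[List[Candidate]]) -> Optional[Candidate]:
    """Merge top-n lists from all Solr instances; rank by score, then recency."""
    merged = [c for candidates in per_instance for c in candidates]
    if not merged:
        return None
    return max(merged, key=lambda c: (c.verification_score, c.last_seen))
```

Returning `None` when no candidate survives is where a caller would create a new device class.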

Comments

•  DisMax performs a multidimensional match
•  Extracted multiple filters and arranged them hierarchically

•  Separation of selection and evaluation
-  Selection = approximate solution
-  Evaluation = refinement
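The separation of selection and evaluation can be sketched as a cascade: cheap, coarse filters prune the candidate set, and only the survivors reach the more expensive scoring step. The two functions used here (`shares_any_value` as a bag-of-words style filter, `field_match_fraction` as the evaluator) are illustrative stand-ins, not the filters used in the production system.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Device = Dict[str, str]
Filter = Callable[[Device, Device], bool]
Evaluator = Callable[[Device, Device], float]


def shares_any_value(query: Device, known: Device) -> bool:
    """Coarse, cheap selection: any attribute value in common, ignoring fields."""
    return bool(set(query.values()) & set(known.values()))


def field_match_fraction(query: Device, known: Device) -> float:
    """Finer evaluation: fraction of query fields matched exactly."""
    if not query:
        return 0.0
    return sum(1 for k, v in query.items() if known.get(k) == v) / len(query)


def hierarchical_match(query: Device, repository: Dict[str, Device],
                       filters: Sequence[Filter], evaluate: Evaluator,
                       top_n: int = 3) -> List[Tuple[str, float]]:
    """Apply filters from coarsest to finest, then score only the survivors."""
    candidates = list(repository.items())
    for keep in filters:
        candidates = [(label, dev) for label, dev in candidates if keep(query, dev)]
        if not candidates:
            break
    scored = sorted(((evaluate(query, dev), label) for label, dev in candidates),
                    reverse=True)
    return [(label, score) for score, label in scored[:top_n]]
```

The ordering matters: each level should be cheaper than the next and should discard as many candidates as possible without dropping the true match.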

Where time went..

•  Attribute selection
•  Ranking
•  Optimization
•  Index re-generation
•  Regression testing

Sources for Tune Up

•  Scaling Solr, Lucene Revolution, May 2011
•  Practical Search with Solr: Beyond just Looking it Up, Lucid Imagination, May 2010

Testing

•  Precision testing using self and mixed modes
•  Latency tests
-  custom harness for stand-alone tests
-  integrated tests with JMeter framework
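A stand-alone latency harness of the kind mentioned above might look like the following sketch. It is an assumption about the general shape of such a tool, not the team's harness: `fn` stands for any callable that issues one search request, and percentiles use the nearest-rank method.

```python
import time
from typing import Any, Callable, Iterable, List, Sequence


def measure(fn: Callable[[Any], Any], requests: Iterable[Any]) -> List[float]:
    """Call fn once per request and record each latency in milliseconds."""
    latencies: List[float] = []
    for request in requests:
        start = time.perf_counter()
        fn(request)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer form of ceil(p * n / 100)
    return ordered[rank - 1]
```

Reporting high percentiles (for example the 95th or 99th) rather than the mean is what makes a requirement such as "< 100 ms" checkable against tail latency.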

Results


Latency Percentiles

[Chart. Series: original edismax (initial solution); Optimization 1: filters, focused search, verification; Optimization 2: domain patterns, stop words, de-dupe]

Response Times over Time


Project Execution

•  Agile Methodology
•  Risk mitigation through primary and contingency plans

•  Rapid prototyping followed by good software engineering practices

•  Evaluating DSE (DataStax) & Solr Cloud

Gleanings

•  You can classify anything with Lucene/Solr; the lexicon is your own

•  The question is not whether Lucene/Solr can solve a particular classification problem, but whether you can prioritize among the many ways of doing it

•  If you run into a problem, someone has solved it or will solve it in the near future

Gleanings …

•  Deal with accuracy before latency
•  If precision, latency and scale are all critical to your domain, expect to invest some time in hierarchical abstractions

•  "Index once, run anytime, anywhere" does not apply during development

•  Throwing all data at Lucene/Solr will not work for mission-critical applications

•  Rapid prototyping and willingness to fail

Summary

Simplify and match at multiple levels of abstraction

Contributors

Chandra Mouleeswaran: Research & Prototyping

Fang Chen: Research & Prototyping

Luke Mertens: Productization & Scalability

Brent Pearson: Release Management

Tracy Hsu: Precision Testing & QA

Srinivas Nayani: Deployment & QA

COMMENTS & FEEDBACK: Chandra Mouleeswaran cmouleeswaran@threatmetrix.com
