Rapid pruning of search space through hierarchical matching

RAPID PRUNING OF SEARCH SPACE THROUGH HIERARCHICAL MATCHING Chandra Mouleeswaran Machine Learning Scientist, ThreatMetrix Inc. 5/2/13 1

description

Presented by Chandra Mouleeswaran, Co-Chair at IntelliFest.org, ThreatMetrix. This talk presents our experience using Lucene/Solr for the classification of user and device data. ThreatMetrix, Inc. handles a huge volume of volatile data every day. The primary challenge is to classify each incoming transaction rapidly and precisely by searching a huge index within a very strict latency specification. The audience will be taken through the design choices and the lessons learned, including details of a hierarchical search procedure that systematically divides the search space into manageable partitions while maintaining precision.

Transcript of Rapid pruning of search space through hierarchical matching

Page 1: Rapid pruning of search space through hierarchical matching

RAPID PRUNING OF SEARCH SPACE THROUGH HIERARCHICAL MATCHING Chandra Mouleeswaran Machine Learning Scientist, ThreatMetrix Inc.


Page 2: Rapid pruning of search space through hierarchical matching

My Background

• Machine Learning Scientist at ThreatMetrix Inc.
• Co-Chair, Developer Programs, IntelliFest.org, Oct 2013, San Diego, CA

Career Path
- Siemens Corporate Research: Learning & Expert Systems
- Technology division of Donaldson, Lufkin and Jenrette company (Pershing): Artificial Intelligence Group
- Network Monitoring
- Several startups: Classification, Web Crawling, Security, Financial Trading etc.

Page 3: Rapid pruning of search space through hierarchical matching

Outline

• Task description
• Approaches
• Why search paradigm?
• Hierarchical matching
• Results
• Acknowledgments

Page 4: Rapid pruning of search space through hierarchical matching

The Device Identification Task

• Computationally, it's a CLASSIFICATION problem: { a0, a1, a2, a3, …, an } → { ci }
  ai = ( attribute | field | key ) value
  ci = ( label | signature | class | hash )
• Returning devices should be correctly identified within certain tolerances
• New classes may be created if a good match is not found in the repository of known devices
• Devices age out, based on data retention policy
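A minimal sketch of the task as stated on this slide. The similarity measure (fraction of equal attribute values), the tolerance value, and the label format are assumptions made for illustration; the talk does not specify them. The linear scan over all known devices is exactly the part that does not scale and that the rest of the talk replaces with index search.

```python
from itertools import count


def similarity(a, b):
    """Fraction of attributes (over the union of keys) whose values are equal."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    return sum(1 for k in keys if a.get(k) == b.get(k)) / len(keys)


class DeviceRepository:
    def __init__(self, tolerance=0.75):
        self.tolerance = tolerance   # hypothetical threshold, not from the talk
        self.devices = {}            # label ci -> attribute dict {ai: value}
        self._ids = count(1)

    def classify(self, attrs):
        """Return (label, is_new). A new class is created when no good match exists."""
        best_label, best_score = None, 0.0
        for label, known in self.devices.items():
            score = similarity(attrs, known)
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None and best_score >= self.tolerance:
            return best_label, False
        label = f"c{next(self._ids)}"
        self.devices[label] = dict(attrs)
        return label, True
```

A returning device with one changed attribute out of four still maps to its existing label at this tolerance, while an unrelated device gets a new label.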

Page 5: Rapid pruning of search space through hierarchical matching

Task Challenges

• Extremely volatile attributes
• There are no pivot attributes to divide and conquer the search space
• Changing distributions
• Emphasis on PRECISION
• Stringent RESPONSE time

Page 6: Rapid pruning of search space through hierarchical matching

Engineering Challenges

• Precision (accuracy) and latency (response time) are antagonistic constraints
• Project management

                 Repository Size (millions)   Load (TPS)   Latency (ms)
Project start    28                           200          < 100
Present          280                          300          < 100
Change           10 X                         1.5 X        None

Page 7: Rapid pruning of search space through hierarchical matching

Approaches

• Rules engine
• Learning models
• Vector space models

Need an enterprise grade solution!

Page 8: Rapid pruning of search space through hierarchical matching

Rules Engine

• No experts
• Number of rules?
• Maintenance?

Not a viable approach!

Page 9: Rapid pruning of search space through hierarchical matching

Learning Models

• Most machine learning methods deal predominantly with binary classification problems (e.g. fraud / not fraud) or a small number of target classes
• Few exemplars for each class
• Attribute values may be unbounded
• Attributes may not follow a natural progression

Page 10: Rapid pruning of search space through hierarchical matching

Learning Models …

• Unsupervised learning such as clustering methods would make good models, but not good enough to be of practical use. Any simplification process will compromise on accuracy
• Ability to explain is critical
• Tend to ignore domain knowledge

Challenge in providing enterprise solution

Page 11: Rapid pruning of search space through hierarchical matching

Thoughts

• No comparable application with such requirements
• Build and deploy a classifier that explains itself easily, scales temporally and offers quick response
• Use domain knowledge to guide verification
• Improve the classifier through machine learning methods by analyzing performance in the field

Page 12: Rapid pruning of search space through hierarchical matching

Vector-Space Models

• Similarity-based search makes the vector-space model a good choice for generating selections
• Given the volatile nature of data, information retrieval (IR) systems can adapt easily
• Good at neighborhood search

Sensitive to individual attribute changes!

Page 13: Rapid pruning of search space through hierarchical matching

Sources of Inspiration

• Lucene/Solr features
• Documentation from (erstwhile) Lucid Imagination
• Ease with which Lucene/Solr could be installed and explored

Very short learning curve for novices!

Page 14: Rapid pruning of search space through hierarchical matching

Feature Selection

• Primitive and derived attributes
• Entropy
• Distribution

Page 15: Rapid pruning of search space through hierarchical matching

Domain

• Devices come with structural information but not much grammar or semantics
• Bag-of-words (single field) approach is fast but not precise
• Using all fields is precise but response is slow

Now what?

Page 16: Rapid pruning of search space through hierarchical matching

Disjunction Max

• Matrix of all possible combinations of user input query and document fields
• Transforms into a Boolean query of DisjunctionMaxQueries of each row
• The maximum score of the sub-clauses is used by DisjunctionMaxQuery
• No single term in user input dominates

This is needed!

Src: SearchHub and LucidWorks
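The scoring rule described on this slide can be sketched as follows. This is a simplified illustration, not Lucene code: in Lucene, a DisjunctionMaxQuery scores a document as the maximum sub-clause score plus a tie-breaker multiplied by the sum of the other sub-clause scores. Summing one such score per query term stands in for the enclosing Boolean query and ignores normalization; the function names and the numbers in the usage note are made up.

```python
def dismax_score(field_scores, tie=0.0):
    """Score for one query term: the best field wins, the others add tie * score."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def query_score(per_term_field_scores, tie=0.0):
    """Boolean combination: add up one DisMax score per query term (matrix row)."""
    return sum(dismax_score(row, tie) for row in per_term_field_scores)
```

With `tie=0.0`, a term that matches in three fields with scores 1.0, 3.0 and 2.0 contributes 3.0 rather than 6.0, which is why no single term matching in many fields can dominate the total.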

Page 17: Rapid pruning of search space through hierarchical matching

DisMax Experiments (index size = 60 Million)

Scenario 1
mm = 2
Solr fields = { a1, a2, a3 }
Values = { phrase1, phrase2, phrase3 }
Must-Match Clauses
Latency: YES (35 ms)
Precision: NO (20% failure)

Scenario 2
mm = 50%
Solr fields = { a1 }
Values = { term1, term2, term3, …, termn }
Should-Match Clauses
Latency: NO (> 2 seconds)
Precision: YES (> 98%)
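For orientation, the two scenarios might translate into Solr requests along these lines. `defType`, `q`, `qf`, `mm` and `rows` are standard DisMax request parameters; the host, the core name `devices`, the helper `dismax_url` and the sample values are placeholders, and the talk does not show the actual requests.

```python
from urllib.parse import urlencode


def dismax_url(base, values, fields, mm, as_phrases=False, rows=10):
    """Build a DisMax select URL; phrase values are wrapped in double quotes."""
    q = " ".join(f'"{v}"' for v in values) if as_phrases else " ".join(values)
    params = {
        "defType": "dismax",
        "q": q,
        "qf": " ".join(fields),   # fields to search
        "mm": mm,                 # minimum-should-match
        "rows": rows,
    }
    return f"{base}/select?{urlencode(params)}"


BASE = "http://localhost:8983/solr/devices"   # hypothetical endpoint

# Scenario 1: three phrases over three fields, at least two clauses must match.
url1 = dismax_url(BASE, ["phrase one", "phrase two", "phrase three"],
                  ["a1", "a2", "a3"], mm="2", as_phrases=True)

# Scenario 2: many terms over a single field, half of them must match.
url2 = dismax_url(BASE, ["term1", "term2", "term3"], ["a1"], mm="50%")
```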

Page 18: Rapid pruning of search space through hierarchical matching

Possible Workaround

• Look-ahead: Customize Lucene/Solr to do a branch-and-bound search, bail out on some lower bound score
• Minimize candidates for DisMax search
  - reduce total number of Solr instances to search
  - reduce total number of disjunctive terms

[ Empirical estimate: t_n = 2 * t_{n-1}, where t = time and n = number of disjunctive terms ]
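Taken at face value, the empirical estimate on this slide means latency grows exponentially with the number of disjunctive terms: unrolling t_n = 2 * t_{n-1} gives t_n = 2^(n-k) * t_k. A small sketch of that arithmetic follows; the term counts used with it are illustrative and not measurements from the talk.

```python
def relative_latency(n_terms, baseline_terms):
    """Ratio t_n / t_baseline implied by the doubling estimate t_n = 2 * t_(n-1)."""
    return 2.0 ** (n_terms - baseline_terms)
```

Under this estimate, dropping three terms from a query cuts its latency to `relative_latency(7, 10)`, or one eighth of the original, which is why reducing the number of disjunctive terms is attractive.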

Page 19: Rapid pruning of search space through hierarchical matching

Phrases over Terms

• Used collocation (co-occurrence matrix) to determine most common phrases
• Delete terms covered by phrases
• Add stop words based on frequency analysis
• Ensure precision is preserved through regression tests

Reduced the number of DisMax terms by 30%
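A minimal sketch of the phrase-extraction idea on this slide, assuming that co-occurrence means adjacent term pairs and that a fixed count threshold decides what is common. Both are simplifications chosen for illustration; the talk does not describe the actual collocation procedure.

```python
from collections import Counter


def common_phrases(records, min_count=2):
    """Count adjacent term pairs across all records and keep the frequent ones."""
    pairs = Counter()
    for terms in records:
        pairs.update(zip(terms, terms[1:]))
    return {pair for pair, n in pairs.items() if n >= min_count}


def collapse(terms, phrases):
    """Replace each known pair with one phrase token, dropping the covered terms."""
    out, i = [], 0
    while i < len(terms):
        if i + 1 < len(terms) and (terms[i], terms[i + 1]) in phrases:
            out.append(f"{terms[i]} {terms[i + 1]}")
            i += 2
        else:
            out.append(terms[i])
            i += 1
    return out
```

Each collapsed pair removes one disjunctive term from the query, which is the mechanism behind the reduction in DisMax terms reported on the slide.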

Page 20: Rapid pruning of search space through hierarchical matching

Sources of Inspiration

• Planning in a Hierarchy of Abstraction Spaces, Artificial Intelligence, Vol. 5, No. 2, pp. 115-135 (1974)
• Search Reduction in Hierarchical Problem Solving, Proc. of the 9th IJCAI, AAAI Press, Menlo Park, CA (1991)
• Exceptional Data Quality Using Intelligent Matching and Retrieval, AI Magazine, AAAI Press (Spring 2010)

Page 21: Rapid pruning of search space through hierarchical matching

Hierarchical Matching

[Diagram. Component labels: Bag of words; Models; Phrases; Filters; DisMax; Query Formulator; Domain-specific patterns; CSV/JSON; Solr instances selector; To Solr Servers; Verification]

Page 22: Rapid pruning of search space through hierarchical matching

Conflict Resolution

• Top n candidates are returned from each Solr instance
• They are ranked based on custom verification module
• Ties are broken using recency
• Top candidate is persisted and returned along with custom score
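The merge-and-rank step on this slide can be sketched as below. The `Candidate` fields and the `resolve` function are hypothetical names; the verification score is assumed to be computed elsewhere, and recency is assumed to be a timestamp where larger means more recent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str
    verification_score: float   # produced by a custom verification module (not shown)
    last_seen: int              # assumed timestamp; larger means more recent


def resolve(per_instance_results):
    """Merge the top-n lists from every Solr instance and pick a single winner."""
    merged = [c for results in per_instance_results for c in results]
    if not merged:
        return None
    # Rank by verification score first; recency only breaks ties.
    return max(merged, key=lambda c: (c.verification_score, c.last_seen))
```

Comparing `(verification_score, last_seen)` tuples keeps the tie-break rule in one place: recency matters only when the scores are equal.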

Page 23: Rapid pruning of search space through hierarchical matching

Comments

• DisMax performs a multidimensional match
• Extracted multiple filters and arranged them hierarchically
• Separation of selection and evaluation
  - Selection = approximate solution
  - Evaluation = refinement
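One way to picture the separation of selection and evaluation is a cascade of cheap filters followed by a more careful scoring pass over the survivors. The filters and the scoring function in this sketch are placeholders chosen for illustration; the talk does not say which filters were extracted or how verification scores candidates.

```python
def hierarchical_match(query, candidates, filters, evaluate, top_n=5):
    """Selection: apply filters coarse to fine. Evaluation: score what is left."""
    survivors = list(candidates)
    for keep in filters:
        survivors = [c for c in survivors if keep(query, c)]
    ranked = sorted(survivors, key=lambda c: evaluate(query, c), reverse=True)
    return ranked[:top_n]


def same(field):
    """Hypothetical filter factory: keep candidates that agree with the query on one field."""
    return lambda q, c: q.get(field) == c.get(field)


def overlap(q, c):
    """Hypothetical evaluation: number of query attributes the candidate matches."""
    return sum(1 for k, v in q.items() if c.get(k) == v)
```

Each filter level shrinks the set that the next, more expensive level has to look at, so the costly evaluation runs on only a few candidates.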

Page 24: Rapid pruning of search space through hierarchical matching

Where time went...

• Attribute selection
• Ranking
• Optimization
• Index re-generation
• Regression testing

Page 25: Rapid pruning of search space through hierarchical matching

Sources for Tune Up

• Scaling Solr, Lucene Revolution, May 2011
• Practical Search with Solr: Beyond just Looking it Up, Lucid Imagination, May 2010

Page 26: Rapid pruning of search space through hierarchical matching

Testing

• Precision testing using self and mixed modes
• Latency tests
  - custom harness for stand-alone tests
  - integrated tests with JMeter framework
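A stand-alone latency harness could be as small as the sketch below. It times a callable per request and reports nearest-rank percentiles; the function names are invented for illustration and this is not the harness used in the project.

```python
import math
import time


def percentile(samples, p):
    """Nearest-rank percentile for 0 < p <= 100 over a non-empty sample list."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def measure_ms(fn, requests):
    """Call fn once per request and return the wall-clock latencies in milliseconds."""
    latencies = []
    for request in requests:
        start = time.perf_counter()
        fn(request)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies
```

Reporting high percentiles rather than the mean matters when the requirement is a hard per-transaction bound such as the < 100 ms target quoted earlier.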

Page 27: Rapid pruning of search space through hierarchical matching

 

Results  


Page 28: Rapid pruning of search space through hierarchical matching

Latency Percentiles

[Chart. Legend labels: original edismax; Initial solution; Optimization 1: Filters, Focused search, verification; Optimization 2: Domain patterns, Stop words, de-dupe]

Page 29: Rapid pruning of search space through hierarchical matching

TPS  


Page 30: Rapid pruning of search space through hierarchical matching

Response  Times  over  Time  


Page 31: Rapid pruning of search space through hierarchical matching

Project Execution

• Agile Methodology
• Risk mitigation through primary and contingency plans
• Rapid prototyping followed by good software engineering practices
• Evaluating DSE (DataStax) & Solr Cloud

Page 32: Rapid pruning of search space through hierarchical matching

Gleanings  

• You can classify anything with Lucene/Solr; the lexicon is your own
• The question is not whether Lucene/Solr can solve a particular classification problem, but whether you can prioritize among the many ways of doing it
• If you run into a problem, someone has solved it or will solve it in the near future

Page 33: Rapid pruning of search space through hierarchical matching

Gleanings …

• Deal with accuracy before latency
• If precision, latency and scale are all critical to your domain, expect to invest some time in hierarchical abstractions
• "Index once, run anytime, anywhere" does not apply during development
• Throwing all data at Lucene/Solr will not work for mission-critical applications
• Rapid prototyping and willingness to fail

Page 34: Rapid pruning of search space through hierarchical matching

Summary  

     

Simplify and match at multiple levels of abstraction

Page 35: Rapid pruning of search space through hierarchical matching

Contributors

Chandra Mouleeswaran: Research & Prototyping
Fang Chen: Research & Prototyping
Luke Mertens: Productization & Scalability
Brent Pearson: Release Management
Tracy Hsu: Precision Testing & QA
Srinivas Nayani: Deployment & QA

Page 36: Rapid pruning of search space through hierarchical matching

COMMENTS & FEEDBACK: Chandra Mouleeswaran [email protected]
