Rapid pruning of search space through hierarchical matching

Post on 31-May-2015


Presented by Chandra Mouleeswaran, Co-Chair at Intellifest.org, ThreatMetrix. This talk presents our experience using Lucene/Solr for the classification of user and device data. ThreatMetrix, Inc. handles a huge volume of volatile data every day. The primary challenge is to classify each incoming transaction rapidly and precisely by searching a huge index within a very strict latency specification. The audience will be taken through the various design choices and the lessons learned. The talk also details a hierarchical search procedure that systematically divides the search space into manageable partitions while maintaining precision.

Transcript of Rapid pruning of search space through hierarchical matching

RAPID PRUNING OF SEARCH SPACE THROUGH HIERARCHICAL MATCHING Chandra Mouleeswaran Machine Learning Scientist, ThreatMetrix Inc.

5/2/13   1  

My Background

• Machine Learning Scientist at ThreatMetrix Inc.
• Co-Chair, Developer Programs, IntelliFest.org, Oct 2013, San Diego, CA

Career Path
- Siemens Corporate Research: Learning & Expert Systems
- Technology division of Donaldson, Lufkin and Jenrette company (Pershing): Artificial Intelligence Group
- Network Monitoring
- Several startups: Classification, Web Crawling, Security, Financial Trading etc.

Outline

• Task description
• Approaches
• Why search paradigm?
• Hierarchical matching
• Results
• Acknowledgments

The Device Identification Task

• Computationally, it's a CLASSIFICATION problem: { a0, a1, a2, a3, ..., an } → { ci }
  ai = ( attribute | field | key ) value
  ci = ( label | signature | class | hash )
• Returning devices should be correctly identified within certain tolerances
• New classes may be created if a good match is not found in the repository of known devices
• Devices age out, based on data retention policy
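The task statement above can be sketched as follows. This is a minimal illustration only: the Jaccard overlap measure, the `classify` function, the 0.5 threshold, and the label naming scheme are all assumptions made for the example, not the method described in the talk.

```python
def similarity(a: dict, b: dict) -> float:
    """Jaccard overlap of attribute-value pairs (an assumed measure)."""
    pairs_a, pairs_b = set(a.items()), set(b.items())
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


def classify(device: dict, repository: dict, threshold: float = 0.5) -> str:
    """Return the label of the best match, or register a new class."""
    best_label, best_score = None, 0.0
    for label, known in repository.items():
        score = similarity(device, known)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is not None and best_score >= threshold:
        return best_label  # returning device, within tolerance
    new_label = f"c{len(repository)}"  # hypothetical naming scheme
    repository[new_label] = dict(device)
    return new_label


repo = {"c0": {"os": "linux", "browser": "firefox", "tz": "UTC", "lang": "en"}}
returning = {"os": "linux", "browser": "firefox", "tz": "UTC", "lang": "de"}
unknown = {"os": "ios", "browser": "safari", "tz": "PST", "lang": "en"}
print(classify(returning, repo))  # one attribute changed, still matches c0
print(classify(unknown, repo))    # no good match, so a new class is created
```

A linear scan like this is only workable for a toy repository; the rest of the talk is about avoiding exactly this cost at scale.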

Task Challenges

• Extremely volatile attributes
• There are no pivot attributes to divide and conquer the search space
• Changing distributions
• Emphasis on PRECISION
• Stringent RESPONSE time

Engineering Challenges

• Precision (accuracy) and latency (response time) are antagonistic constraints
• Project management

                 Repository Size (millions)   Load (TPS)   Latency (ms)
Project start    28                           200          < 100
Present          280                          300          < 100
Change           10 X                         1.5 X        None

Approaches

• Rules engine
• Learning models
• Vector space models

Need an enterprise-grade solution!

Rules Engine

• No experts
• Number of rules?
• Maintenance?

Not a viable approach!

Learning Models

• Most machine learning methods deal predominantly with binary classification problems (e.g., fraud / not fraud) or a small number of target classes
• Few exemplars for each class
• Attribute values may be unbounded
• Attributes may not follow a natural progression

Learning Models …

• Unsupervised learning such as clustering methods would make good models, but not good enough to be of practical use. Any simplification process will compromise on accuracy
• Ability to explain is critical
• Tend to ignore domain knowledge

Challenge in providing enterprise solution

Thoughts

• No comparable application with such requirements
• Build and deploy a classifier that explains itself easily, scales temporally and offers quick response
• Use domain knowledge to guide verification
• Improve the classifier through machine learning methods by analyzing performance in the field

Vector-Space Models

• Similarity-based search makes the vector-space model a good choice for generating selections
• Given the volatile nature of data, information retrieval (IR) systems can adapt easily
• Good at neighborhood search

Sensitive to individual attribute changes!

Sources of Inspiration

• Lucene/Solr features
• Documentation from (erstwhile) Lucid Imagination
• Ease with which Lucene/Solr could be installed and explored

Very short learning curve for novices!

Feature Selection

• Primitive and derived attributes
• Entropy
• Distribution
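The slide lists entropy as a selection criterion without saying how it was applied. One plausible reading, sketched here as an assumption, is to compute the Shannon entropy of each attribute's observed values: an attribute with near-zero entropy barely distinguishes devices, while a high-entropy attribute discriminates well. The attribute names and sample values are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values) -> float:
    """Shannon entropy, in bits, of the observed value distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical observations of two attributes across four devices.
observations = {
    "screen_size": ["1080p", "1080p", "1080p", "1080p"],  # no discriminating power
    "font_list_hash": ["f1", "f2", "f3", "f4"],            # unique per device
}
for name, values in observations.items():
    print(name, shannon_entropy(values))
```

High entropy alone does not make an attribute useful: a highly volatile attribute can also have high entropy, which is presumably why the slide pairs entropy with distribution.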

 


Domain

• Devices come with structural information but not much grammar or semantics
• Bag-of-words (single field) approach is fast but not precise
• Using all fields is precise but response is slow

Now what?

Disjunction Max

• Matrix of all possible combinations of user input query and document fields
• Transforms into a Boolean query of DisjunctionMaxQueries, one per row
• The maximum score of the sub-clauses is used by DisjunctionMaxQuery
• No single term in user input dominates

This is needed!

Source: SearchHub and LucidWorks
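The scoring rule described above can be sketched in a few lines. In Lucene, a DisjunctionMaxQuery scores a document as the maximum sub-clause score plus a tie-breaker multiplier times the sum of the remaining sub-clause scores. The sketch below models only that rule; boosts, normalization, and the actual per-field similarity are left out, and the score matrix is made-up data.

```python
def dismax_score(field_scores, tie=0.0):
    """Max sub-clause score plus tie * sum of the remaining scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def query_score(term_field_scores, tie=0.0):
    """Boolean sum over one DisMax clause per query term (one matrix row each)."""
    return sum(dismax_score(row, tie) for row in term_field_scores)


# Rows are query terms, columns are document fields (illustrative numbers).
matrix = [
    [0.9, 0.1, 0.0],  # term 1 matches field 1 strongly
    [0.2, 0.4, 0.0],  # term 2 matches field 2 best
]
print(query_score(matrix))           # only the best field per term counts
print(query_score(matrix, tie=0.5))  # other fields contribute at half weight
```

Because each term contributes only its best field (plus a damped remainder), a term that happens to match many fields cannot swamp the total, which is the "no single term dominates" property the slide calls out.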

DisMax Experiments (index size = 60 million)

Scenario 1

mm = 2
Solr fields = { a1, a2, a3 }
Values = { phrase1, phrase2, phrase3 }
Must-Match Clauses
Latency: YES (35 ms)
Precision: NO (20% failure)

Scenario 2

mm = 50%
Solr fields = { a1 }
Values = { term1, term2, term3, ..., termn }
Should-Match Clauses
Latency: NO (> 2 seconds)
Precision: YES (> 98%)
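For readers unfamiliar with the parameters, the two scenarios roughly correspond to Solr requests like the ones built below. `defType`, `q`, `qf`, and `mm` are standard Solr DisMax/eDisMax parameters; the URL, the collection name `devices`, the placeholder field names and values, and the choice of `edismax` (suggested by the later results slide) are assumptions. How the original queries were actually constructed is not given in the slides.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; host and collection name are placeholders.
SOLR_SELECT_URL = "http://localhost:8983/solr/devices/select"


def dismax_params(clauses, fields, mm):
    """Build request parameters for a DisMax-style query."""
    # Quote multi-word clauses so they are treated as phrases.
    quoted = [f'"{c}"' if " " in c else c for c in clauses]
    return {
        "defType": "edismax",
        "q": " ".join(quoted),
        "qf": " ".join(fields),  # fields to search
        "mm": mm,                # minimum-should-match
    }


# Scenario 1: three phrases over three fields, at least two must match.
scenario_1 = dismax_params(
    ["phrase one", "phrase two", "phrase three"], ["a1", "a2", "a3"], "2"
)
# Scenario 2: many single terms over one field, at least half must match.
scenario_2 = dismax_params(["term1", "term2", "term3"], ["a1"], "50%")

for params in (scenario_1, scenario_2):
    print(f"{SOLR_SELECT_URL}?{urlencode(params)}")
```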

Possible Workaround

• Look-ahead: Customize Lucene/Solr to do a branch-and-bound search, bail out on some lower bound score
• Minimize candidates for DisMax search
  - reduce total number of Solr instances to search
  - reduce total number of disjunctive terms

[ Empirical estimate: $t_n = 2 \cdot t_{n-1}$, where t = time and n = number of disjunctive terms ]
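The empirical estimate above is a doubling recurrence, so it unrolls to t_n = t_1 * 2^(n-1): every disjunctive term removed halves the estimated search time. The sketch below only works through that arithmetic; the base latency of 5 ms is an arbitrary illustrative value, not a measurement from the talk.

```python
def estimated_latency(n_terms: int, base_ms: float) -> float:
    """Latency under t_n = 2 * t_(n-1), with t_1 = base_ms."""
    if n_terms < 1:
        raise ValueError("n_terms must be at least 1")
    return base_ms * 2 ** (n_terms - 1)


def speedup_from_removing(k: int) -> int:
    """Factor by which the estimate shrinks when k terms are removed."""
    return 2 ** k


for n in (1, 4, 7, 10):
    print(n, estimated_latency(n, base_ms=5.0))
print(speedup_from_removing(3))
```

Under this estimate, trimming a 10-term query to 7 terms cuts the projected time by a factor of 8, which explains the emphasis on reducing the number of disjunctive terms.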


Phrases over Terms

• Used collocation (co-occurrence matrix) to determine most common phrases
• Delete terms covered by phrases
• Add stop words based on frequency analysis
• Ensure precision is preserved through regression tests

Reduced the number of DisMax terms by 30%
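The first two steps above can be sketched as follows, under simplifying assumptions: phrases are limited to adjacent token pairs, "most common" means a raw count threshold, and the token lists are invented sample data. The talk does not specify the collocation measure or phrase length that was actually used.

```python
from collections import Counter


def frequent_phrases(documents, min_count=2):
    """Count adjacent token pairs and keep those seen at least min_count times."""
    pairs = Counter()
    for tokens in documents:
        pairs.update(zip(tokens, tokens[1:]))
    return {pair for pair, count in pairs.items() if count >= min_count}


def compress(tokens, phrases):
    """Replace adjacent terms covered by a phrase with the phrase itself."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(f"{tokens[i]} {tokens[i + 1]}")
            i += 2  # both terms are covered by the phrase
        else:
            out.append(tokens[i])
            i += 1
    return out


docs = [
    ["mozilla", "5.0", "windows", "nt"],
    ["mozilla", "5.0", "linux"],
    ["opera", "windows", "nt"],
]
phrases = frequent_phrases(docs)
print(compress(docs[0], phrases))  # four terms collapse into two phrases
```

Fewer query clauses means fewer disjunctive terms for DisMax to evaluate, at the cost of needing regression tests to confirm that phrase matching has not hurt precision.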


Sources of Inspiration

• Planning in a Hierarchy of Abstraction Spaces, Artificial Intelligence, Vol. 5, No. 2, pp. 115-135 (1974)
• Search Reduction in Hierarchical Problem Solving, Proc. of the 9th IJCAI, AAAI Press, Menlo Park, CA (1991)
• Exceptional Data Quality Using Intelligent Matching and Retrieval, AI Magazine, AAAI Press (Spring 2010)

Hierarchical Matching

[Diagram. Labels, in extraction order: Bag of words; Models; Phrases; Filters; DisMax; Query Formulator; Domain-specific patterns; CSV/JSON; Solr instances selector; To Solr Servers; Verification]

Conflict Resolution

• Top n candidates are returned from each Solr instance
• They are ranked based on custom verification module
• Ties are broken using recency
• Top candidate is persisted and returned along with custom score
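The merge-and-rank step above can be sketched as follows. The `Candidate` type, its field names, and the assumptions that a higher verification score is better and a larger timestamp is more recent are all illustrative; the custom verification module itself is not described in the slides, so its score is taken as given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str
    score: float    # custom verification score (assumed: higher is better)
    last_seen: int  # timestamp (assumed: larger is more recent)


def resolve(per_instance_results):
    """Merge the top-n lists from all instances and pick one winner."""
    merged = [c for results in per_instance_results for c in results]
    if not merged:
        return None
    # Rank by verification score; break ties by recency.
    return max(merged, key=lambda c: (c.score, c.last_seen))


results = [
    [Candidate("a", 0.9, 100), Candidate("b", 0.7, 300)],  # from instance 1
    [Candidate("c", 0.9, 200)],                            # from instance 2
]
winner = resolve(results)
print(winner.label, winner.score)  # "c" wins the tie with "a" on recency
```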


Comments

• DisMax performs multidimensional match
• Extracted multiple filters and arranged them hierarchically
• Separation of selection and evaluation
  - Selection = approximate solution
  - Evaluation = refinement
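The split between selection and evaluation can be sketched as a cascade: cheap filters, ordered from coarse to fine, shrink the candidate set, and an expensive scoring function runs only on the survivors. The filters, the scoring function, and the rule of skipping a filter that would leave too few candidates are hypothetical choices for illustration; the actual filters used in the talk are not described.

```python
def select(candidates, filters, min_keep=1):
    """Apply filters in order; stop narrowing if one would leave too few."""
    current = list(candidates)
    for keep in filters:
        narrowed = [c for c in current if keep(c)]
        if len(narrowed) < min_keep:
            break  # this filter over-prunes; keep the previous level
        current = narrowed
    return current


def evaluate(query, candidates, score):
    """Refinement: score only the survivors and return the best one."""
    return max(candidates, key=lambda c: score(query, c), default=None)


def overlap(query, device):
    """Number of attribute values shared with the query (assumed measure)."""
    return sum(1 for key, value in query.items() if device.get(key) == value)


query = {"os": "linux", "browser": "firefox", "lang": "en"}
devices = [
    {"id": 1, "os": "linux", "browser": "firefox", "lang": "de"},
    {"id": 2, "os": "linux", "browser": "chrome", "lang": "en"},
    {"id": 3, "os": "ios", "browser": "safari", "lang": "en"},
]
filters = [
    lambda d: d["os"] == query["os"],            # coarse level
    lambda d: d["browser"] == query["browser"],  # finer level
]
survivors = select(devices, filters)
best = evaluate(query, survivors, overlap)
print([d["id"] for d in survivors], best["id"])
```

The fallback in `select` matters when attributes are volatile: if a changed attribute would filter out every candidate, the search falls back to the broader set instead of returning nothing.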


Where time went..

• Attribute selection
• Ranking
• Optimization
• Index re-generation
• Regression testing

Sources for Tune Up

• Scaling Solr, Lucene Revolution, May 2011
• Practical Search with Solr: Beyond just Looking it Up, Lucid Imagination, May 2010

Testing

• Precision testing using self and mixed modes
• Latency tests
  - custom harness for stand-alone tests
  - integrated tests with JMeter framework
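A stand-alone latency harness of the kind mentioned above might look like the sketch below. It is an assumption about the general shape of such a tool, not the harness used in the talk: `measure_latencies` times an arbitrary callable, and `percentile` uses the standard library's `statistics.quantiles` with inclusive interpolation. The stand-in `fake_search` function is hypothetical.

```python
import statistics
import time


def measure_latencies(fn, inputs):
    """Call fn on each input and record wall-clock latency in milliseconds."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(latencies, p):
    """p-th percentile (1..99) using inclusive interpolation."""
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return cuts[p - 1]


def fake_search(query):
    """Stand-in for a real search call (hypothetical)."""
    return sum(range(1000))


samples = measure_latencies(fake_search, range(200))
for p in (50, 95, 99):
    print(f"p{p}: {percentile(samples, p):.3f} ms")
```

Reporting percentiles instead of the mean is what makes a requirement such as "< 100 ms" checkable, since a small fraction of slow queries can hide behind a good average.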


 

Results  


Latency Percentiles

[Chart. Series: Initial solution (original edismax); Optimization 1: filters, focused search, verification; Optimization 2: domain patterns, stop words, de-dupe]

TPS  


Response  Times  over  Time  


Project Execution

• Agile Methodology
• Risk mitigation through primary and contingency plans
• Rapid prototyping followed by good software engineering practices
• Evaluating DSE (DataStax) & Solr Cloud

Gleanings

• You can classify anything with Lucene/Solr; the lexicon is your own
• The question is not whether Lucene/Solr can solve a particular classification problem, but whether you can prioritize among the many ways of doing it
• If you run into a problem, someone has solved it or will solve it in the near future

Gleanings …

• Deal with accuracy before latency
• If precision, latency and scale are all critical to your domain, expect to invest some time in hierarchical abstractions
• "Index once, run anytime, anywhere" does not apply during development
• Throwing all data at Lucene/Solr will not work for mission-critical applications
• Rapid prototyping and willingness to fail

Summary

Simplify and match at multiple levels of abstraction

Contributors

• Chandra Mouleeswaran: Research & Prototyping
• Fang Chen: Research & Prototyping
• Luke Mertens: Productization & Scalability
• Brent Pearson: Release Management
• Tracy Hsu: Precision Testing & QA
• Srinivas Nayani: Deployment & QA

COMMENTS & FEEDBACK: Chandra Mouleeswaran cmouleeswaran@threatmetrix.com
