Tech Lab Series - Episode II - Back to Normal


Page 1: Tech Lab Series - Episode II - Back to Normal

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Page 2: Tech Lab Series - Episode II - Back to Normal

Episode 2: Back to Normal
Tech Lab Webcast | September 24, 2014

Sponsored by

Page 3: Tech Lab Series - Episode II - Back to Normal

What Is the Tech Lab?

• Real-world proving ground for enterprise software
• Designed to showcase the process of creating solutions
• Completely independent of sponsor influence
• Run by Master Scientist, Dr. Geoffrey Malafsky
• Projects span 3-6 months

Page 4: Tech Lab Series - Episode II - Back to Normal

What Is Data Normalization?

• Data Normalization is a process by which disparate data sets, terms, models and ontologies can be reconciled for the purpose of providing certifiably accurate enterprise data.

Page 5: Tech Lab Series - Episode II - Back to Normal

Why Is Normalization Necessary?

• Disparate Data Systems
• Disparate File Structures
• Disparate Data Models
• Variable Business Logic
• Conflicting Data Values
• Serious Semantic Issues

Page 6: Tech Lab Series - Episode II - Back to Normal

How Hadoop Can Help

• Robust platform for data persistence
• Relatively easy to connect to enterprise apps
• Enables 'future-proofing' by avoiding lock-in
• Growing array of parallel processing functions
• New standard for data management
• No need to delete data, enabling roll-back

Page 7: Tech Lab Series - Episode II - Back to Normal

Questions?

Page 8: Tech Lab Series - Episode II - Back to Normal

Thank you!

FIND THE ARCHIVE AT InsideAnalysis.com

Page 9: Tech Lab Series - Episode II - Back to Normal

DATA SCIENCE AND HADOOP TO NORMALIZE CORPORATE DATA

Page 10: Tech Lab Series - Episode II - Back to Normal

• Normalizing data is more sophisticated than what is commonly done in integration
• It combines subject matter knowledge, governance, business rules, and raw data.
• Small Data is "corporate structured data that is the fuel of its main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence."

Page 11: Tech Lab Series - Episode II - Back to Normal

The State of Corporate Data

• multiple instances of source data
• multiple definitions for reporting
• multiple copies of data
• variable structures
• different data values
• hidden conflicts in data definitions
• which source to use
• different model types & standards
• more storage, esp. when multiplied by environments
• more data flows to develop and maintain
• more than 100 DW or data marts downstream
• different methods for ETL
• complex dependencies, difficult for impact assessment
• conflicting business logic & views
• global analyses & aggregations restricted by inconsistencies

Copyright PSIKORS Institute 2013

Page 12: Tech Lab Series - Episode II - Back to Normal


Page 13: Tech Lab Series - Episode II - Back to Normal

Data Normalization Showcase

• FPDS is an open source of Federal Procurement data that has poor quality and consistency.
  – Approx. 10M+ records, each with 306 columns = 25 GB raw text
  – Structured data except for some free text fields
• We are normalizing it for analysis of IT expenditures for a real client
• Queries are used by analysts supported by Hadoop environment via Data Normalization platform

Page 14: Tech Lab Series - Episode II - Back to Normal

Normalization Begins with Understanding Data

• Databases are supposed to have official information on formal acquisition of IT assets.
  – Contracts DB not aligned with Procurement DB
    • Example: FA330012Dxxx in one but not the other
• Differing data sets and values
  – FA330012F0005: Same in both
  – FA330012P0020: Contracts DB: 10 items; FPDS: 1 item; same description, same total dollars
  – HQ042312*: Contracts DB: 6 records = $278.4K; FPDS: 1 record = $48K
    • $48K is one of 6 records in Contracts

Page 15: Tech Lab Series - Episode II - Back to Normal
Page 16: Tech Lab Series - Episode II - Back to Normal

Converting supposedly same primary keys into normalized values that can be compared: contract number

Key business logic as buried in a database stored procedure (condensed):

• If (DELIVERY_ORDER=NULL) v_piid = CONTRACT
  else v_piid = DELIVERY_ORDER
• If (x1='0') v_modification_number = '0'
  else v_modification_number = x2
  – where x1: if (ACO_MOD=NULL) x1 = x3 else x1 = ACO_MOD
  – where x3: if (PCO_MOD=NULL) x3 = '0' else x3 = PCO_MOD
  – where x2: if (x4=NULL) x2 = '0' else x2 = x4
  – where x4: x4 = LTRIM(x5)
  – where x5: x5 = x1
  – essentially this first tries to use ACO_MOD, and if this is NULL then it tries to use PCO_MOD and sets = '0' if these are NULL
• If (DELIVERY_ORDER=NULL) v_idv_piid = y1
  else v_idv_piid = CONTRACT
  – where y1: y1 = REF_PROC_INSTRUMENT with all '-' characters removed
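The condensed stored-procedure logic on this slide can be sketched as a small Python function. This is only an illustration under stated assumptions: the function and class names are hypothetical, the sample contract values are made up, and treating an empty string after LTRIM as NULL is an assumption (the slide does not name the database or say how it handles empty strings).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContractKeys:
    """Normalized key values (hypothetical container, not from the slides)."""
    piid: str
    modification_number: str
    idv_piid: Optional[str]


def normalize_contract_keys(
    contract: str,
    delivery_order: Optional[str],
    aco_mod: Optional[str],
    pco_mod: Optional[str],
    ref_proc_instrument: Optional[str],
) -> ContractKeys:
    # v_piid: the delivery order wins when present, otherwise the contract.
    piid = contract if delivery_order is None else delivery_order

    # x1: prefer ACO_MOD, fall back to PCO_MOD, then to '0'.
    raw_mod = aco_mod if aco_mod is not None else pco_mod
    if raw_mod is None:
        raw_mod = "0"

    # x4 = LTRIM(x1): strip leading spaces only.
    # Assumption: an empty result is treated like NULL and becomes '0'.
    trimmed = raw_mod.lstrip(" ")
    modification_number = "0" if raw_mod == "0" or trimmed == "" else trimmed

    # v_idv_piid: REF_PROC_INSTRUMENT without '-' when there is no delivery
    # order, otherwise the contract number.
    if delivery_order is None:
        idv_piid = (
            None if ref_proc_instrument is None
            else ref_proc_instrument.replace("-", "")
        )
    else:
        idv_piid = contract

    return ContractKeys(piid, modification_number, idv_piid)


# Made-up sample values, for illustration only.
no_order = normalize_contract_keys("FA330012D0001", None, None, None, "GS-35F-0001X")
with_order = normalize_contract_keys("FA330012D0001", "0005", "  P00002", "01", None)
```

Keeping the rule in one visible function, rather than in nested x1..x5 variables, makes it easier to compare keys from the Contracts DB and FPDS on equal terms.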

Page 17: Tech Lab Series - Episode II - Back to Normal

SQL Queries via Hue: Impala

Page 18: Tech Lab Series - Episode II - Back to Normal

SQL Queries via Hue: Hive

Page 19: Tech Lab Series - Episode II - Back to Normal

Querying Impala From Data Normalization System

Page 20: Tech Lab Series - Episode II - Back to Normal

Simplifying Queries and Tying to Authoritative Management

Page 21: Tech Lab Series - Episode II - Back to Normal

Storing Term Rules in Master Codes

Note wildcard character (*) in middle as well as front and back
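A term rule with wildcards at the front, middle, and back can be matched by translating each `*` into "any run of characters". The sketch below is an assumption about how such a rule might be evaluated, not the Data Normalization system's own implementation; the function names and sample rules are hypothetical.

```python
import re


def compile_term_rule(rule: str) -> re.Pattern:
    """Turn a rule such as '*soft*licen*' into a case-insensitive regex.

    Every '*' matches any run of characters (including none); all other
    characters are matched literally.
    """
    literal_parts = [re.escape(part) for part in rule.split("*")]
    return re.compile(".*".join(literal_parts), re.IGNORECASE | re.DOTALL)


def matches(rule: str, text: str) -> bool:
    # fullmatch: the rule must describe the whole text, so a rule without a
    # leading '*' only matches text that starts with its first token.
    return compile_term_rule(rule).fullmatch(text) is not None


# Hypothetical rules and descriptions, for illustration only.
hit = matches("*soft*licen*", "Microsoft software license renewal")
miss = matches("soft*", "Microsoft")
```

Because the literal parts are escaped, characters such as `.` or `+` in a stored rule are not interpreted as regex syntax.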

Page 22: Tech Lab Series - Episode II - Back to Normal

Complicated Queries are Often Needed
Looking for a combination of keywords with wildcards along with structured values

SELECT recordid, contractingagencyid, contractingagencyname, orgcode, orgid, modificationnumber,
  piid, piidagencyid, solicitationid, effectivedate, fiscalyear, fundingagencyid, fundingagencyname,
  typeofcontract, consolidatedcontractdesc, descofreq, naicscode, naicsdesc, productorservicecode,
  productorservicedesc, globaldunsnumber, dunsnumber, globalvendorname, vendorname, datesigned,
  referencedidvpiid, referencedidvagencyid, referencedidvmodnumber, contractingdepartmentid,
  contractingdepartmentname, contractingofficeid, contractingofficename, contractingofficeregion,
  funcdimenddate, funcdimstartdate, function1, function1value, function2, function2value, function3,
  function3value, majorcommandcode, majorcommandid, majorcommandname, parentmacomcode,
  primarydimensionid, primarydimensionvalueid, secondarydimensionid, secondarydimensionvalueid,
  subcommand1code, subcommand1id, subcommand1name, subcommand2code, subcommand2id, subcommand2name,
  subcommand3code, subcommand3id, subcommand3name, subcommand4code, subcommand4id, subcommand4name,
  tertiarydimensionid, tertiarydimensionvalueid, transactionnumber, lastdatetoorder, completiondate,
  estultimatecompletiondate, signeddate, fundingofficeid, fundingofficename,
  isfundedforeignentitycode, isfundedforeignentitydesc, reasoninteragencycontracting,
  feeforuseofservice, fixed, lowervalue, maximumorderlimit, orderingprocedure, uppervalue, websiteurl,
  whocanuse, feepaidforuseofidv, programacronym, typeofidc, a76actioncode, a76actiondesc,
  contingencyhumanitarianpeaceop, contractfinancing, costacctstdclausecode, costacctstdclausedesc,
  costorpricingdata, emailaddress, gfegfpcode, gfegfpdesc, inherentlygovernmentaldesc,
  inherentlygovernmentalfunction, lettercontractundefactioncode, lettercontractundefactiondesc,
  majorprogram, multipleorsingleawardidv, multiyearcontractcode, multiyearcontractdesc,
  nationalinterestaction, nationalinterestdesc, numberofactions, performancebasedserviceacqcode,
  performancebasedserviceacqdesc, purchasecardpaymethodcode, purchasecardpaymethoddesc,
  seatransportation, subcontractplan, treasuryacctsymbolagencyid, treasuryacctsymbolinitiative,
  treasuryacctsymbolmaincode, treasuryacctsymbolsubcode, clingercohenactcode, clingercohenactdesc,
  davisbaconactcode, davisbaconactdesc, economyact, interagencycontractingauthcode,
  interagencycontractingauthdesc, otherstatutoryauthdesc, servicecontractactdesc,
  servicecontractactcode, walshhealeyactcode, walshhealeyactdesc, bundledreqs, claimantprogramcode,
  consolidatedcontractcode, domesticorforeignentitycode, domesticorforeignentitydesc,
  infotechcommercialitemcategory, recoveredmaterialssustain, recoveredmaterialssustaindesc,
  systemequipmentcode, useofepadesignatedproducts, congrdistrictplaceofperf, placeofperfzipcode,
  princplaceofperfcityname, princplaceofperfcountrycode, princplaceofperfcountryname,
  princplaceofperfcountycode, princplaceofperfcountyname, princplaceofperflocationcode,
  princplaceofperfstatecode, countryprodserviceorigincode, placeofmanufacture, placeofmanufacturedesc,
  alternativeadvertising, commercialitemacqperoccode, commercialitemacqperocdesc,
  commercialitemtestprogram, commercialitemtestprogramdesc, evaluatedpreference, extentcompeted,
  fairopportunitylimitedsources, fedbizoppscode, fedbizoppsdesc, localareasetasidecode,
  localareasetasidedesc, numberofoffersreceived, otherthanfullopencompetition, preawardyosynopsis,
  priceevaluationpercentdiff, sbaorofppsynopsiswaiverpilot, sbirsttr, smallbuscompdemoprog,
  solicitationperoc, typeofsetaside, awardoridvtype, createdvia, lastmodifiedby, lastmodifieddate,
  part8orpart13, preparedby, prepareddate, reasonformodificationcode, reasonformodificationdesc,
  congrdistrictcontractor, contractorname, doingbusasname, samexception, street, street2, vendorcity,
  vendorcountry, vendorphonenumber, vendorstate, zip, is1862landgrantcollege, is1890landgrantcollege,
  is1994landgrantcollege, isairportauth, isalaskannativecorpownedfirm, isalaskannativeservicinginst,
  isamericanindianowned, isasianpacificamericanowned, isblackamericanowned, isbothcontractsandgrants,
  iscity, iscommdevelopedcorpownedfirm, iscommdevelopmentcorp, iscontracts,
  iscorporateentitynottaxexempt, iscorporateentitytaxexempt, iscouncilofgovernments,
  iscountryofincorporation, iscounty, isdomesticshelter, isdotcertdisbusent, iseducationalinst,
  isemergingsmallbus, isfederalagency, isfedfundedresanddevcorp, isforprofitorg, isforeigngovernment,
  isforeignownedandlocated, isfoundation, isgrants, ishispanicamericanowned, ishispanicservicinginst,
  isvendorhbcu, ishospital, ishousingauthpublictribal, isindiantribe, isintermunicipal,
  isinternationalorg, isinterstateentity, islaborsurplusareafirm, islimitedliabilitycorp,
  islocalgovernmentowned, ismanufacturerofgoods, isminorityinsts, isminorityownedbus, ismunicipality,
  isnativeamericanowned, isnativehawaiianorgownedfirm, isnativehawaiianservicinginst, isnonprofitorg,
  isotherminorityowned, isothernotforprofitorg, ispartnershipllp, isplanningcommission, isportauth,
  isprivateuniversityorcollege, issbacert8ajointventure, issbacert8aprogparticipant,
  issbacerthubzonefirm, issbacertsmalldisbus, isschooldistrict, isschoolofforestry,
  isselfcertifedsmalldisbus, isservicedisabledvetownedbus, issmallagriculturalcooperative,
  issoleproprietorship, isstatecontrinsthigherlearn, isstateofincorporation, issubchapterscorp,
  issubcontasianindianamerowned, istheabilityoneprog, istownship, istransitauth, istribalcollege,
  istriballyowned, isusfederalgovernment, isusgovernmententity, isuslocalgovernment,
  isusstategovernment, isveteranownedbus, isveterinarycollege, isveterinaryhospital, iswomanownedbus,
  istypeecondiswosb, istypejventecondiswosb, istypejventwosb, istypewosb,
  contractingo{ussizeselection, reasonnotawardedtosmallbus, reasonnotawardedtosmalldisbus,
  idvbundledreqs, idvcontractingagencyid, idvcontractingagencyname, idvcontractingo{ussizesel,
  idvdepartmentid, idvdepartmentname, idvmajorprogcode, idvmultipleorsingleawardidv, idvnaicscode,
  idvnaicsdesc, idvpart8orpart13, idvprogacronym, idvreferencedidvagencycode, idvreferencedidvpiid,
  idvsubcontractplan, idvsubcontractplandesc, idvtypeofcontractpricing, idvtypeofcontractpricingdesc,
  idvtypeofidc, idvtypeofidcdesc, idvwhocanuse, idvwhocanusedesc, missing301, currentcontractvalue,
  actionobligation, ultimatecontractvalue
FROM fpdsrawrecords.records
WHERE ( ( ( LOWER(fundingagencyid) = '97as' ) )
  AND ( ( LOWER(fiscalyear) = '2013' ) )
  AND ( ( LOWER(productorservicecode) LIKE '70%' OR LOWER(productorservicecode) LIKE 'd3%' ) ) )
LIMIT 1000
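A query like the one on this slide mixes exact structured values with wildcard terms. As a rough sketch of how such a WHERE clause could be generated from stored rules, the Python below builds the same filter. The helper names are hypothetical, the `*` to `%` translation is an assumption about how term rules map to SQL LIKE, and the quoting is deliberately minimal (a production system would use parameterized queries).

```python
def sql_literal(value: str) -> str:
    """Lower-case a value and quote it as a SQL string literal."""
    return "'" + value.lower().replace("'", "''") + "'"


def build_where(equals: dict, like_any: dict) -> str:
    """Combine exact matches (AND) with groups of wildcard rules (OR)."""
    clauses = [
        f"LOWER({column}) = {sql_literal(value)}"
        for column, value in equals.items()
    ]
    for column, rules in like_any.items():
        alternatives = [
            # Assumption: '*' in a stored rule corresponds to '%' in LIKE.
            f"LOWER({column}) LIKE {sql_literal(rule.replace('*', '%'))}"
            for rule in rules
        ]
        clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)


def build_query(columns, table, equals, like_any, limit) -> str:
    column_list = ", ".join(columns)
    where = build_where(equals, like_any)
    return f"SELECT {column_list} FROM {table} WHERE {where} LIMIT {limit}"


# Filter values taken from the slide; the short column list is illustrative.
query = build_query(
    ["recordid", "piid"],
    "fpdsrawrecords.records",
    {"fundingagencyid": "97AS", "fiscalyear": "2013"},
    {"productorservicecode": ["70*", "D3*"]},
    1000,
)
```

Generating the clause from rules keeps analysts from hand-editing a 300-column statement each time a keyword or code prefix changes.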

Page 23: Tech Lab Series - Episode II - Back to Normal

Query Timing

• Looking for combinations of text tokens (with wildcards) combined with known field values
• Queries are done both in the Data Normalization platform and by command line interface on the Hadoop server for Impala and Hive. Time differences are negligible, but all times reported here are by CLI
  – Tables made for: text, Parquet, Parquet partitioned by 'fiscalyear' (6 values) and 'fundingagencyid' (approx. 25 values)

Page 24: Tech Lab Series - Episode II - Back to Normal

[Chart: FPDS Hadoop Query Times, Text Field (secs). Engines: Hive, Impala, SQLServer; series: Text, Parquet, Parquet Partitioned; y-axis: 0 to 400 seconds.]

Evaluating query performance in Hadoop relative to format and comparing to RDBMS

Page 25: Tech Lab Series - Episode II - Back to Normal

[Chart: FPDS Text Queries per LIMIT (secs). X-axis: 100 LIMIT, 1000 LIMIT, NO LIMIT; y-axis: 0 to 250 seconds; series: Hive Text, Impala Text, Hive Parquet, Impala Parquet, Hive Parquet Part, Impala Parquet Part.]

Page 26: Tech Lab Series - Episode II - Back to Normal

QUERY PERFORMANCE IMPROVEMENT WITH IMPALA

Justin Erickson | Director, Product Management, Cloudera

Page 27: Tech Lab Series - Episode II - Back to Normal

Impala's Benefits

• Unlocks BI/analytics on Hadoop
  – Interactive SQL in seconds
  – Highly concurrent to handle 100s of users
• Native Hadoop flexibility
  – No data migration, conversion, or duplication required
  – Query existing Hadoop data
  – Run multiple frameworks on the same data at the same time
  – Supports Parquet for best-of-breed columnar performance
• Native MPP query engine designed into Hadoop:
  – Unified Hadoop storage
  – Unified Hadoop metadata (uses Hive and HCatalog)
  – Unified Hadoop security
  – Fine-grained role-based access controls with Sentry
• Apache-licensed open source
• Deployed across customers today

©2014 Cloudera, Inc. All Rights Reserved.

Page 28: Tech Lab Series - Episode II - Back to Normal

Impala Architecture

• MPP query engine built natively into Hadoop

[Diagram: SQL App (ODBC) issuing a SQL request; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase; Hive Metastore, HDFS NN, and Statestore.]

Page 29: Tech Lab Series - Episode II - Back to Normal

Impala's Multi-User over 9.5x Faster

Page 30: Tech Lab Series - Episode II - Back to Normal

Multi-user hardware utilization

Page 31: Tech Lab Series - Episode II - Back to Normal

Performance Takeaways

• Impala's advantage expands with just 10 users to >9.5x nearest competitor
  – Predominantly attributable to CPU efficiency
• Does not particularly matter which DAG is run for Hive
  – Shark (with Spark) and Tez produce very similar results
  – Both incrementally faster batch processing but not comparable to MPP databases
  – Difference is Spark is already proven with broad community and vendor adoption
• Mid-term trends will further favor Impala's design approach
  – More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
  – CPU efficiency will increase in importance
  – Native code enables easy optimizations for CPU instruction sets (e.g. floating point operations, math operations, encrypt/decrypt)
  – The Intel joint roadmap helps support these opportunities
• Upcoming benchmarks on the latest releases show this gap widening

Page 32: Tech Lab Series - Episode II - Back to Normal

NORMALIZING THE DATA

Page 33: Tech Lab Series - Episode II - Back to Normal
Page 34: Tech Lab Series - Episode II - Back to Normal

Capture Business Rules and Make Visible, Changeable, and Useful

Page 35: Tech Lab Series - Episode II - Back to Normal
Page 36: Tech Lab Series - Episode II - Back to Normal

Custom Multi-Use Normalization Methods Ready for Hadoop Parallel Execution

Page 37: Tech Lab Series - Episode II - Back to Normal

Data Normalization Library Enables Rapid Build, Deploy, Change Cycles

Page 38: Tech Lab Series - Episode II - Back to Normal

Special Programming for Hadoop

• Which Hadoop libraries? Intertwined, so reference all.
• Otherwise: not much
  – HDFS filesystem
  – YARN containers

Page 39: Tech Lab Series - Episode II - Back to Normal
Page 40: Tech Lab Series - Episode II - Back to Normal
Page 41: Tech Lab Series - Episode II - Back to Normal

Parallel Jobs

• Three ways to run parallel jobs
  – Launch multiple Java sessions from command line
    • Same as in Windows, Linux
  – Use Cloudera Hue Job Designer
    • Easy and has management web pages
  – Data Normalization system
    • Coordinates governance, architecture, data models, codes, business rules
    • Define, submit YARN containers specifying Java jar, dictionaries, source files

Page 42: Tech Lab Series - Episode II - Back to Normal

Key Code Analysis

• Invoice data sets extracted with correlation
  – CAGE: 984274, DUNS: 973437
• FPDS DUNS and Names extracted & correlated
  – 158181 unique DUNS codes
• Will be included in normalized composite IT Asset records
• Composite records for lookup added to Hadoop
  – By DUNS or Global DUNS: get all related DUNS, CAGE, names
  – By CAGE: get all related DUNS, names
  – By name: get all related DUNS, CAGE, names
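The composite lookup described here can be pictured as an inverted index over vendor records. The sketch below is a minimal in-memory illustration, not the Hadoop-side implementation; the class name, field names, and sample records are all hypothetical, and it only follows direct links (records that share the queried value).

```python
from collections import defaultdict


class VendorLookup:
    """Index vendor records by DUNS, Global DUNS, CAGE, and name."""

    FIELDS = ("duns", "global_duns", "cage", "name")

    def __init__(self, records):
        self._records = list(records)
        # (field, value) -> positions of the records carrying that value
        self._index = defaultdict(set)
        for position, record in enumerate(self._records):
            for field in self.FIELDS:
                value = record.get(field)
                if value:
                    self._index[(field, value)].add(position)

    def _related(self, *keys):
        result = {"duns": set(), "cage": set(), "name": set()}
        for key in keys:
            for position in self._index.get(key, ()):
                record = self._records[position]
                for field in result:
                    if record.get(field):
                        result[field].add(record[field])
        return result

    def by_duns(self, code):
        # A code may appear either as a DUNS or as a Global DUNS.
        return self._related(("duns", code), ("global_duns", code))

    def by_cage(self, code):
        return self._related(("cage", code))

    def by_name(self, name):
        return self._related(("name", name))


# Made-up sample records, for illustration only.
lookup = VendorLookup([
    {"duns": "111111111", "global_duns": "999999999", "cage": "1AAA1", "name": "acme corp"},
    {"duns": "222222222", "global_duns": "999999999", "cage": "2BBB2", "name": "acme corporation"},
    {"duns": "333333333", "global_duns": None, "cage": "1AAA1", "name": "other llc"},
])
family = lookup.by_duns("999999999")
```

Querying by the Global DUNS returns every child DUNS, CAGE code, and name variant in one call, which is the kind of answer the composite records are meant to give.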

Page 43: Tech Lab Series - Episode II - Back to Normal

[Chart: Number CAGE Per DUNS Code. Y-axis: Number DUNS Codes With X CAGE Codes (log scale, 0.1 to 1000000); x-axis values: 1 to 21, 23, 24, 27, 35, 40, 43, 44, 46, 54, 71, 78, 90, 119.]

One DUNS code has 119 CAGE
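A distribution of this kind (how many DUNS codes have exactly X CAGE codes) can be computed with two passes of counting. The following is a small sketch with made-up pairs; the function name and input shape are assumptions, and at the scale reported in the slides the same grouping would run as a parallel job rather than in memory.

```python
from collections import Counter, defaultdict


def cage_count_distribution(pairs):
    """Map 'number of distinct CAGE codes' -> 'number of DUNS codes'.

    `pairs` is an iterable of (duns, cage) tuples; duplicates are ignored.
    """
    cages_per_duns = defaultdict(set)
    for duns, cage in pairs:
        cages_per_duns[duns].add(cage)
    return Counter(len(cages) for cages in cages_per_duns.values())


# Made-up pairs, for illustration only.
distribution = cage_count_distribution([
    ("A", "1"), ("A", "2"), ("A", "2"),
    ("B", "1"),
    ("C", "3"),
])
```

Plotting the resulting counts on a log scale is what exposes outliers such as a single DUNS code tied to a very large number of CAGE codes.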

Page 44: Tech Lab Series - Episode II - Back to Normal

[Chart: CAGE Codes from LookUp File. Category: ToWAWF; y-axis: millions (0 to 1.4); series: Found, NotFound.]

Page 45: Tech Lab Series - Episode II - Back to Normal

[Chart: FPDS Number DUNS with N Global DUNS. X-axis: 0 to 5; y-axis: log scale, 0.1 to 1000000.]

[Chart: FPDS: Number DUNS with N Names. X-axis values: 1 to 112; y-axis: log scale, 0.1 to 100000. Annotation: 6849 instances for code = 123456787.]

Page 46: Tech Lab Series - Episode II - Back to Normal

[Chart: FPDS: Number Global DUNS with N DUNS. X-axis: Number DUNS (0 to 250); y-axis: Number Global DUNS (log scale, 0.1 to 10000).]

[Chart: FPDS: Global DUNS with Multiple Names. X-axis: Number Names (0 to 1400); y-axis: Number Global DUNS (log scale, 0.1 to 1000).]

Page 47: Tech Lab Series - Episode II - Back to Normal

[Chart: FPDS DUNS Code Matches to WAWF Codes. Categories: DUNS, GlobalDUNS; series: Found, NotFound; y-axis: 0 to 180000; data labels: 140827, 13302, 17363, 942.]

Page 48: Tech Lab Series - Episode II - Back to Normal

FPDS DUNS With Most Names

DUNS       NGlobalDUNS  Nnames
123456787  0            6849
136666505  0            112
790238851  0            96
103933453  1            35
103385519  1            33
005149120  1            27
067641597  1            25
005103494  0            24
332619535  0            24
020751082  1            22
054781240  1            22
621599893  1            21
790238638  0            21
834476079  1            21

DUNS       Name
123456787  miscellaneous foreign contractors
123456787  etisalat c/o us consulate general dubai
123456787  boswedden house
123456787  turner engine controls b. v.
123456787  swissport hellas cargo s a
123456787  orbit couriers sa
123456787  goldair aviation handling s.a.
123456787  federal egov iae initiative generic duns
123456787  federal egov iae initiative - generic duns
123456787  miscellaneous foreign contractorsan
123456787  prc-desoto
123456787  inversiones sochagota e.u.
123456787  comcel
123456787  transporte y servicio lucio
123456787  jesse james members only maxi taxi svc
123456787  club naval de oficiales
123456787  inchcape shipping services
123456787  dr. thalia abatzi
123456787  central asia development group
123456787  bennett-fouch and associates
123456787  noor al-sabah company
123456787  ait/arc infrasture solutions
123456787  not available
123456787  77 construction company

136666505  adese genc petrol
136666505  amy lily chung
136666505  anderson erin ruth
136666505  andrew william knef
136666505  anduaga-arias laura
136666505  angelica m. de la cruz
136666505  anthony o'brien, 330531-5100194
136666505  batac belle
136666505  bottesini beth ms.
136666505  bouck shannon
136666505  bunn amy b.
136666505  carlene clark
136666505  cho, boong haeng
136666505  choe, sun young
136666505  christina michajlyszyn
136666505  christopher cannon
136666505  christopher l. booth
136666505  chun, kil mo
136666505  conflict + transition consultancies
136666505  cozzone elaine
136666505  deborah p. carney
136666505  denihan patricia joann
136666505  dong sook mcgeorge, 690525-2716816
136666505  dorene d.lukewalton,pharm d.
136666505  dr. terry a. klein

Page 49: Tech Lab Series - Episode II - Back to Normal

FPDS Global DUNS with Most Names & DUNS

GlobalDUNS  NDUNS  Nnames
877936518   12     27299
624770475   212    21866
148095086   80     21754
027079776   2      17128
103933453   86     17075
026157235   4      15694
963737366   106    15200
134303192   19     14481
067641597   108    13998
064680213   102    13809
077652761   93     12914
002204600   15     12570
039860122   44     12382
805258373   130    11995

GlobalDUNS  NDUNS  Nnames
624770475   212    21866
805258373   130    11995
012003349   128    9748
877987347   127    8253
057272486   124    6935
007250079   123    9076
071767334   123    9474
158140041   117    6671
019710586   116    8163
091441089   116    7813
616924770   116    7217
067641597   108    13998

Page 50: Tech Lab Series - Episode II - Back to Normal

Prompted Collaboration and New Business Information

• Showing these results prompted discussions leading to:
  – There are generic DUNS heavily used but these are being removed from use via policy changes
  – System validation rules are not current with all policy
  – Additional "rules" of how to track, audit, align, merge spread by email
    • All put back into Data Normalization system and then into modified Java
• New results available over all data sets <1 day

Page 51: Tech Lab Series - Episode II - Back to Normal

ADDITIONAL INFORMATION

Page 52: Tech Lab Series - Episode II - Back to Normal

Impala
Justin Erickson | Director, Product Management
September 2014

Page 53: Tech Lab Series - Episode II - Back to Normal

Impala Architecture: Query Execution

• Request arrives via ODBC/JDBC/Hue GUI/Shell

[Diagram: SQL App (ODBC) issuing a SQL request; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase; Hive Metastore, HDFS NN, and Statestore.]

Page 54: Tech Lab Series - Episode II - Back to Normal

Impala Architecture: Query Execution

• Planner turns request into collections of plan fragments
• Coordinator initiates execution on impalads local to data

[Diagram: SQL App (ODBC); three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase; Hive Metastore, HDFS NN, and Statestore.]

Page 55: Tech Lab Series - Episode II - Back to Normal

Impala Architecture: Query Execution

• Intermediate results are streamed between impalads
• Query results are streamed back to client

[Diagram: query results returned to the SQL App (ODBC); three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase; Hive Metastore, HDFS NN, and Statestore.]

Page 56: Tech Lab Series - Episode II - Back to Normal

Try It Out!

• 100% Apache-licensed open source
• Downloads on http://impala.io/:
  – Live online
  – VM
  – Installation
• Questions/comments?
  – Community: http://impala.io/community
  – Email: impala-[email protected]

Page 57: Tech Lab Series - Episode II - Back to Normal
