Tech Lab Series - Episode II - Back to Normal
![Page 1: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/1.jpg)
Grab some coffee and enjoy the pre-show banter before the top of the hour!
![Page 2: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/2.jpg)
Episode 2: Back to Normal Tech Lab Webcast | September 24, 2014
Sponsored by
![Page 3: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/3.jpg)
What Is the Tech Lab?
• Real-world proving ground for enterprise software
• Designed to showcase the process of creating solutions
• Completely independent of sponsor influence
• Run by Master Scientist, Dr. Geoffrey Malafsky
• Projects span 3-6 months
![Page 4: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/4.jpg)
What Is Data Normalization?
• Data Normalization is a process by which disparate data sets, terms, models and ontologies can be reconciled for the purpose of providing certifiably accurate enterprise data.
![Page 5: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/5.jpg)
Why Is Normalization Necessary?
• Disparate Data Systems
• Disparate File Structures
• Disparate Data Models
• Variable Business Logic
• Conflicting Data Values
• Serious Semantic Issues
![Page 6: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/6.jpg)
How Hadoop Can Help
• Robust platform for data persistence
• Relatively easy to connect to enterprise apps
• Enables 'future-proofing' by avoiding lock-in
• Growing array of parallel processing functions
• New standard for data management
• No need to delete data, enabling roll-back
![Page 7: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/7.jpg)
Questions?
![Page 8: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/8.jpg)
Thank you!
FIND THE ARCHIVE AT InsideAnalysis.com
![Page 9: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/9.jpg)
DATA SCIENCE AND HADOOP TO NORMALIZE CORPORATE DATA
![Page 10: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/10.jpg)
• Normalizing data is more sophisticated than what is commonly done in integration.
• It combines subject matter knowledge, governance, business rules, and raw data.
• Small Data is "corporate structured data that is the fuel of its main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence."
![Page 11: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/11.jpg)
The State of Corporate Data
multiple instances of source data
multiple definitions for reporting
multiple copies of data
variable structures
different data values
hidden conflicts in data definitions
which source to use
different model types & standards
more storage, esp. when multiplied by environments
more data flows to develop and maintain
more than 100 DW or data marts downstream
different methods for ETL
complex dependencies, difficult for impact assessment
conflicting business logic & views
global analyses & aggregations restricted by inconsistencies
Copyright PSIKORS Institute 2013
![Page 12: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/12.jpg)
Copyright PSIKORS Institute 2014
![Page 13: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/13.jpg)
Data Normalization Showcase
• FPDS is an open source of Federal Procurement data that has poor quality and consistency.
  – Approx. 10M+ records, each with 306 columns = 25 GB raw text
  – Structured data except for some free text fields
• We are normalizing it for analysis of IT expenditures for a real client.
• Queries are used by analysts supported by Hadoop environment via Data Normalization platform.
![Page 14: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/14.jpg)
Normalization Begins with Understanding Data
• Databases are supposed to have official information on formal acquisition of IT assets.
  – Contracts DB not aligned with Procurement DB
    • Example: FA330012Dxxx in one but not the other
• Differing data sets and values
  – FA330012F0005: Same in both
  – FA330012P0020: Contracts DB: 10 items; FPDS: 1 item; same description, same total dollars
  – HQ042312*: Contracts 6 = $278.4K, FPDS 1 = $48K
    • $48K is one of 6 records in Contracts
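The comparison above (same key in both sources, same total but different item counts, different totals, key missing from one source) can be sketched in Python. This is a minimal illustration, not the Tech Lab's tooling: the function names, the row layout of (contract number, amount), and the sample rows are all assumptions made up for the example.

```python
from collections import defaultdict


def summarize(records):
    """Collapse (contract_number, amount) rows into {number: (item_count, total)}."""
    grouped = defaultdict(lambda: [0, 0.0])
    for number, amount in records:
        grouped[number][0] += 1
        grouped[number][1] += amount
    return {number: (count, total) for number, (count, total) in grouped.items()}


def reconcile(contracts_rows, fpds_rows):
    """Classify every contract number seen in either source."""
    contracts, fpds = summarize(contracts_rows), summarize(fpds_rows)
    report = {}
    for number in sorted(set(contracts) | set(fpds)):
        left, right = contracts.get(number), fpds.get(number)
        if left is None or right is None:
            report[number] = "in one source only"
            continue
        same_total = abs(left[1] - right[1]) < 0.005  # compare to the cent
        if same_total and left[0] == right[0]:
            report[number] = "same in both"
        elif same_total:
            report[number] = "same total, different item count"
        else:
            report[number] = "different totals"
    return report


# Made-up rows for illustration only; these are not FPDS values.
contracts_rows = [("K1", 100.0), ("K2", 30.0), ("K2", 70.0),
                  ("K3", 48.0), ("K3", 52.0), ("K4", 10.0)]
fpds_rows = [("K1", 100.0), ("K2", 100.0), ("K3", 48.0)]
report = reconcile(contracts_rows, fpds_rows)
```

A report of this kind only surfaces the mismatches; deciding which source is authoritative still needs subject matter knowledge.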
![Page 15: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/15.jpg)
![Page 16: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/16.jpg)
Converting supposedly same primary keys into normalized values that can be compared: contract number
Key business logic as buried in a database stored procedure (condensed):
• If (DELIVERY_ORDER=NULL) v_piid = CONTRACT else v_piid = DELIVERY_ORDER
• If (x1='0') v_modification_number = '0' else v_modification_number = x2
  – where x1: if (ACO_MOD=NULL) x1 = x3 else x1 = ACO_MOD
  – where x3: if (PCO_MOD=NULL) x3 = '0' else x3 = PCO_MOD
  – where x2: if (x4=NULL) x2 = '0' else x2 = x4
  – where x4: x4 = LTRIM(x5)
  – where x5: x5 = x1
  – essentially this first tries to use ACO_MOD, and if this is NULL then it tries to use PCO_MOD, and sets '0' if both are NULL
• If (DELIVERY_ORDER=NULL) v_idv_piid = y1 else v_idv_piid = CONTRACT
  – where y1: y1 = REF_PROC_INSTRUMENT with all '-' characters removed
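The condensed rules above can be written out as a small Python function to make the precedence explicit. This is a sketch under assumptions: the dictionary-based row, the function names, and the choice to treat an empty string like SQL NULL are not from the source, and the actual stored procedure is not shown in the slides.

```python
def _blank_to_none(value):
    """Assumption: an empty string is treated like SQL NULL."""
    return None if value is None or value == "" else value


def normalize_contract_key(row):
    """Derive v_piid, v_modification_number and v_idv_piid from one raw row."""
    delivery_order = _blank_to_none(row.get("DELIVERY_ORDER"))
    contract = _blank_to_none(row.get("CONTRACT"))
    aco_mod = _blank_to_none(row.get("ACO_MOD"))
    pco_mod = _blank_to_none(row.get("PCO_MOD"))
    ref_instrument = _blank_to_none(row.get("REF_PROC_INSTRUMENT"))

    # v_piid: prefer the delivery order, fall back to the contract.
    v_piid = contract if delivery_order is None else delivery_order

    # Modification number: ACO_MOD first, then PCO_MOD, then '0'.
    x3 = "0" if pco_mod is None else pco_mod
    x1 = x3 if aco_mod is None else aco_mod
    x4 = _blank_to_none(x1.lstrip(" "))  # LTRIM removes leading spaces only
    x2 = "0" if x4 is None else x4
    v_modification_number = "0" if x1 == "0" else x2

    # v_idv_piid: referenced instrument without '-' when there is no delivery order.
    if delivery_order is None:
        v_idv_piid = None if ref_instrument is None else ref_instrument.replace("-", "")
    else:
        v_idv_piid = contract

    return {
        "v_piid": v_piid,
        "v_modification_number": v_modification_number,
        "v_idv_piid": v_idv_piid,
    }


# Made-up row for illustration only.
example = normalize_contract_key({
    "CONTRACT": "C1", "DELIVERY_ORDER": "D7",
    "ACO_MOD": None, "PCO_MOD": "  P0002",
})
```

Keeping the rule in one visible function, rather than in a stored procedure, makes it possible to review and change it alongside the other business rules.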
![Page 17: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/17.jpg)
SQL Queries via Hue: Impala
![Page 18: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/18.jpg)
SQL Queries via Hue: Hive
![Page 19: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/19.jpg)
Querying Impala From Data Normalization System
![Page 20: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/20.jpg)
Simplifying Queries and Tying to Authoritative Management
![Page 21: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/21.jpg)
Storing Term Rules in Master Codes
Note wildcard character (*) in middle as well as front and back
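One way a stored term rule with `*` wildcards could be used is to translate it into a SQL LIKE pattern for the query engine and into a regular expression for local checks. The sketch below is illustrative only: the function names are hypothetical, the example terms are made up, and backslash escaping of literal `%` and `_` is an assumption that should be checked against the target engine.

```python
import re


def wildcard_to_like(term):
    """Translate a '*' wildcard term into a SQL LIKE pattern."""
    out = []
    for ch in term:
        if ch == "*":
            out.append("%")
        elif ch in ("%", "_", "\\"):
            out.append("\\" + ch)  # assumed escape convention
        else:
            out.append(ch)
    return "".join(out)


def wildcard_to_regex(term):
    """Translate the same term into a case-insensitive regular expression."""
    parts = [re.escape(part) for part in term.split("*")]
    return re.compile(".*".join(parts), re.IGNORECASE | re.DOTALL)


def matches(term, text):
    """True when the whole text matches the wildcard term."""
    return wildcard_to_regex(term).fullmatch(text) is not None


# A wildcard at the front, in the middle, and at the back of one term.
like_pattern = wildcard_to_like("*cloud*comput*")
hit = matches("*cloud*comput*", "Private Cloud Computing services")
```

Storing the term once and generating both forms keeps the rule in a single place, which is the point of holding term rules in master codes.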
![Page 22: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/22.jpg)
Complicated Queries Are Often Needed
Looking for a combination of keywords with wildcards along with structured values
SELECT recordid,contractingagencyid,contractingagencyname,orgcode,orgid,modificationnumber,piid,piidagencyid,solicitationid,effectivedate,fiscalyear,fundingagencyid,fundingagencyname,typeofcontract,consolidatedcontractdesc,descofreq,naicscode,naicsdesc,productorservicecode,productorservicedesc,globaldunsnumber,dunsnumber,globalvendorname,vendorname,datesigned,referencedidvpiid,referencedidvagencyid,referencedidvmodnumber,contractingdepartmentid,contractingdepartmentname,contractingofficeid,contractingofficename,contractingofficeregion,funcdimenddate,funcdimstartdate,function1,function1value,function2,function2value,function3,function3value,majorcommandcode,majorcommandid,majorcommandname,parentmacomcode,primarydimensionid,primarydimensionvalueid,secondarydimensionid,secondarydimensionvalueid,subcommand1code,subcommand1id,subcommand1name,subcommand2code,subcommand2id,subcommand2name,subcommand3code,subcommand3id,subcommand3name,subcommand4code,subcommand4id,subcommand4name,tertiarydimensionid,tertiarydimensionvalueid,transactionnumber,lastdatetoorder,completiondate,estultimatecompletiondate,signeddate,fundingofficeid,fundingofficename,isfundedforeignentitycode,isfundedforeignentitydesc,reasoninteragencycontracting,feeforuseofservice,fixed,lowervalue,maximumorderlimit,orderingprocedure,uppervalue,websiteurl,whocanuse,feepaidforuseofidv,programacronym,typeofidc,a76actioncode,a76actiondesc,contingencyhumanitarianpeaceop,contractfinancing,costacctstdclausecode,costacctstdclausedesc,costorpricingdata,emailaddress,gfegfpcode,gfegfpdesc,inherentlygovernmentaldesc,inherentlygovernmentalfunction,lettercontractundefactioncode,lettercontractundefactiondesc,majorprogram,multipleorsingleawardidv,multiyearcontractcode,multiyearcontractdesc,nationalinterestaction,nationalinterestdesc,numberofactions,performancebasedserviceacqcode,performancebasedserviceacqdesc,purchasecardpaymethodcode,purchasecardpaymethoddesc,seatransportation,subcontractplan,treasuryacctsymbolagencyid,treasuryacctsymbolinitiative,treasuryacctsymbolmaincode,
treasuryacctsymbolsubcode,clingercohenactcode,clingercohenactdesc,davisbaconactcode,davisbaconactdesc,economyact,interagencycontractingauthcode,interagencycontractingauthdesc,otherstatutoryauthdesc,servicecontractactdesc,servicecontractactcode,walshhealeyactcode,walshhealeyactdesc,bundledreqs,claimantprogramcode,consolidatedcontractcode,domesticorforeignentitycode,domesticorforeignentitydesc,infotechcommercialitemcategory,recoveredmaterialssustain,recoveredmaterialssustaindesc,systemequipmentcode,useofepadesignatedproducts,congrdistrictplaceofperf,placeofperfzipcode,princplaceofperfcityname,princplaceofperfcountrycode,princplaceofperfcountryname,princplaceofperfcountycode,princplaceofperfcountyname,princplaceofperflocationcode,princplaceofperfstatecode,countryprodserviceorigincode,placeofmanufacture,placeofmanufacturedesc,alternativeadvertising,commercialitemacqperoccode,commercialitemacqperocdesc,commercialitemtestprogram,commercialitemtestprogramdesc,evaluatedpreference,extentcompeted,fairopportunitylimitedsources,fedbizoppscode,fedbizoppsdesc,localareasetasidecode,localareasetasidedesc,numberofoffersreceived,otherthanfullopencompetition,preawardyosynopsis,priceevaluationpercentdiff,sbaorofppsynopsiswaiverpilot,sbirsttr,smallbuscompdemoprog,solicitationperoc,typeofsetaside,awardoridvtype,createdvia,lastmodifiedby,lastmodifieddate,part8orpart13,preparedby,prepareddate,reasonformodificationcode,reasonformodificationdesc,congrdistrictcontractor,contractorname,doingbusasname,samexception,street,street2,vendorcity,vendorcountry,vendorphonenumber,vendorstate,zip,is1862landgrantcollege,is1890landgrantcollege,is1994landgrantcollege,isairportauth,isalaskannativecorpownedfirm,isalaskannativeservicinginst,isamericanindianowned,isasianpacificamericanowned,isblackamericanowned,isbothcontractsandgrants,iscity,iscommdevelopedcorpownedfirm,iscommdevelopmentcorp,iscontracts,iscorporateentitynottaxexempt,iscorporateentitytaxexempt,iscouncilofgovernments,iscountryofincorporation,iscounty,isdomesticshelter,
isdotcertdisbusent,iseducationalinst,isemergingsmallbus,isfederalagency,isfedfundedresanddevcorp,isforprofitorg,isforeigngovernment,isforeignownedandlocated,isfoundation,isgrants,ishispanicamericanowned,ishispanicservicinginst,isvendorhbcu,ishospital,ishousingauthpublictribal,isindiantribe,isintermunicipal,isinternationalorg,isinterstateentity,islaborsurplusareafirm,islimitedliabilitycorp,islocalgovernmentowned,ismanufacturerofgoods,isminorityinsts,isminorityownedbus,ismunicipality,isnativeamericanowned,isnativehawaiianorgownedfirm,isnativehawaiianservicinginst,isnonprofitorg,isotherminorityowned,isothernotforprofitorg,ispartnershipllp,isplanningcommission,isportauth,isprivateuniversityorcollege,issbacert8ajointventure,issbacert8aprogparticipant,issbacerthubzonefirm,issbacertsmalldisbus,isschooldistrict,isschoolofforestry,isselfcertifedsmalldisbus,isservicedisabledvetownedbus,issmallagriculturalcooperative,issoleproprietorship,isstatecontrinsthigherlearn,isstateofincorporation,issubchapterscorp,issubcontasianindianamerowned,istheabilityoneprog,istownship,istransitauth,istribalcollege,istriballyowned,isusfederalgovernment,isusgovernmententity,isuslocalgovernment,isusstategovernment,isveteranownedbus,isveterinarycollege,isveterinaryhospital,iswomanownedbus,istypeecondiswosb,istypejventecondiswosb,istypejventwosb,istypewosb,contractingo{ussizeselection,reasonnotawardedtosmallbus,reasonnotawardedtosmalldisbus,idvbundledreqs,idvcontractingagencyid,idvcontractingagencyname,idvcontractingo{ussizesel,idvdepartmentid,idvdepartmentname,idvmajorprogcode,idvmultipleorsingleawardidv,idvnaicscode,idvnaicsdesc,idvpart8orpart13,idvprogacronym,idvreferencedidvagencycode,idvreferencedidvpiid,idvsubcontractplan,idvsubcontractplandesc,idvtypeofcontractpricing,idvtypeofcontractpricingdesc,idvtypeofidc,idvtypeofidcdesc,idvwhocanuse,idvwhocanusedesc,missing301,currentcontractvalue,actionobligation,ultimatecontractvalue
FROM fpdsrawrecords.records
WHERE ( ( ( LOWER(fundingagencyid) = '97as' ) ) AND ( ( LOWER(fiscalyear) = '2013' ) ) AND ( ( LOWER(productorservicecode) LIKE '70%' OR LOWER(productorservicecode) LIKE 'd3%' ) ) )
LIMIT 1000
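A WHERE clause of the shape shown in this query (exact matches on structured values combined with prefix wildcards) can be generated from a short rule description instead of being typed by hand. The following is a hedged sketch: `build_where` and `sql_literal` are hypothetical helpers, not part of the Data Normalization platform, and a production system would normally prefer parameterized queries where the driver supports them.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def sql_literal(value):
    """Lower-case a value and quote it as a SQL string literal."""
    return "'" + value.lower().replace("'", "''") + "'"


def _check_column(column):
    """Reject anything that is not a plain column name."""
    if not _IDENTIFIER.fullmatch(column):
        raise ValueError(f"not a plain column name: {column!r}")
    return column


def build_where(equals, prefixes):
    """Build a case-insensitive WHERE clause.

    equals:   {column: exact value}
    prefixes: {column: [value prefixes, OR-ed together]}
    """
    clauses = []
    for column, value in equals.items():
        clauses.append(f"( LOWER({_check_column(column)}) = {sql_literal(value)} )")
    for column, values in prefixes.items():
        column = _check_column(column)
        alternatives = " OR ".join(
            f"LOWER({column}) LIKE {sql_literal(value + '%')}" for value in values
        )
        clauses.append(f"( {alternatives} )")
    return " AND ".join(clauses)


where = build_where(
    {"fundingagencyid": "97AS", "fiscalyear": "2013"},
    {"productorservicecode": ["70", "D3"]},
)
```

The generated text reproduces the three conditions of the query above with fewer nested parentheses.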
![Page 23: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/23.jpg)
Query Timing
• Looking for combinations of text tokens (with wildcards) together with known field values
• Queries are done both in Data Normalization platform and by command line interface on Hadoop server for Impala and Hive. Time differences are negligible, but all times reported here are by CLI.
  – Tables made for: text, Parquet, Parquet partitioned by 'fiscalyear' (6 values) and 'fundingagencyid' (approx. 25 values)
![Page 24: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/24.jpg)
Evaluating query performance in Hadoop relative to format and comparing to RDBMS
[Chart: FPDS Hadoop Query Times, Text Field (secs). Engines: Hive, Impala, SQLServer. Series: Text, Parquet, Parquet Partitioned. Axis range: 0 to 400.]
![Page 25: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/25.jpg)
[Chart: FPDS Text Queries per Limit (secs). Categories: 100 LIMIT, 1000 LIMIT, NO LIMIT. Series: Hive Text, Impala Text, Hive Parquet, Impala Parquet, Hive Parquet Part, Impala Parquet Part. Axis range: 0 to 250.]
![Page 26: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/26.jpg)
QUERY PERFORMANCE IMPROVEMENT WITH IMPALA
Justin Erickson | Director, Product Management, Cloudera
![Page 27: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/27.jpg)
Impala's Benefits
• Unlocks BI/analytics on Hadoop
  – Interactive SQL in seconds
  – Highly concurrent to handle 100s of users
• Native Hadoop flexibility
  – No data migration, conversion, or duplication required
  – Query existing Hadoop data
  – Run multiple frameworks on the same data at the same time
  – Supports Parquet for best-of-breed columnar performance
• Native MPP query engine designed into Hadoop:
  – Unified Hadoop storage
  – Unified Hadoop metadata (uses Hive and HCatalog)
  – Unified Hadoop security
  – Fine-grained role-based access controls with Sentry
• Apache-licensed open source
• Deployed across customers today
©2014 Cloudera, Inc. All Rights Reserved.
![Page 28: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/28.jpg)
Impala Architecture
• MPP query engine built natively into Hadoop
[Diagram: a SQL App sends a SQL request over ODBC to one of three nodes. Each node runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore.]
![Page 29: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/29.jpg)
Impala's Multi-User over 9.5x Faster
![Page 30: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/30.jpg)
Multi-user hardware utilization
![Page 31: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/31.jpg)
Performance Takeaways
• Impala's advantage expands with just 10 users to >9.5x nearest competitor
  – Predominantly attributable to CPU efficiency
• Does not particularly matter which DAG is run for Hive
  – Shark (with Spark) and Tez produce very similar results
  – Both incrementally faster batch processing but not comparable to MPP databases
  – Difference is Spark is already proven with broad community and vendor adoption
• Mid-term trends will further favor Impala's design approach
  – More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
  – CPU efficiency will increase in importance
  – Native code enables easy optimizations for CPU instruction sets (e.g. floating point operations, math operations, encrypt/decrypt)
  – The Intel joint roadmap helps support these opportunities
• Upcoming benchmarks on the latest releases show this gap widening
![Page 32: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/32.jpg)
NORMALIZING THE DATA
![Page 33: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/33.jpg)
![Page 34: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/34.jpg)
Capture Business Rules and Make Visible, Changeable, and Useful
![Page 35: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/35.jpg)
![Page 36: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/36.jpg)
Custom Multi-Use Normalization Methods Ready for Hadoop Parallel Execution
![Page 37: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/37.jpg)
Data Normalization Library Enables Rapid Build, Deploy, Change Cycles
![Page 38: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/38.jpg)
Special Programming for Hadoop
• Which Hadoop libraries? Intertwined, so reference all.
• Otherwise: not much
  – HDFS filesystem
  – YARN containers
![Page 39: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/39.jpg)
![Page 40: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/40.jpg)
![Page 41: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/41.jpg)
Parallel Jobs
• Three ways to run parallel jobs
  – Launch multiple Java sessions from command line
    • Same as in Windows, Linux
  – Use Cloudera Hue Job Designer
    • Easy and has management web pages
  – Data Normalization system
    • Coordinates governance, architecture, data models, codes, business rules
    • Define, submit YARN containers specifying Java jar, dictionaries, source files
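The first option in the list above, launching several sessions from the command line, can be scripted. The Python sketch below is an assumption-laden illustration: the jar name and argument layout in `build_command` are hypothetical, and the slides do not show how the Tech Lab actually launched its jobs.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(jar, source_file):
    """Hypothetical command line: one Java session per source file."""
    return ["java", "-jar", jar, source_file]


def run_all(commands, max_workers=4):
    """Run the commands concurrently and return their exit codes in order."""
    def run(command):
        completed = subprocess.run(command, capture_output=True, text=True)
        return completed.returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


# Example (not executed here): one session per input split.
# exit_codes = run_all([build_command("normalize.jar", f"part-{i:04d}.txt") for i in range(4)])
```

Threads are enough here because each worker only waits on an external process; the heavy lifting happens in the launched sessions.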
![Page 42: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/42.jpg)
Key Code Analysis
  – Invoice data sets extracted with correlation
    • CAGE: 984274, DUNS: 973437
  – FPDS DUNS and Names extracted & correlated
    • 158181 unique DUNS codes
  – Will be included in normalized composite IT Asset records
  – Composite records for lookup added to Hadoop
    • By DUNS or Global DUNS: get all related DUNS, CAGE, names
    • By CAGE: get all related DUNS, names
    • By name: get all related DUNS, CAGE, names
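The composite lookup described above can be pictured as one index keyed by DUNS, CAGE, or name, where each entry collects every code and name seen together with that key. This is a minimal in-memory sketch, not the Hadoop implementation: the record fields, the lower-casing of names, and the one-hop grouping (no transitive merging) are assumptions.

```python
from collections import defaultdict


def build_lookup(records):
    """Index records by ('duns'|'cage'|'name', value) -> related codes and names."""
    lookup = defaultdict(lambda: {"duns": set(), "cage": set(), "names": set()})
    for record in records:
        # DUNS and Global DUNS share one key space, as in the slide.
        duns = {d for d in (record.get("duns"), record.get("global_duns")) if d}
        cage = {record["cage"]} if record.get("cage") else set()
        names = {record["name"].strip().lower()} if record.get("name") else set()

        keys = ([("duns", d) for d in duns]
                + [("cage", c) for c in cage]
                + [("name", n) for n in names])
        for key in keys:
            entry = lookup[key]
            entry["duns"] |= duns
            entry["cage"] |= cage
            entry["names"] |= names
    return dict(lookup)


# Made-up records for illustration only.
lookup = build_lookup([
    {"duns": "D1", "global_duns": "G1", "cage": "C1", "name": "Acme Corp"},
    {"duns": "D2", "global_duns": "G1", "name": "ACME Corporation"},
    {"duns": "D1", "cage": "C2", "name": "Acme Corp"},
])
by_global = lookup[("duns", "G1")]
```

Looking up `("cage", "C2")` or `("name", "acme corp")` returns the same kind of entry, which mirrors the three access paths listed on the slide.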
![Page 43: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/43.jpg)
Number CAGE Per DUNS Code
[Chart: number of DUNS codes (log scale, 0.1 to 1,000,000) with X CAGE codes, for X from 1 to 119. Annotation: one DUNS code has 119 CAGE codes.]
![Page 44: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/44.jpg)
CAGE Codes from LookUp File
[Chart: category ToWAWF, counts in millions (0 to 1.4). Series: Found, NotFound.]
![Page 45: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/45.jpg)
[Chart: FPDS Number DUNS with N Global DUNS. N from 0 to 5; count on a log scale, 0.1 to 1,000,000.]
[Chart: FPDS: Number DUNS with N Names. N from 1 to 112; count on a log scale, 0.1 to 100,000. Annotation: 6849 instances for code = 123456787.]
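Distributions such as "number of DUNS with N names" come from two counting steps: count distinct names per DUNS, then count how many DUNS share each count. The Python sketch below shows the idea on made-up pairs; the function names and the threshold used to flag heavily reused codes are assumptions, not values from the Tech Lab.

```python
from collections import Counter, defaultdict


def names_per_duns(pairs):
    """Count distinct names for each DUNS from (duns, name) pairs."""
    names = defaultdict(set)
    for duns, name in pairs:
        names[duns].add(name)
    return {duns: len(found) for duns, found in names.items()}


def distribution(counts):
    """Map N -> number of DUNS codes that have exactly N names."""
    return dict(sorted(Counter(counts.values()).items()))


def heavily_reused(counts, threshold):
    """DUNS codes with at least `threshold` names (candidate generic codes)."""
    return sorted(duns for duns, n in counts.items() if n >= threshold)


# Made-up pairs for illustration only.
pairs = [("A", "x"), ("A", "y"), ("A", "x"), ("B", "x"), ("C", "z")]
counts = names_per_duns(pairs)
histogram = distribution(counts)
suspects = heavily_reused(counts, threshold=2)
```

The long tail of such a histogram is where codes like the one with 6849 names show up.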
![Page 46: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/46.jpg)
[Chart: FPDS: Number Global DUNS with N DUNS. X axis: Number DUNS (0 to 250). Y axis: Number Global DUNS (log scale, 0.1 to 10,000).]
[Chart: FPDS: Global DUNS with Multiple Names. X axis: Number Names (0 to 1400). Y axis: Number Global DUNS (log scale, 0.1 to 1,000).]
![Page 47: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/47.jpg)
FPDS DUNS Code Matches to WAWF Codes
[Chart: categories DUNS and GlobalDUNS; series Found and NotFound; data labels as extracted: 140827, 13302, 17363, 942; axis range 0 to 180,000.]
![Page 48: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/48.jpg)
FPDS DUNS With Most Names
DUNS       NGlobalDUNS  Nnames
123456787  0            6849
136666505  0            112
790238851  0            96
103933453  1            35
103385519  1            33
005149120  1            27
067641597  1            25
005103494  0            24
332619535  0            24
020751082  1            22
054781240  1            22
621599893  1            21
790238638  0            21
834476079  1            21
123456787 miscellaneous foreign contractors
123456787 etisalat c/o us consulate general dubai
123456787 boswedden house
123456787 turner engine controls b. v.
123456787 swissport hellas cargo s a
123456787 orbit couriers sa
123456787 goldair aviation handling s.a.
123456787 federal egov iae initiative generic duns
123456787 federal egov iae initiative - generic duns
123456787 miscellaneous foreign contractorsan
123456787 prc-desoto
123456787 inversiones sochagota e.u.
123456787 comcel
123456787 transporte y servicio lucio
123456787 jesse james members only maxi taxi svc
123456787 club naval de oficiales
123456787 inchcape shipping services
123456787 dr. thalia abatzi
123456787 central asia development group
123456787 bennett-fouch and associates
123456787 noor al-sabah company
123456787 ait/arc infrasture solutions
123456787 not available
123456787 77 construction company
136666505 adese genc petrol 136666505 amy lily chung 136666505 anderson erin ruth 136666505 andrew william knef 136666505 anduaga-‐arias laura 136666505 angelica m. de la cruz 136666505 anthony o'brien, 330531-‐5100194 136666505 batac belle 136666505 boaesini beth ms. 136666505 bouck shannon 136666505 bunn amy b. 136666505 carlene clark 136666505 cho, boong haeng 136666505 choe, sun young 136666505 chrisEna michajlyszyn 136666505 christopher cannon 136666505 christopher l. booth 136666505 chun, kil mo 136666505 conflict + transiEon consultancies 136666505 cozzone elaine 136666505 deborah p. carney 136666505 denihan patricia joann 136666505 dong sook mcgeorge, 690525-‐2716816 136666505 dorene d.lukewalton,pharm d. 136666505 dr. terry a. klein
![Page 49: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/49.jpg)
FPDS Global DUNS with Most Names & DUNS
Ordered by Nnames:
GlobalDUNS  NDUNS  Nnames
877936518   12     27299
624770475   212    21866
148095086   80     21754
027079776   2      17128
103933453   86     17075
026157235   4      15694
963737366   106    15200
134303192   19     14481
067641597   108    13998
064680213   102    13809
077652761   93     12914
002204600   15     12570
039860122   44     12382
805258373   130    11995
Ordered by NDUNS:
GlobalDUNS  NDUNS  Nnames
624770475   212    21866
805258373   130    11995
012003349   128    9748
877987347   127    8253
057272486   124    6935
007250079   123    9076
071767334   123    9474
158140041   117    6671
019710586   116    8163
091441089   116    7813
616924770   116    7217
067641597   108    13998
![Page 50: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/50.jpg)
Prompted Collaboration and New Business Information
• Showing these results prompted discussions leading to:
  – There are generic DUNS heavily used, but these are being removed from use via policy changes
  – System validation rules are not current with all policy
  – Additional "rules" of how to track, audit, align, merge spread by email
    • All put back into Data Normalization system and then into modified Java
• New results available over all data sets in <1 day
![Page 51: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/51.jpg)
ADDITIONAL INFORMATION
![Page 52: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/52.jpg)
Impala
Justin Erickson | Director, Product Management
September 2014
![Page 53: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/53.jpg)
Impala Architecture: Query Execution
• Request arrives via ODBC/JDBC/Hue GUI/Shell
[Diagram: a SQL App sends a SQL request over ODBC to one of three nodes. Each node runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore.]
![Page 54: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/54.jpg)
Impala Architecture: Query Execution
• Planner turns request into collections of plan fragments
• Coordinator initiates execution on impalads local to data
[Diagram: SQL App connected over ODBC to three nodes, each with a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore.]
![Page 55: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/55.jpg)
Impala Architecture: Query Execution
• Intermediate results are streamed between impalads
• Query results are streamed back to client
[Diagram: query results flow back to the SQL App over ODBC from three nodes, each with a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore.]
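The execution flow on these three slides (plan fragments, execution local to the data, results merged and returned) follows a scatter-gather pattern. The toy Python model below only illustrates that pattern: it is not Impala code, the node names and rows are made up, and threads stand in for daemons running on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fragment(rows, predicate):
    """Executor step: scan the rows held on one node and return a partial count."""
    return sum(1 for row in rows if predicate(row))


def coordinate_count(rows_by_node, predicate):
    """Coordinator step: one fragment per node, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(
            lambda rows: run_fragment(rows, predicate), rows_by_node.values()
        )
        return sum(partials)


# Made-up data placement for illustration only.
rows_by_node = {
    "node1": [{"fiscalyear": "2013"}, {"fiscalyear": "2012"}],
    "node2": [{"fiscalyear": "2013"}],
    "node3": [],
}
total = coordinate_count(rows_by_node, lambda row: row["fiscalyear"] == "2013")
```

Only the small partial counts travel back to the coordinator, which is the reason for running fragments next to the data.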
![Page 56: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/56.jpg)
Try It Out!
• 100% Apache-licensed open source
• Downloads on http://impala.io/:
  – Live online
  – VM
  – Installation
• Questions/comments?
  – Community: http://impala.io/community
  – Email: impala-[email protected]
![Page 57: Tech Lab Series - Episode II - Back to Normal](https://reader030.fdocuments.us/reader030/viewer/2022021322/55a78b761a28ab1a6e8b4681/html5/thumbnails/57.jpg)