Teradata Big Data London Seminar

UNIFIED DATA ARCHITECTURE Chris Hillman Teradata Principal Data Scientist

description

Unified Data Architecture - Teradata presentation on the topic of Big Data and Apache Hadoop.

Transcript of Teradata Big Data London Seminar

Page 1: Teradata Big Data London Seminar

UNIFIED DATA ARCHITECTURE

Chris Hillman Teradata Principal Data Scientist

Page 2: Teradata Big Data London Seminar

4/23/12 Teradata Confidential

Need for a Unified Data Architecture for New Insights

Enabling Any User for Any Data Type from Data Capture to Analysis

Java, C/C++, Python, R, SAS, SQL, Excel, BI, Visualization

Discover and Explore

Reporting and Execution in the Enterprise

Capture, Store and Refine

Audio/Video

Images, Docs, Text, Web & Social

Machine Logs

CRM SCM ERP

Page 3: Teradata Big Data London Seminar

Confidential and proprietary. Copyright © 2012 Teradata Corporation.

AUDIO & VIDEO IMAGES TEXT WEB & SOCIAL MACHINE LOGS CRM SCM ERP

DISCOVERY PLATFORM

INTEGRATED DATA WAREHOUSE

UNIFIED DATA ARCHITECTURE

LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS

Engineers

Data Scientists

Business Analysts

Front-Line Workers

Customers / Partners

Quants

Operational Systems

Executives

Page 4: Teradata Big Data London Seminar

• Single View of Your Business

• Cross-Functional Analysis

• Shared Source for Analytics

• Load Once, Use Many Times

• Highest Business Value

• Lowest Total Cost of Ownership

• Fastest Time-to-Market For New Apps

Requirements for an Integrated Data Warehouse

Business Analysts

Knowledge Workers

DATA MINING BUSINESS INTELLIGENCE APPLICATIONS

Customers/Partners

Marketing

Executives

Front-line Workers

Operational Systems

INTEGRATED DATA WAREHOUSE

Page 5: Teradata Big Data London Seminar

Requirements of a Discovery Platform

DATA SOURCES: Structured Data, Multi-Structured Data, Non-Relational Data, OLTP DBMSs

DISCOVERY: Discovery Platform

DISCOVERY TOOLS: SQL, MapReduce, Statistical Functions

USERS: Data Scientist, Business Analyst

• Structured and multi-structured data

• Doesn’t require extensive data modeling

• Doesn’t balance the books

• Data completeness can be good enough

• No stringent SLAs

• Fraud patterns

• Customer behavior

• Digital marketing optimization

• Supply chain and supply line sensors

Page 6: Teradata Big Data London Seminar

AUDIO & VIDEO IMAGES TEXT WEB & SOCIAL MACHINE LOGS CRM SCM ERP

DISCOVERY PLATFORM

CAPTURE | STORE | REFINE

INTEGRATED DATA WAREHOUSE

UNIFIED DATA ARCHITECTURE

Big Data Analytics

Big Data Management

LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS

Engineers

Data Scientists

Business Analysts

Front-Line Workers

Customers / Partners

Quants

Operational Systems

Executives

Page 7: Teradata Big Data London Seminar

E-MAIL STORE SVP SURVEY ON-LINE BRANCH DATA CALL CENTER ATM PROFILE

Golden Path Application Submit

Fraud Sentiment Analysis

Multi-Channel Customer Behavior

Channel Hopping

Attrition Paths

Fraudulent Paths

Digital Marketing Attribution

Productionize

Analytic Score with Path Variable

Event Triggers

Marketing Integration

Customer Behavior Analysis

MySpending Report

Customer Segmentation

Credit Risk Analysis

Customer Profitability

Portfolio Analysis

DISCOVERY PLATFORM

CAPTURE | STORE | REFINE

INTEGRATED DATA WAREHOUSE

TERADATA UNIFIED DATA ARCHITECTURE

LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS

Engineers

Data Scientists

Business Analysts

Front-Line Workers

Customers / Partners

Quants

Operational Systems

Executives

Consumerization

Sessionization

Cross Platform Aggregation

Page 8: Teradata Big Data London Seminar

AUDIO & VIDEO IMAGES TEXT WEB & SOCIAL MACHINE LOGS CRM SCM ERP

DISCOVERY PLATFORM

CAPTURE | STORE | REFINE

INTEGRATED DATA WAREHOUSE

LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS

Engineers

Data Scientists

Business Analysts

Front-Line Workers

Customers / Partners

Quants

Operational Systems

Executives

SQL-H

TERADATA UNIFIED DATA ARCHITECTURE

Page 9: Teradata Big Data London Seminar

SQL-H In Action

Join Teradata, Hadoop, Aster tables; feed into MapReduce

SELECT qrd_focus_area_path, count(*)
FROM nPath(
  ON (
    SELECT * FROM
      ( SELECT * FROM load_from_teradata(
          ON mr_driver TDPID('dbc')
          USERNAME('name1') PASSWORD('password1')
          QUERY('SELECT * FROM owner.prod_own_fact') ) ) AS td
      JOIN owner.prod_dim proddim ON td.prod_id = proddim.product_id
      JOIN
      ( SELECT * FROM load_from_hadoop(
          ON mr_driver SERVER ('10.10.3.139')
          USERNAME ('name2') DBNAME ('repair')
          TABLENAME ('transaction') ) ) AS sqlh
      ON sqlh.prod_ident_nbr = proddim.id )
  PARTITION BY party_id, prod_id ORDER BY repair_dt
  MODE (OVERLAPPING)
  PATTERN ( 'REPAIR{3}' )
  SYMBOLS ( event = 'REPAIR' AS REPAIR )
  RESULT (ACCUMULATE(qrd_focus_area OF ANY(REPAIR)) AS qrd_focus_area_path )
) n
GROUP BY 1 ORDER BY 2 desc ;

SQL manipulation for calculation

TD Connector to get OWNERSHIP data

Any path you want, specified with the power of regular expressions!

Hadoop Connector to get WARRANTY data

Include local Aster tables in JOIN
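
The pattern-matching step of the query on this slide can be approximated in plain Python. This is only a sketch of what PATTERN ('REPAIR{3}') with MODE (OVERLAPPING) appears to ask for, not how nPath is implemented: within each (party_id, prod_id) partition, ordered by repair_dt, every run of three consecutive REPAIR rows yields one accumulated path, and a match may start at any row. The sample rows, the `repair_paths` helper and the bracketed path format are hypothetical. How the real function treats rows that satisfy no symbol is not modeled here.

```python
from collections import Counter, defaultdict


def repair_paths(rows, run_length=3):
    """Return one accumulated qrd_focus_area path per run of `run_length` REPAIR rows."""
    partitions = defaultdict(list)
    for row in rows:
        # PARTITION BY party_id, prod_id
        partitions[(row["party_id"], row["prod_id"])].append(row)

    paths = []
    for part in partitions.values():
        part.sort(key=lambda r: r["repair_dt"])  # ORDER BY repair_dt
        # OVERLAPPING (assumed): try a match starting at every row.
        for start in range(len(part) - run_length + 1):
            window = part[start:start + run_length]
            if all(r["event"] == "REPAIR" for r in window):
                paths.append("[" + ", ".join(r["qrd_focus_area"] for r in window) + "]")
    return paths


# Hypothetical sample data standing in for the joined Teradata/Hadoop/Aster rows.
columns = ("party_id", "prod_id", "repair_dt", "event", "qrd_focus_area")
rows = [dict(zip(columns, values)) for values in [
    (1, 10, "2012-01-05", "REPAIR", "engine"),
    (1, 10, "2012-02-11", "REPAIR", "engine"),
    (1, 10, "2012-03-02", "REPAIR", "brakes"),
    (1, 10, "2012-04-20", "REPAIR", "engine"),
    (2, 10, "2012-01-09", "REPAIR", "engine"),
    (2, 10, "2012-02-17", "REPAIR", "engine"),
    (2, 10, "2012-05-30", "REPAIR", "brakes"),
    (2, 11, "2012-01-03", "SALE", "none"),
    (2, 11, "2012-02-08", "REPAIR", "engine"),
    (2, 11, "2012-03-15", "REPAIR", "engine"),
]]

# GROUP BY 1 ORDER BY 2 desc
path_counts = Counter(repair_paths(rows)).most_common()
for path, count in path_counts:
    print(path, count)
```

Four consecutive repairs on one product produce two overlapping matches, which is the behaviour the OVERLAPPING mode is meant to capture in this sketch.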

Page 10: Teradata Big Data London Seminar

AUDIO & VIDEO IMAGES TEXT WEB & SOCIAL MACHINE LOGS CRM SCM ERP

DISCOVERY PLATFORM

CAPTURE | STORE | REFINE

INTEGRATED DATA WAREHOUSE

LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS

VIEWPOINT SUPPORT

Engineers

Data Scientists

Business Analysts

Front-Line Workers

Customers / Partners

Quants

Operational Systems

Executives

TERADATA UNIFIED DATA ARCHITECTURE

Aster Connector for Hadoop

Teradata Connector for Hadoop

Aster Teradata Connector

SQL-H

Aster Loader Teradata Loader

Page 11: Teradata Big Data London Seminar

When to Use Which? The best approach by workload and data type

Processing as a Function of Schema Requirements and Stage of Data Pipeline

Pipeline stages, in order: Low Cost Storage and Fast Loading; Data Pre-Processing, Refining, Cleansing; “Simple math at scale” (Score, filter, sort, avg., count...); Joins, Unions, Aggregates; Analytics (Iterative and data mining); Reporting

Stable Schema: Teradata/Hadoop; Teradata; Teradata; Teradata; Teradata; Teradata

Evolving Schema: Hadoop; Aster/Hadoop; Aster/Hadoop; Aster; Aster (SQL + MapReduce Analytics); Aster

Format, No Schema (two rows on the slide, both): Hadoop; Hadoop; Hadoop; Aster; Aster (MapReduce Analytics); Aster

Financial Analysis, Ad-Hoc/OLAP

Enterprise-Wide BI and Reporting

Spatial/Temporal

Active Execution

Interactive Data Discovery

Web Clickstream, Set-Top Box Analysis

CDRs, Sensor Logs, JSON

Social Feeds, Text, Image Processing

Audio/Video Storage and Refining

Storage and Batch Transformations

Page 12: Teradata Big Data London Seminar

Page 13: Teradata Big Data London Seminar

UDA IN PRACTICE

IPTV QUALITY OF SERVICE

Page 14: Teradata Big Data London Seminar

Starting point: Complaints Data

Page 15: Teradata Big Data London Seminar

Churners – and data quality

Page 16: Teradata Big Data London Seminar

CREATE dimension table wrk.npath_reboot_5events AS
SELECT path, COUNT(*) AS path_count
FROM nPath (
  ON wrk.w_event_f
  PARTITION BY srv_id
  ORDER BY evt_ts desc
  MODE (NONOVERLAPPING)
  PATTERN ('X{0,5}.reboot')
  SYMBOLS (true as X, evt_name = 'REBOOT' AS reboot)
  RESULT (FIRST (srv_id OF X) AS srv_id,
          ACCUMULATE (evt_name OF ANY (X, reboot)) AS path)
)
GROUP BY 1 ;

SELECT * FROM GraphGen (
  ON (SELECT * from wrk.npath_reboot_5events ORDER BY path_count LIMIT 30)
  PARTITION BY 1
  ORDER BY path_count desc
  item_format('npath')
  item1_col('path')
  score_col('path_count')
  output_format('sankey')
  justify('right')
);

Note the number of paths with a reboot following another reboot!

What events lead up to a reboot?
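
The path extraction behind this slide can be sketched in Python under some stated assumptions. The sketch treats PATTERN ('X{0,5}.reboot') as "up to five arbitrary events followed by a REBOOT", assumes greedy matching (the longest prefix wins), and reads NONOVERLAPPING as "resume the search after the end of the previous match". It works on one device's events in whatever order the caller supplies; the slide's query orders by evt_ts desc, which is not reproduced here. Event names other than REBOOT and the `reboot_paths` helper are hypothetical.

```python
from collections import Counter


def reboot_paths(events, max_prefix=5):
    """Match 'up to max_prefix arbitrary events, then REBOOT' without overlaps."""
    paths, start = [], 0
    while start < len(events):
        # Greedy (assumed): prefer the longest prefix that still ends in a REBOOT.
        for k in range(max_prefix, -1, -1):
            end = start + k
            if end < len(events) and events[end] == "REBOOT":
                paths.append(tuple(events[start:end + 1]))
                start = end + 1  # NONOVERLAPPING: continue after this match
                break
        else:
            start += 1  # no match begins at this row; try the next one
    return paths


# Hypothetical per-device event sequences (one list per srv_id).
events_by_device = {
    "A": ["SYNC_LOSS", "CRC_ERROR", "REBOOT", "REBOOT"],
    "B": ["REBOOT"],
    "C": ["SYNC_LOSS", "CRC_ERROR", "REBOOT", "REBOOT"],
    "D": ["CRC_ERROR", "CRC_ERROR"],
}

# Equivalent of SELECT path, COUNT(*) ... GROUP BY 1
path_counts = Counter(
    path for events in events_by_device.values() for path in reboot_paths(events)
)
for path, count in path_counts.most_common():
    print(" -> ".join(path), count)
```

Because the X symbol is always true, a reboot can itself appear inside the prefix, which is one way paths containing a reboot followed by another reboot can arise.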

Page 17: Teradata Big Data London Seminar

There appears to be an issue with the data from 30th September onward: the Reboot data for October seems to have been aggregated and added to 30th September.

View events data in Tableau

Page 18: Teradata Big Data London Seminar

• Remove paths with all reboots and exclude data from 30th September

It would appear that events with suffixes 1 and 2 can be added together

Address data quality

Page 19: Teradata Big Data London Seminar

Size of Node = number of customers

Width of Edge = number of errors

SELECT * FROM graphgen (
  ON (SELECT DISTINCT dmt_act_dslam, nra_id,
             nbr_of_srvid, errorspersrv, nbr_of_dslam
      FROM wrk.srvid_dslam_err)
  PARTITION BY 1
  ORDER BY errorspersrv
  item_format('cfilter')
  item1_col('dmt_act_dslam')
  item2_col('nra_id')
  score_col('errorspersrv')
  cnt1_col('nbr_of_srvid')
  cnt2_col('nbr_of_dslam')
  output_format('sigma')
  directed('false')
  width_max(10) width_min(1)
  nodesize_max (3) nodesize_min (1)
);

Visualise as a Graph using Aster GraphGen

Page 20: Teradata Big Data London Seminar

Synch Issues by Hub Type

Page 21: Teradata Big Data London Seminar

Error and Complaint rates by equipment type

Page 22: Teradata Big Data London Seminar
Page 23: Teradata Big Data London Seminar

UDA IN PRACTICE PREDICTIVE MODELS

Page 24: Teradata Big Data London Seminar

create table wrk.cih_dshb_ads as
SELECT srv_id, sav_flag, offer, inseecode, code_postal, libelle, nom_dep, nom_region, longitude, latitude,
  coalesce(topo_nra, 'Unknown') as topo_nra,
  topo_dslam,
  coalesce(iad_hardwareversion, 'Unknown') as iad_hardwareversion,
  coalesce(iad_manufacturer, 'Unknown') as iad_manufacturer,
  coalesce(iad_modelname, 'Unknown') as iad_modelname,
  coalesce(iad_modemfirmwareversion, 'Unknown') as iad_modemfirmwareversion,
  coalesce(iad_productclass, 'Unknown') as iad_productclass,
  coalesce(iad_provisioningcode, 'Unknown') as iad_provisioningcode,
  coalesce(iad_softwareversion, 'Unknown') as iad_softwareversion,
  coalesce(iad_vendorconfigfiledescription_1, 'Unknown') as iad_vendorconfigfiledescription_1,
  coalesce(iad_vendorconfigfilename_1, 'Unknown') as iad_vendorconfigfilename_1,
  coalesce(iad_vendorconfigfilenumbofentries, 0) as iad_vendorconfigfilenumbofentries,
  coalesce(iad_vendorconfigfileversion_1, 'Unknown') as iad_vendorconfigfileversion_1,
  coalesce(iad_x_000e50_boardversion, 'Unknown') as iad_x_000e50_boardversion,
  coalesce(stb_description, 'Unknown') as stb_description,
  coalesce(stb_devicestatus, 'Unknown') as stb_devicestatus,
  coalesce(stb_gwinfoproductclass, 'Unknown') as stb_gwinfoproductclass,
  coalesce(stb_hardwareversion, 'Unknown') as stb_hardwareversion,
  coalesce(stb_manufacturer, 'Unknown') as stb_manufacturer,
  coalesce(stb_productclass, 'Unknown') as stb_productclass,
  coalesce(stb_softwareversion, 'Unknown') as stb_softwareversion,
  dev_iad_uptime_diff, dsl_showtime_diff, dev_stb_uptime_diff,
  kpi_iad_uptime, kpi_iad_synctime, kpi_stb_uptime,
  dev_iad_uptime, dsl_showtime, dev_stb_uptime,
  dsl_downstr_att, dsl_downstr_cur, dsl_downstr_max,
  kpi_voip_nb_dropped_calls_diff, kpi_voip_nb_dropped_calls, kpi_dsl_nb_crc, kpi_dsl_dscurrate_ratio_qualite,
  kpi_voip_tx_appels_coupes, kpi_voip_qualite, kpi_voip_qualite_diff, kpi_iptv_plr_nb_bon, kpi_iptv_plr_nb_moyen,
  kpi_iptv_conso_heures, kpi_iptv_packetslosts, kpi_iptv_packetsreceived,
  kpi_dsl_dscurrate_before, kpi_dsl_dscurrate_after
FROM wrk.cih_dshb_bis
where network = 'BYT' and stb_manufacturer is not null and topo_dslam is not null

Input Data

Page 25: Teradata Big Data London Seminar

SELECT * FROM forest_drive(
  ON (SELECT 1) PARTITION BY 1
  DATABASE('beehive') USERID('beehive') PASSWORD('beehive')
  INPUTTABLE('wrk.cih_dshb_tree_in')
  OUTPUTTABLE('wrk.cih_dshb_tree_out')
  RESPONSE('sav_flag')
  NUMERICINPUTS('KPI_SIGNAL')
  CATEGORICALINPUTS('offer', 'nom_dep', 'nom_region', 'topo_nra', 'topo_dslam',
    'iad_modemfirmwareversion', 'iad_vendorconfigfiledescription_1', 'iad_x_000e50_boardversion',
    'stb_description', 'stb_productclass', 'stb_softwareversion', 'topo_dslam_brand')
  NUMTREES(4)
)

Decision Trees

Page 26: Teradata Big Data London Seminar

CREATE TABLE wrk.cih_dshb_model (PARTITION KEY(class)) AS
SELECT * FROM naiveBayesReduce(
  ON (
    SELECT * FROM naiveBayesMap(
      ON (select * from wrk.cih_dshb_ads_in_11 where kpi_iad_uptime is not null)
      RESPONSE('sav_flag')
      NUMERICINPUTS('dev_iad_uptime', 'dsl_showtime', 'dev_stb_uptime',
        'dsl_downstr_att', 'dsl_downstr_cur', 'dsl_downstr_max',
        'kpi_voip_nb_dropped_calls_diff', 'kpi_voip_nb_dropped_calls',
        'kpi_dsl_nb_crc', 'kpi_dsl_dscurrate_ratio_qualite',
        'kpi_voip_tx_appels_coupes', 'kpi_voip_qualite', 'kpi_voip_qualite_diff',
        'kpi_iptv_plr_nb_bon', 'kpi_iptv_plr_nb_moyen', 'kpi_iptv_plr_nb_mauvais',
        'kpi_iptv_packetslosts', 'kpi_iptv_packetsreceived',
        'kpi_stb_uptime', 'kpi_iad_synctime', 'kpi_iad_uptime')
      CATEGORICALINPUTS('offer', 'nom_dep', 'nom_region', 'topo_nra', 'topo_dslam',
        'iad_modemfirmwareversion', 'iad_vendorconfigfiledescription_1', 'iad_x_000e50_boardversion',
        'stb_description', 'stb_productclass', 'stb_softwareversion', 'topo_dslam_brand')
    )
  )
  PARTITION BY class
);

Naïve Bayes
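
The map-then-reduce shape of the statement on this slide can be illustrated with a small Python sketch. It shows one plausible division of labour for categorical inputs only: a map step that counts (class, attribute, value) occurrences within a chunk of rows, and a reduce step that sums those partial counts across chunks. Whether naiveBayesMap and naiveBayesReduce work this way internally is an assumption, and the helper names `nb_map` and `nb_reduce`, the "__prior__" key and the sample rows are hypothetical.

```python
from collections import Counter


def nb_map(rows, response, categorical):
    """Partial counts of (class, attribute, value) for one chunk of rows."""
    counts = Counter()
    for row in rows:
        cls = row[response]
        counts[(cls, "__prior__", "")] += 1  # how often each class occurs
        for attr in categorical:
            counts[(cls, attr, row[attr])] += 1
    return counts


def nb_reduce(partials):
    """Sum the partial counts produced by every map call."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return total


# Hypothetical rows split into two chunks, as if processed on two workers.
chunk_1 = [
    {"sav_flag": 1, "offer": "triple_play", "nom_region": "Bretagne"},
    {"sav_flag": 0, "offer": "triple_play", "nom_region": "Bretagne"},
]
chunk_2 = [
    {"sav_flag": 1, "offer": "triple_play", "nom_region": "Alsace"},
    {"sav_flag": 0, "offer": "double_play", "nom_region": "Alsace"},
]

model_counts = nb_reduce(
    nb_map(chunk, response="sav_flag", categorical=["offer", "nom_region"])
    for chunk in (chunk_1, chunk_2)
)

# Conditional frequency, e.g. P(offer = triple_play | sav_flag = 1)
p = model_counts[(1, "offer", "triple_play")] / model_counts[(1, "__prior__", "")]
print(p)
```

The counts are additive, so the map step can run independently on each partition and the reduce step only needs to sum them; numeric inputs would need sums and sums of squares instead of value counts and are left out of the sketch.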

Page 27: Teradata Big Data London Seminar

create table wrk.cih_svm_train2 distribute by hash(srv_id) as
select srv_id, 'topo_nra_insee' as attr, topo_nra_insee::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train
union all
select srv_id, 'code_postal' as attr, code_postal::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train
union all
select srv_id, 'kpi_iad_uptime_avg' as attr, kpi_iad_uptime_avg::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train
union all
select srv_id, 'dev_iad_uptime_diff_avg' as attr, dev_iad_uptime_diff_avg::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train
union all
select srv_id, 'kpi_voip_nb_dropped_calls_diff_avg' as attr, kpi_voip_nb_dropped_calls_diff_avg::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train
union all
select srv_id, 'sav_nb_contacts' as attr, sav_nb_contacts::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train
union all
select srv_id, 'nb_tr' as attr, nb_tr::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train
union all
select srv_id, 'kpi_dsl_nb_crc_avg' as attr, kpi_dsl_nb_crc_avg::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train;

/*Run SVM*/

CREATE TABLE wrk.cih_svm_model3 (PARTITION KEY(vec_index)) AS
SELECT vec_index, avg(vec_value) as vec_value
FROM svm(
  ON wrk.cih_svm_train2
  PARTITION BY srv_id
  OUTCOME('sav_flag')
  ATTRIBUTE_NAME('attr')
  ATTRIBUTE_VALUE('attr_value')
)
GROUP BY vec_index;

Support Vector Machine
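
The first statement on this slide reshapes a wide table (one column per feature) into a long table with one (attr, attr_value) row per feature per service. A minimal Python sketch of the same unpivot is given here, with one pass over the rows per attribute to mirror the one SELECT per UNION ALL branch. The `unpivot` helper and the sample values are hypothetical, and `str()` stands in for the `::varchar` cast.

```python
def unpivot(rows, key, target, attributes):
    """Turn wide rows into long (key, attr, attr_value, target) rows."""
    long_rows = []
    for attr in attributes:  # one UNION ALL branch per attribute
        for row in rows:
            long_rows.append({
                key: row[key],
                "attr": attr,
                "attr_value": str(row[attr]),  # ::varchar in the SQL
                target: row[target],
            })
    return long_rows


# Hypothetical wide training rows.
wide = [
    {"srv_id": 101, "code_postal": "35000", "nb_tr": 2, "sav_all_tgt": 1},
    {"srv_id": 102, "code_postal": "67000", "nb_tr": 0, "sav_all_tgt": 0},
]

long_rows = unpivot(wide, key="srv_id", target="sav_all_tgt",
                    attributes=["code_postal", "nb_tr"])
for row in long_rows:
    print(row)
```

Each input row becomes as many output rows as there are attributes, so two services with two attributes give four long rows.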

Page 28: Teradata Big Data London Seminar

Lift Chart to View Predictive Model Performance
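
A lift chart compares the response rate within each band of model scores to the overall response rate. The following Python sketch shows one common way to compute those values: rank by score, cut into equal-sized bins, and divide each bin's rate by the base rate. It is not taken from the presentation; the `lift_table` helper, the bin count and the sample scores are hypothetical.

```python
def lift_table(scores, labels, bins=10):
    """Return (bin_number, lift) pairs, bin 1 holding the highest scores."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and the same length")
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positive labels")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    table = []
    for b in range(bins):
        chunk = ranked[b * n // bins:(b + 1) * n // bins]
        if not chunk:
            continue  # more bins than rows
        rate = sum(label for _, label in chunk) / len(chunk)
        table.append((b + 1, rate / base_rate))
    return table


# Hypothetical model scores and actual outcomes (1 = contacted after-sales).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

lift = lift_table(scores, labels, bins=4)
for bin_number, value in lift:
    print(bin_number, value)
```

A lift above 1 in the top bins means the model concentrates positive cases among its highest scores; a lift near 1 everywhere means it ranks no better than random selection.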

Page 29: Teradata Big Data London Seminar