Big Data - Strategy & Execution

Ajay Bhargava, Director, Analytics & Big Data, Insurance & Healthcare, TCS
[email protected]

Copyright © 2012 Tata Consultancy Services Limited


Page 2

Big Data – Lessons from the practitioner’s desk

1. The verb (Analysis: what you do with Big Data) and the “other” adjective-noun (Business Insights/Value) are far more valuable than the adjective-noun (Big Data) itself

2. A critical prerequisite for Big Data Initiatives is a good Use Case

3. C-level support for sponsoring the business use case is critical to get your relationship with Big Data off the ground

4. Data as a Strategic Asset. Is P&C Insurer == Retailer in disguise?

5. Good Management of Big Data (Quality, Architecture, Governance, Security, Breadth & Depth) leads to Quality of Analysis, which leads to Better Value

6. SENSE & RESPOND are in a vicious, faster cycle than ever before

Page 3

Initial Interests from our Customers (sample set)

1. Customer-Centricity: CMO, CIO
•Cross-Channel (360° view across Website (Direct), Agent, Call Center, Social, MobileApp, eMail, Fax, Mail etc.)
•Marketing Experimentation – New Products & Services
•Multi-Dimensional Segmentation of 1
•Recommend Next Best Offer
•Listening on Social – Sentiment & Weblog Analysis
•Cycle efficiencies from Prospect to Customer
•Call Center: Cost to Investment Center
•Gaming Data – Social Behavior – www.vwobservatory.com

2. Operations: COO, CFO, CCO
•Customer Service – Claims Cycle business process optimization
•Reduction in Business SLAs: ETL performance

3. Risk Management: CIO, CCO, CFO
•Fraud: Business Rules, Predictive Analytics, Text Mining, AR Mining
•Search: Structured & Unstructured

AUS Bank, EUR INS, US BFSI

Page 4

Sentiment and Web log Analysis

POC Objective

Sentiment Analysis of in-house product (Kaching) using Facebook, Twitter and Gizmodo user comments

HADOOP Stack used for Implementation

•Data extraction from source data using Hadoop common libraries into HDFS

•Load data into HIVE from HDFS for performing sentiment analysis

•HIVE UDFs and Mahout libraries for sentiment analysis

•Apply business rules using Pig Scripts for web log analysis

•Pentaho Report Designer for reporting the results
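The slides do not say how the Hive UDFs and Mahout libraries actually score a comment. Purely as an illustration of what "sentiment analysis of user comments" can mean at its simplest, the sketch below labels each comment with a tiny hand-made word lexicon. The lexicon, the function names and the sample comments are all assumptions made for this example and are not taken from the POC.

```python
import re
from collections import Counter

# Hypothetical word lists; the POC's real model (Hive UDFs + Mahout) is not described.
POSITIVE = {"love", "great", "easy", "fast", "useful"}
NEGATIVE = {"hate", "slow", "broken", "confusing", "crash"}


def score_comment(text: str) -> str:
    """Label one comment as positive, negative or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"


def summarize(comments):
    """Count how many comments fall under each label."""
    return Counter(score_comment(c) for c in comments)


# Made-up comments standing in for Facebook, Twitter and Gizmodo extracts.
sample_comments = [
    "Love the app, great and fast",
    "Slow and confusing, crash on login",
    "Downloaded it yesterday",
]
summary = summarize(sample_comments)
```

In a Hadoop setting the per-comment function would play the role of a UDF applied row by row, and the summary would be a grouped count over the output table.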

[Architecture diagram with two flows, SENTIMENT ANALYSIS and WEB LOG ANALYSIS. Components shown: Flat Files; Social Networking Sites; HDFS Staging Area; Hive I/P Tables; Hive UDF’s; Mahout Lib; Hive O/P Tables; HDFS; Pig Scripts; Generated reports using MS Excel; Pentaho Reports]

AUS Bank

Page 5

Use Case 1

Get the Top 10 accessed websites by reading through the performance log
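The POC ran this use case with Pig scripts on Hadoop; the slides do not show the script or the log layout. As a small single-machine sketch of the same counting logic, the Python below assumes a hypothetical log line of the form `timestamp session_id url` and counts hits per host name.

```python
from collections import Counter
from urllib.parse import urlparse


def top_sites(log_lines, n=10):
    """Return the n most accessed hosts as (host, hits) pairs.

    Assumes each line is 'timestamp session_id url' (a made-up layout).
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed lines instead of failing the whole job
        host = urlparse(fields[2]).netloc
        if host:
            counts[host] += 1
    return counts.most_common(n)


# Made-up sample records for illustration only.
sample_log = [
    "2012-05-01T10:00:00 s1 http://example.com/home",
    "2012-05-01T10:00:05 s1 http://example.org/a",
    "2012-05-01T10:01:00 s2 http://example.com/login",
    "malformed line",
]
top = top_sites(sample_log)
```

On a cluster the same idea maps onto a group-by on the host followed by a count, an ordering and a limit of 10.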

Use Case 2

Reading through the performance logs, find the typical number of transactions (count of websites accessed) in a single user session
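The slide does not define "typical". One reasonable reading is the median number of log records per session, which the sketch below computes. The `timestamp session_id url` line layout and the sample data are assumptions for illustration, not the POC's actual format.

```python
from collections import Counter
from statistics import median


def transactions_per_session(log_lines):
    """Count log records per session id (second field of each line)."""
    per_session = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 3:
            per_session[fields[1]] += 1
    return per_session


def typical_session_size(log_lines):
    """Median transactions per session; 0 when the log has no usable lines."""
    counts = transactions_per_session(log_lines)
    return median(counts.values()) if counts else 0


# Made-up sample: s1 has 3 hits, s2 has 1, s3 has 2.
sample_log = [
    "t1 s1 http://example.com/a",
    "t2 s1 http://example.com/b",
    "t3 s2 http://example.org/a",
    "t4 s1 http://example.com/c",
    "t5 s3 http://example.net/a",
    "t6 s3 http://example.net/b",
]
typical = typical_session_size(sample_log)
```

The median is used rather than the mean so that a few very long sessions do not distort the figure; the mean or mode would be equally simple to swap in.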

Use Case 3

Group customers based on their residential address, postal codes and identify the types of interaction

Sentiment and Web log Analysis

Job #   Data (GB)   # Records      # Mappers   # Reducers   Processing Time
1       285         423,181,518    1061        285          01 hr 24 min 00 sec

Job #   Data (GB)   # Records      # Mappers   # Reducers   Processing Time
1       285         423,181,518    1061        285          01 hr 42 min 00 sec
2       1.08        75,624,607     4           2            01 hr 01 min 43 sec

Job #   Data (GB)   # Records      # Mappers   # Reducers   Processing Time
1       4.34        57,414,131     17          5            02 min 42 sec
2       4.89        90,986,110     19          5            03 min 14 sec
3       0.74        27,419,136     5           1            01 min 59 sec
4       0.0007      31,112         14          0            00 min 18 sec

AUS Bank

Page 6

ETL Performance - Customer Data Validation

Problem Statement

The “Customer Value Management” application receives customer data from upstream systems in flat files. This data is extracted and loaded into the Data Warehouse (Teradata) using an ETL (Informatica) job.

The job includes
•Referential Integrity check
•Technical Validation
•Delta identification
•Business Validation
•Data Cleansing
•Data Transformation
•Data Enrichment

It then performs the single customer view (SCV) process against the huge data in Teradata and finally a Master qualification process.

TCS Hadoop Solution

•Sqoop/Java frameworks - Load the data to HDFS
•MapReduce jobs - Data Validation & Transformation
•TCS Workflow framework on Oozie - Workflow Management
•TCS Rules framework - Configuring Transformation & Validation Rules

EUR INS
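Two of the validation steps listed for the job, delta identification and the referential integrity check, can be pictured with a small in-memory sketch. This is not the TCS rules framework or the MapReduce code, neither of which is shown in the slides; the record layout (`customer_id`, `policy_id`, `name`) and the function names are assumptions for illustration.

```python
def identify_delta(incoming, existing):
    """Classify incoming customer records against the existing store.

    Both arguments map customer_id -> record dict (hypothetical layout).
    """
    delta = {"new": [], "changed": [], "unchanged": []}
    for customer_id, record in incoming.items():
        if customer_id not in existing:
            delta["new"].append(customer_id)
        elif record != existing[customer_id]:
            delta["changed"].append(customer_id)
        else:
            delta["unchanged"].append(customer_id)
    return delta


def orphan_policies(policies, known_customer_ids):
    """Referential integrity: policy ids whose customer_id is not a known customer."""
    return [p["policy_id"] for p in policies
            if p["customer_id"] not in known_customer_ids]


# Made-up sample data.
existing = {"C1": {"name": "Ann"}, "C2": {"name": "Bob"}}
incoming = {"C1": {"name": "Ann"}, "C2": {"name": "Robert"}, "C3": {"name": "Cy"}}
policies = [
    {"policy_id": "P1", "customer_id": "C3"},
    {"policy_id": "P2", "customer_id": "C9"},
]

delta = identify_delta(incoming, existing)
orphans = orphan_policies(policies, set(existing) | set(incoming))
```

At Hadoop scale the dictionary lookups would become joins between the incoming files and the existing data keyed on the customer id, which is the kind of work that spreads across mappers and reducers.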

Page 7

ETL Performance - Customer Data Validation

Platform          Incoming Customer        Incoming Policy          Existing Customers        Time Taken (~Mins)
Hadoop            ~0.7 GB, 0.6 M Records   ~0.7 GB, 0.6 M Records   ~14 GB, 64 M Records      14
Hadoop            ~0.7 GB, 0.6 M Records   ~0.7 GB, 0.6 M Records   ~100 TB, 459 B Records    1200 (extrapolated)
Traditional ETL   ~0.7 GB, 0.6 M Records   ~0.7 GB, 0.6 M Records   ~14 GB, 64 M Records      45
Traditional ETL   ~0.7 GB, 0.6 M Records   ~0.7 GB, 0.6 M Records   ~100 TB, 459 B Records    2280

Actually executed scenarios and the results are consistent

Inference

• 0.7 GB (600K records) of incoming records (Customer and Policy) were validated against 14 GB (64 million records) and transformed.
• Hadoop infrastructure can scale horizontally by adding nodes.
• Traditional ETL is limited in how far it can scale, and processing TBs and PBs may not be possible.
• The data was extracted to HDFS from the App Database for optimal performance.

EUR INS
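The elapsed times on this slide can be turned into speedup ratios with a few lines of arithmetic. The sketch below reads the figures as Hadoop 14 and 1200 minutes and traditional ETL 45 and 2280 minutes for the 14 GB and 100 TB cases respectively; that pairing is an interpretation of the flattened slide layout, and the slide marks the 1200-minute figure as extrapolated.

```python
# Elapsed minutes as read from the slide (pairing to platforms is an assumption).
timings = {
    "14 GB / 64 M records": {"hadoop": 14, "traditional_etl": 45},
    "100 TB / 459 B records": {"hadoop": 1200, "traditional_etl": 2280},
}

# Speedup = traditional ETL time divided by Hadoop time, rounded to 2 decimals.
speedups = {
    scenario: round(t["traditional_etl"] / t["hadoop"], 2)
    for scenario, t in timings.items()
}

for scenario, factor in speedups.items():
    print(f"{scenario}: Hadoop is about {factor}x faster")
```

Under this reading Hadoop is roughly three times faster on the 14 GB case and roughly twice as fast on the 100 TB case.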

Page 8

Big Data – Execution Strategy – “Search” Use Case

1. Start with (Proof of Concept/Deployment)
Big: Business Value Use Case : Simple Unstructured Search
Small: Scope : Small set of PDF, DOC
Big: Data Platform/Architecture : EC2 9-node, Lucene
Small: Investment : Minuscule
Big: Support/backing : CTO of 1 major LOB
Small: Team : 6
Big: Collaboration – IT & Business : Use case dictated
Small: Duration : ~ 3 months
Big: Data Management emphasis : No Privacy/Quality Issues
Small: Increments of Success : Move to Structured Search

2. Reevaluate/Deploy

Use Case: e.g. Evaluate the impact of Wal-Mart bribery allegations on our Portfolio

This capability is NOT POSSIBLE in the current environment. Unstructured data is looked at rarely and inconsistently, and the process is people-dependent.

US BFSI
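To make the "simple unstructured search" idea concrete, the sketch below builds a toy inverted index over already-extracted document text and answers an all-terms query. It is a plain-Python stand-in for what a search library does internally, not the search engine API used in the POC; the file names and text snippets are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(tokenize(text)):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented snippets standing in for text extracted from PDF and DOC files.
docs = {
    "memo.pdf": "Exposure to Wal-Mart bonds after bribery allegations",
    "outlook.doc": "Quarterly retail outlook",
    "contracts.doc": "Wal-Mart supplier contracts",
}
index = build_index(docs)
hits = search(index, "Wal-Mart bribery")
```

A query such as the Wal-Mart example then becomes a lookup that returns the documents to review, instead of depending on someone remembering where the relevant files are.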

Page 9

Thank You

[email protected]