Fraud Detection Architecture

Post on 21-Apr-2017

Transcript of Fraud Detection Architecture

Real Time Fraud Detection: Patterns and reference architectures

Ted Malaska // PSA
Gwen Shapira // Software Engineer

Overview

• Intro
• Review Problem
• Quick overview of key technology
• High level architecture
• Deep Dive into NRT Processing
• Completing the Puzzle: Micro-batch, Ingest and Batch

©2014 Cloudera, Inc. All rights reserved.

Gwen Shapira

• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
  – Sqoop Committer
  – Kafka
  – Flume
• @gwenshap

Hello

• Ted Malaska (PSA at Cloudera)
• Hadoop for ~5 years
• Contributed to
  – HDFS, MapReduce, Yarn, HBase, Spark, Avro,
  – Kite, Pig, Navigator, Cloudera Manager, Flume, Kafka, Sqoop, Accumulo
  – And working on a Sentry Patch
• Co-author of O’Reilly’s Hadoop Application Architectures
• Worked with about 70 companies in 8 countries
• Marvel Fan Boy
• Runner

The Problem

Credit Card Transaction Fraud

Ikea Meat Balls

Coupon Fraud

Video Game Strategy

Health Insurance Fraud

Review of the Problem

• Typical Atomic Card Fraud Detection
• Ikea Meat Balls
• Multi Coupons Combinations
• OP or Negative Video Games Strategies
• Ad Serving
• Health Insurance Fraud
• Kid Coming Home From School

How do we React

• Human Brain at Tennis
  – Muscle Memory
  – Reaction Thought
  – Reflective Meditation

Overview of Key Technologies

Kafka

The Basics

• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster. Nodes are called brokers

Topics, Partitions and Logs

Each partition is a log

Each Broker has many partitions

[Diagram: brokers side by side, each holding copies of Partition 0, Partition 1 and Partition 2]

Producers load balance between partitions

[Diagram: a client producing to Partition 0, Partition 1 and Partition 2, which are spread across the brokers]
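How a producer can spread messages over a topic's partitions is sketched below in plain Scala. This is an illustration only, not Kafka's producer code: `RoundRobinProducer` and `send` are hypothetical names, and round-robin is assumed as the balancing strategy for messages that carry no key.

```scala
import scala.collection.mutable

// Hypothetical model, not the Kafka client API: a topic is a fixed number of
// partitions, each one an append-only buffer of messages.
final class RoundRobinProducer(numPartitions: Int) {
  require(numPartitions > 0, "a topic needs at least one partition")

  val partitions: Vector[mutable.ArrayBuffer[String]] =
    Vector.fill(numPartitions)(mutable.ArrayBuffer.empty[String])

  private var next = 0

  // Appends the message to the next partition in turn and returns its index.
  def send(message: String): Int = {
    val target = next
    partitions(target) += message
    next = (next + 1) % numPartitions
    target
  }
}
```

With three partitions, consecutive calls to `send` land on partitions 0, 1, 2, 0, and so on, so no single partition takes all the load.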


Consumers

[Diagram: a Kafka cluster with one topic split into Partition A, Partition B and Partition C, each a file. The consumers of Consumer Group X and Consumer Group Y read from the partitions, and each partition shows a separate offset for X and for Y.]

• Order is retained within a partition, but not across partitions
• Offsets are kept per consumer group
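The idea that each consumer group keeps its own offset into the same partition can be sketched as follows. `PartitionLog`, `poll` and `offsetOf` are hypothetical names chosen for this illustration; they are not the Kafka consumer API, and a real broker also handles retention, replication and offset commits.

```scala
import scala.collection.mutable

// Hypothetical model of one partition: an ordered, append-only log.
final class PartitionLog {
  private val entries = mutable.ArrayBuffer.empty[String]
  // One read position per consumer group; a new group starts at offset 0.
  private val offsets = mutable.Map.empty[String, Int].withDefaultValue(0)

  def append(message: String): Unit = {
    entries += message
    ()
  }

  // Returns up to `max` messages for the group and advances only that
  // group's offset. Other groups are unaffected.
  def poll(group: String, max: Int): Seq[String] = {
    val from = offsets(group)
    val batch = entries.slice(from, from + max).toList
    offsets(group) = from + batch.size
    batch
  }

  def offsetOf(group: String): Int = offsets(group)
}
```

Because the log itself is never changed by a read, group X and group Y can each consume every message at their own pace, and within the partition both see the messages in the order they were appended.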

Flume

Short Intro to Flume

Flume Agent: Sources → Interceptors → Selectors → Channels → Sinks

• Sources: Twitter, logs, JMS, webserver, Kafka
• Interceptors: Mask, re-format, validate…
• Selectors: DR, critical
• Channels: Memory, file, Kafka
• Sinks: HDFS, HBase, Solr

Flume and/or Kafka

[Diagram: Upstream → Flume Source → Interceptor → Selector → Flume Channel → Flume Sink → Downstream, with three of the Flume stages labelled "Can Be Kafka"]

Interceptors

• Mask fields
• Validate information against external source
• Extract fields
• Modify data format
• Filter or split events
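A chain of interceptors that masks, extracts and filters can be sketched in plain Scala. This does not use Flume's own `Interceptor` interface: the `Event` case class, the function-based `Interceptor` type, the `card=... amount=...` body layout and the masking rule are all assumptions made for the example.

```scala
// Hypothetical event type and interceptor signature, not Flume's API: an
// interceptor returns Some(event) to pass it on (possibly modified) or None
// to filter it out.
final case class Event(headers: Map[String, String], body: String)

object Interceptors {
  type Interceptor = Event => Option[Event]

  // Mask fields: hide all but the last four digits of a 16-digit card number.
  val maskCard: Interceptor = e =>
    Some(e.copy(body = e.body.replaceAll("\\b\\d{12}(\\d{4})\\b", "************$1")))

  // Extract fields: copy the amount from the body into a header.
  val extractAmount: Interceptor = e => {
    val amount = "amount=(\\d+)".r.findFirstMatchIn(e.body).map(_.group(1))
    Some(amount.fold(e)(a => e.copy(headers = e.headers + ("amount" -> a))))
  }

  // Filter events: drop anything without an amount header.
  val requireAmount: Interceptor = e =>
    if (e.headers.contains("amount")) Some(e) else None

  // Runs the interceptors in order, stopping at the first one that drops the event.
  def chain(steps: Interceptor*): Interceptor =
    e => steps.foldLeft(Option(e))((acc, step) => acc.flatMap(step))
}
```

Calling `Interceptors.chain(maskCard, extractAmount, requireAmount)` gives a single function that can be applied to each incoming event; an event without an amount comes back as `None`.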

Spark Streaming

Spark Streaming Example

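import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Two local threads: one for the receiver, one for processing
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))  // 1-second batches
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()             // nothing runs until the context is started
ssc.awaitTermination()  // keep the application alive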

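Spark Streaming Example (batch version for comparison)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("BatchWordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)  // path: input location, defined elsewhere
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// RDDs have no print(); collect to the driver and print there
wordCounts.collect().foreach(println)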

Spark Streaming: DStream

[Diagram: three columns labelled Pre-first Batch, First Batch and Second Batch. In each, a Receiver pulls from the Source into a new RDD, so the DStream holds one more RDD per batch. Each batch's RDD goes through Filter, Count and Print in a single pass.]

Spark Streaming: DStream with state

[Diagram: the same three batches, with Filter and Count in a single pass and a stateful RDD alongside Print. The first batch shows Stateful RDD 1; the second batch shows Stateful RDD 1 and Stateful RDD 2.]
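Carrying state from one micro-batch to the next can be sketched without a cluster. In Spark Streaming this is what `updateStateByKey` on a pair DStream provides; the plain-Scala model below only imitates its shape. `StatefulCounts` and `applyBatch` are hypothetical names, and a running count per key is assumed as the state.

```scala
object StatefulCounts {
  // Mirrors the shape of the update function passed to updateStateByKey:
  // new values seen for a key in this batch, plus the previous state if any.
  def update(newValues: Seq[Int], previous: Option[Int]): Option[Int] =
    Some(previous.getOrElse(0) + newValues.sum)

  // Applies one micro-batch of (key, value) pairs to the running state.
  // Every known key is visited, including keys with no new values.
  def applyBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      update(grouped.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }
}
```

Each call to `applyBatch` plays the role of one batch interval: the map returned for the first batch is the input state for the second.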

Spark Streaming and HBase

[Diagram: the Driver sends Configs to each Worker Node. On every Worker Node the Executor keeps the Configs and one HConnection in static space, shared by the Tasks running in that Executor.]

High Level Architecture

Real-Time Event Processing Approach

[Diagram: components and flow labels]

• Clients (Swipe here!) and Web App
• Hadoop Cluster I: Flume Agents with Local Cache, Kafka, HBase / Memory, Spark Streaming
• Hadoop Cluster II:
  – Storage: HDFS
  – Processing: Hive/Impala, Map/Reduce, Spark
  – Search: SolR
• Sinks: HDFSEventSink, SolR Sink
• Flow labels: Fetching & Updating Profiles; Adjusting NRT Stats; Batch Time Adjustments; Automated & Manual Analytical Adjustments and Pattern detection; Automated & Manual Review of NRT Changes and Counters

NRT Processing

Focus on NRT First

[Architecture diagram repeated; highlighted: NRT Event Processing with Context]

Streaming Architecture – NRT Event Processing

[Diagram: Flume Sources publish to the Kafka Initial Events Topic. A Kafka Consumer feeds a Flume Source whose Flume Interceptor holds the Event Processing Logic, Local Memory and an HBase Client talking to HBase. A Kafka Producer writes the result to the Kafka Answer Topic.]

Able to respond within 10s of milliseconds
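The pairing of local memory with an HBase client can be sketched as a read-through cache in front of a slower store. Everything named here is an assumption for illustration: `Swipe`, `Profile`, `ProfileStore` (a stand-in for the HBase client) and the approval rule, which the slides do not specify.

```scala
import scala.collection.mutable

// Hypothetical types: the slides do not define an event or profile schema.
final case class Swipe(cardId: String, amount: Double)
final case class Profile(avgAmount: Double)

// Stand-in for the HBase client; a real one would do a remote Get.
trait ProfileStore { def fetch(cardId: String): Option[Profile] }

final class EventProcessor(store: ProfileStore, threshold: Double) {
  private val localMemory = mutable.Map.empty[String, Profile]
  var remoteFetches = 0

  // Looks in local memory first and goes to the store only on a miss.
  private def profileFor(cardId: String): Option[Profile] =
    localMemory.get(cardId).orElse {
      remoteFetches += 1
      val fetched = store.fetch(cardId)
      fetched.foreach(p => localMemory.update(cardId, p))
      fetched
    }

  // Made-up rule for illustration: approve unless the amount is more than
  // `threshold` times the card's average. Unknown cards are rejected.
  def approve(swipe: Swipe): Boolean =
    profileFor(swipe.cardId).exists(p => swipe.amount <= p.avgAmount * threshold)
}
```

After the first event for a card, later events for that card are answered from local memory, which is what keeps the remote lookup off the response path for most events.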

Partitioned NRT Event Processing

[Diagram: the same pipeline, with the Initial Events Topic split into Partition A, Partition B and Partition C. Each upstream Producer has a Partitioner, and a Custom Partitioner decides which partition an event goes to.]

Better use of local memory
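One way a custom partitioner can lead to better use of local memory is to route by key, so that all events for one card reach the same partition and therefore the same processor. The sketch below assumes the card id is the key and uses a simple hash; `CardPartitioner` is a hypothetical name and does not implement Kafka's partitioner interface.

```scala
object CardPartitioner {
  // Hypothetical key-based partitioner: hashes the card id so every event for
  // a card maps to the same partition. floorMod keeps the result non-negative
  // even when hashCode is negative.
  def partitionFor(cardId: String, numPartitions: Int): Int =
    java.lang.Math.floorMod(cardId.hashCode, numPartitions)

  // Groups card ids by the partition (and so the processor) that owns them.
  def ownership(cardIds: Seq[String], numPartitions: Int): Map[Int, Set[String]] =
    cardIds.groupBy(partitionFor(_, numPartitions)).map { case (p, ids) => p -> ids.toSet }
}
```

Since each card belongs to exactly one partition, a processor reading that partition only needs to hold profiles for its own cards in local memory, rather than a copy of every profile.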

Completing the Puzzle

Micro Batching

[Architecture diagram repeated; highlighted: Micro Batching]

Complex Topologies

Spark Streaming 1.3: Kafka Direct Connection

[Diagram: Kafka Initial Events Topic → Kafka Direct Connection → Spark Streaming DAG Topologies]

• Manages offsets
• Stores offsets in the RDD
• No longer needs HDFS for initial RDD checkpointing

Spark Streaming 1.2: Kafka Receivers

[Diagram: Kafka Initial Events Topic → Kafka Receivers → Spark Streaming DAG Topologies]

• Lets Kafka manage offsets
• Uses HDFS for initial RDD recovery

MicroBatch Bad-Input Handling

[Diagram: four Kafka topics, each drawn as a log with offsets 0 to 13, connected through the DAG Topologies: incoming events topic, bad events topic, resolved events topic, results topic]
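Routing unparseable input to a separate topic instead of failing the batch can be sketched with `Either`. The `cardId,amount` line format, the `Txn` type and the `BadInputRouting` object are assumptions for this example; in a real job the two resulting sequences would be written to the bad events topic and passed on to processing.

```scala
object BadInputRouting {
  // Hypothetical parsed form; the slides do not give an event schema.
  final case class Txn(cardId: String, amount: Double)

  // Parses "cardId,amount". Left carries the raw line for the bad events topic.
  def parse(line: String): Either[String, Txn] =
    line.split(",", -1) match {
      case Array(id, amt) if id.nonEmpty =>
        scala.util.Try(amt.trim.toDouble).toOption
          .map(a => Txn(id, a))
          .toRight(line)
      case _ => Left(line)
    }

  // Splits one micro-batch into (bad lines, parsed transactions) so that one
  // malformed record does not fail the whole batch.
  def route(batch: Seq[String]): (Seq[String], Seq[Txn]) = {
    val parsed = batch.map(parse)
    (parsed.collect { case Left(raw) => raw }, parsed.collect { case Right(t) => t })
  }
}
```

Keeping the raw line on the bad side means nothing is lost: a corrected version can later be published to the resolved events topic and processed like any other event.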

Ingestion

[Architecture diagram repeated; highlighted: Ingestion]

[Diagram: a Kafka Cluster topic with Partition A, Partition B and Partition C feeds three groups of Flume sinks: Flume HDFS Sinks writing to HDFS, Flume SolR Sinks writing to SolR, and Flume HBase Sinks writing to HBase]

Reflective Thoughts

[Architecture diagram repeated; highlighted: Research and Searching]