How do you decide where your customer was?

Transcript of How do you decide where your customer was?

Page 1: How do you decide where your customer was?

Burak IŞIKLI Erkan HASPULAT

How Do You Decide Where Your Customer Was?

Page 2: How do you decide where your customer was?

Who are we?

Burak IŞIKLI
Senior Software Engineer, Turkcell
burak.isikli at turkcell dot com.tr
@burakisikli, github.com/burakisikli

Erkan HASPULAT
Senior Software Engineer, Turkcell
erkan.haspulat at turkcell dot com.tr
@erkanhaspulat, github.com/ehaspulat

Page 3: How do you decide where your customer was?

Turkcell

• It is the first and only Turkish company ever to be listed on the NYSE.

Page 4: How do you decide where your customer was?

Our Hadoop Journey

2013-…
15 TB data processing daily
6 TB log transfer daily
350 jobs running daily

Page 5: How do you decide where your customer was?

Our Hadoop Journey

• ETL: CDR Analysis, Log Processing...
• Analytics: Fraud Analytics, Clickstream, Recommendation Engine
• Data Lake: Customer Journey

Page 6: How do you decide where your customer was?

Architecture

RDBMS

Logs

Tuna

Unix/Internal Tools

HADOOP

DB Dashboards

Archive Ad-Hoc Analysis

Mining

External Systems

Alarm Systems

Page 7: How do you decide where your customer was?

Issues

• The default Python is 2.6, but Spark with IPython requires Python 2.7+

• Security & Auditing Issues: Copying Data by Masking, Dynamic Data Masking, SOX Compliance

Page 8: How do you decide where your customer was?

Location Analysis

Find the subscriber’s location using cell information.

11 billion rows/day, 0.5 TB/day, 2.5 hours processing time

Hadoop Streaming w/ Perl, Sqoop

Page 9: How do you decide where your customer was?

Recursive Process
1. Subscriber calls
2. XDR is generated
3. FTP/SCP process is started
4. Put into HDFS

Dispatcher

Page 10: How do you decide where your customer was?

Dispatcher

Why not Flume?
FTP, scp, rsync? Think about 6 TB/day.
Rsync and FTP work serially!
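The cost of serial transfers can be illustrated with a small sketch. It assumes a hypothetical `transfer_one` function standing in for whatever actually moves a file; the function names, the worker count, and the thread-pool approach are illustrative assumptions, not the Dispatcher's actual implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def transfer_one(path):
    """Hypothetical stand-in for moving one file; sleeps to simulate I/O wait."""
    time.sleep(0.05)
    return path


def transfer_serial(paths):
    """Move files one after another, the way a single ftp/rsync session would."""
    return [transfer_one(p) for p in paths]


def transfer_parallel(paths, workers=8):
    """Move files concurrently; threads fit here because the work is I/O-bound."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        return list(pool.map(transfer_one, paths))


if __name__ == "__main__":
    files = [f"cdr_{i:03d}.gz" for i in range(16)]  # made-up file names
    for fn in (transfer_serial, transfer_parallel):
        start = time.perf_counter()
        fn(files)
        print(fn.__name__, round(time.perf_counter() - start, 2), "s")
```

With simulated waits like these, the parallel version finishes in roughly the time of the slowest batch rather than the sum of all transfers, which is the property that matters at several terabytes a day.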

Page 11: How do you decide where your customer was?

Dispatcher
Experiment - CDRs
Every 15 secs, up to 1 MB/file, total 10 MB gzipped files

java.lang.OutOfMemoryError
PIG_HEAPSIZE=8000
Failed reallocation of scalar replaced objects
JDK-8145996

Img: http://bit.ly/1QSRkGn

Page 12: How do you decide where your customer was?

Location Analysis
Join? Perl?
Mapper > Header/Trailer
JoinColumn-fileName-Rest
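The mapper step can be sketched as a Hadoop Streaming mapper that drops header and trailer records and emits the join column, the file name, and the rest of the record. The original job was written in Perl; this Python version is only an illustration. The record prefixes (`HDR`, `TRL`), the `|` delimiter, and the position of the join column are assumptions, since the slides do not show the record layout.

```python
import os
import sys

HEADER_PREFIX = "HDR"   # assumed marker for header records
TRAILER_PREFIX = "TRL"  # assumed marker for trailer records
DELIM = "|"             # assumed field delimiter


def map_line(line, file_name, join_index=0):
    """Return 'joinColumn<TAB>fileName<TAB>rest', or None for header/trailer lines."""
    line = line.rstrip("\n")
    if not line or line.startswith((HEADER_PREFIX, TRAILER_PREFIX)):
        return None
    fields = line.split(DELIM)
    join_key = fields[join_index]
    # Everything except the join column is carried along unchanged.
    rest = DELIM.join(fields[:join_index] + fields[join_index + 1:])
    return "\t".join((join_key, file_name, rest))


def main():
    # Hadoop Streaming exposes job properties as environment variables with
    # dots replaced by underscores; fall back to "unknown" outside Hadoop.
    input_path = os.environ.get("mapreduce_map_input_file", "unknown")
    file_name = os.path.basename(input_path)
    for line in sys.stdin:
        out = map_line(line, file_name)
        if out is not None:
            print(out)


if __name__ == "__main__":
    main()
```

Putting the join column first makes it the key that Hadoop Streaming sorts and groups on, so records sharing that key reach the same reducer.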

Img: http://bit.ly/1TrUJhv

Page 13: How do you decide where your customer was?

A tree in the forest

Img: http://go.nasa.gov/1SX3wGl

Page 14: How do you decide where your customer was?

Volume

Date        Data size
May 2014    1 TB
Jul. 2014   1.5 TB
Nov. 2014   2 TB
Mar. 2016   4.5 TB

Data size is growing too fast!
What about LTE?

Page 15: How do you decide where your customer was?

Adapt to the volume

Page 16: How do you decide where your customer was?

Upgrade

• No space left on disk!
• Hadoop upgrade: 0.23 -> 1.3.1 -> 2.7.1
• Linear scalability

            Before      After
Nodes       1M + 4D     1M + 1SM + 16D
Disk        15 TB       698 TB
CPU         20 cores    224 cores
Memory      1024 GB     1.5 TB
Version     0.23        HDP 2.3.2
Perl job    212 min     104 min
Pig job     77 min      18 min

Page 17: How do you decide where your customer was?

Industry-Specific Analysis

Competitor Comparison
E.g. shopping center comparison in Istanbul based on city, district, and demographic information (age, sex, income, job… etc.)

Page 18: How do you decide where your customer was?

Industry-Specific Analysis

But how?
Perl?
Hive or Pig?
What else?

Page 19: How do you decide where your customer was?

Industry-Specific Analysis

But how? Hive external partition:
ALTER TABLE t1 ADD PARTITION (DAILY_CALENDAR_ID='20160101') LOCATION '/user/…/tlc/daily_calendar_id=20160101';
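Since one partition is added per day, the statement lends itself to being generated. The sketch below builds the same kind of `ALTER TABLE ... ADD PARTITION` statement for a range of dates. The helper names and the `base_path` argument are hypothetical (the real HDFS path is elided in the slide), and `IF NOT EXISTS` is added here only so that re-running the statements is harmless.

```python
from datetime import date, timedelta


def build_add_partition(table, day, base_path):
    """Build a Hive statement registering one daily partition at its HDFS location."""
    day_id = day.strftime("%Y%m%d")
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (DAILY_CALENDAR_ID='{day_id}') "
        f"LOCATION '{base_path}/daily_calendar_id={day_id}';"
    )


def build_for_range(table, first_day, last_day, base_path):
    """Yield one statement per day, both ends inclusive."""
    day = first_day
    while day <= last_day:
        yield build_add_partition(table, day, base_path)
        day += timedelta(days=1)


if __name__ == "__main__":
    # "/data/tlc" is a placeholder path, not the one used in the talk.
    for stmt in build_for_range("t1", date(2016, 1, 1), date(2016, 1, 3), "/data/tlc"):
        print(stmt)
```

The generated statements could then be written to a file and run with the Hive client of choice.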

Page 20: How do you decide where your customer was?

Movement Index

Subscribers' journeys are derived from signaling data to analyze how they travel between cities.
• Airline companies
• Bus companies
• Local government
• Survey companies

Page 21: How do you decide where your customer was?

Movement Index

Simple Euclidean Distance

Equ: https://en.wikipedia.org/wiki/Euclidean_distance

Finding the change of location
First, find the closeness of each cell using coordinates

Page 22: How do you decide where your customer was?

Movement Index

Euclidean Distance Hive Query – Cross Join
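The query itself is not reproduced in this transcript. As a rough illustration of what a cross join over cell coordinates computes, here is a minimal Python sketch. It assumes each cell has planar `(x, y)` coordinates; the cell ids and coordinate values are made up, and real latitude/longitude data would first need to be projected for a straight-line distance to be meaningful.

```python
import math
from itertools import product


def cell_distances(cells):
    """Cross join of cells with themselves: distance for every ordered pair a != b.

    cells maps a cell id to planar (x, y) coordinates (an assumed layout).
    """
    distances = {}
    for (a, (xa, ya)), (b, (xb, yb)) in product(cells.items(), repeat=2):
        if a != b:
            # Euclidean distance: sqrt((xa - xb)^2 + (ya - yb)^2)
            distances[(a, b)] = math.hypot(xa - xb, ya - yb)
    return distances


if __name__ == "__main__":
    cells = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (6.0, 8.0)}  # made-up cells
    for pair, dist in sorted(cell_distances(cells).items()):
        print(pair, dist)
```

A cross join produces n × (n − 1) ordered pairs for n cells, so the result grows quadratically with the number of cells.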

Page 23: How do you decide where your customer was?

Movement Index

Finally, all that needs to be done is another simple query
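That query is not shown in the transcript, so the following is only one possible reading of the step: walk a subscriber's cell observations in time order and report a change of location whenever consecutive cells are far enough apart. The event layout, the `min_distance` threshold, and the function name are all hypothetical.

```python
import math


def location_changes(events, cells, min_distance):
    """Report moves between consecutive cells that are at least min_distance apart.

    events: (timestamp, cell_id) pairs for one subscriber, in any order.
    cells:  cell id -> planar (x, y) coordinates (an assumed layout).
    Returns a list of (timestamp, from_cell, to_cell, distance).
    """
    changes = []
    ordered = sorted(events)  # sort by timestamp
    for (_, prev_cell), (ts, cell) in zip(ordered, ordered[1:]):
        if cell == prev_cell:
            continue  # still on the same cell
        (x1, y1), (x2, y2) = cells[prev_cell], cells[cell]
        dist = math.hypot(x2 - x1, y2 - y1)
        if dist >= min_distance:
            changes.append((ts, prev_cell, cell, dist))
    return changes


if __name__ == "__main__":
    cells = {"A": (0, 0), "B": (0, 1), "C": (30, 40)}  # made-up coordinates
    events = [(3, "C"), (1, "A"), (2, "B")]
    print(location_changes(events, cells, min_distance=10))
```

The threshold keeps hops between neighbouring cells from being counted as travel; only the jump to a distant cell is reported.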

Img: http://bit.ly/1MKAiuT

Page 24: How do you decide where your customer was?

Movement Index

But one problem!

java.io.IOException: java.lang.IllegalArgumentException: Column [daily_calendar_id] was not found in schema!
Are you kidding me!!

Img: http://bit.ly/1N3QKRQ

Page 25: How do you decide where your customer was?

Movement Index

Just another bug: HIVE-11401
Workaround solution: hive.optimize.index.filter=false;
Permanent solution: Hive 2.0

Img: http://bit.ly/1W1USZM

Page 26: How do you decide where your customer was?

Is it enough?
Img: http://bit.ly/1ZUEsm7

Page 27: How do you decide where your customer was?

Ongoing Projects

Movement Prediction: Spark ML

Real-Time Location Analysis: Spark Streaming

SQL on Hadoop: Spark SQL, Impala… etc.

Img: http://bit.ly/25DxPsq

Page 28: How do you decide where your customer was?

Acknowledgements

Special thanks to
Caner CANAK
Uğur Cumhur ÇELİK

Img: http://bit.ly/1RVnde7

Page 29: How do you decide where your customer was?

Thank You!