How do you decide where your customer was?

Post on 16-Apr-2017


Burak IŞIKLI Erkan HASPULAT

How Do You Decide Where Your Customer Was?

Who are we?

Burak IŞIKLI
Senior Software Engineer, Turkcell
burak.isikli at turkcell dot com.tr
@burakisikli, github.com/burakisikli

Erkan HASPULAT
Senior Software Engineer, Turkcell
erkan.haspulat at turkcell dot com.tr
@erkanhaspulat, github.com/ehaspulat

Turkcell

• It is the first and only Turkish company ever to be listed on the NYSE.

Our Hadoop Journey

2013-…
15 TB data processing daily
6 TB log transfer daily
350 jobs running daily

Our Hadoop Journey

• ETL
CDR Analysis, Log Processing...
• Analytics
Fraud Analytics, Clickstream, Recommendation Engine
• Data Lake
Customer Journey

Architecture

[Architecture diagram. Labels: RDBMS; Logs; Tuna; Unix/Internal Tools; HADOOP; DB Dashboards; Archive Ad-Hoc Analysis; Mining; External Systems; Alarm Systems]

Issues

• The default Python is 2.6, but IPython with Spark requires Python 2.7+

• Security & Auditing Issues
Copying Data by Masking
Dynamic Data Masking
SOX Compliance

Location Analysis

Find the subscriber’s location using cell information.

11 billion rows/day
0.5 TB/day
2.5 hours processing time

Hadoop Streaming w/Perl, Sqoop

Recursive Process
1. Subscriber calls
2. XDR is generated
3. FTP/SCP process is started
4. Put into HDFS

Dispatcher

Why not Flume? FTP, SCP, rsync?
Think about 6 TB/day.
rsync and FTP work serially!
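The slides do not describe how the Dispatcher is built. As a minimal sketch of the general idea of moving many files concurrently instead of one after another, the following Python uses a thread pool. The function name `dispatch_parallel`, the `workers` parameter, and the use of a local file copy as a stand-in for the real transfer step are all assumptions made for illustration.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def dispatch_parallel(sources, dest_dir, workers=4):
    """Copy each source file into dest_dir using a pool of worker threads.

    shutil.copy2 is only a local stand-in for the real transfer step
    (for example an upload into HDFS), which the slides do not describe.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    def transfer(src):
        target = dest / Path(src).name
        shutil.copy2(src, target)
        return target

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input files
        return list(pool.map(transfer, sources))
```

With a serial tool, total time is roughly the sum of all transfer times; with a pool, several transfers overlap, which matters when files arrive continuously.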

Dispatcher

Experiment - CDRs
Every 15 secs, up to 1 MB/file, total 10 MB of gzipped files

java.lang.OutOfMemoryError
PIG_HEAPSIZE=8000
Failed reallocation of scalar replaced objects
JDK-8145996

Img: http://bit.ly/1QSRkGn

Location Analysis

Join? Perl?
Mapper > Header/Trailer
JoinColumn-fileName-Rest
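The mapper on this slide appears to drop header and trailer lines and emit the join column, the file name, and the rest of the record. The original job is written in Perl; the sketch below shows the same idea in Python for Hadoop Streaming. The `|` delimiter, the `H`/`T` prefixes for header and trailer lines, and the position of the join column are assumptions, since the slides do not give the record layout.

```python
import os
import sys


def map_lines(lines, file_name, delimiter="|", join_index=0,
              header_prefix="H", trailer_prefix="T"):
    """Yield 'joinColumn<TAB>fileName<TAB>rest' for every data record."""
    for line in lines:
        line = line.rstrip("\n")
        # Assumed layout: header lines start with "H", trailer lines with "T"
        if not line or line.startswith((header_prefix, trailer_prefix)):
            continue
        fields = line.split(delimiter)
        if join_index >= len(fields):
            continue  # malformed record: no join column present
        key = fields[join_index]
        rest = delimiter.join(fields[:join_index] + fields[join_index + 1:])
        yield "\t".join((key, file_name, rest))


def main():
    """Entry point when used as a Hadoop Streaming mapper script."""
    # Hadoop Streaming exposes job configuration values as environment
    # variables, with dots replaced by underscores.
    name = os.environ.get("mapreduce_map_input_file",
                          os.environ.get("map_input_file", "unknown"))
    for record in map_lines(sys.stdin, os.path.basename(name)):
        print(record)
```

Emitting the join column first makes it the key that Hadoop Streaming sorts and groups on, so records with the same key reach the same reducer.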

Img: http://bit.ly/1TrUJhv

A tree in the forest

Img: http://go.nasa.gov/1SX3wGl

Volume

Data size over time:
May 2014: 1 TB
Jul. 2014: 1.5 TB
Nov. 2014: 2 TB
Mar. 2016: 4.5 TB

Data size is growing too fast! What about LTE?

Adapt to the volume

Perl Job   212 min   104 min
Pig Job    77 min    18 min

Upgrade

• No space left on disk!
• Hadoop upgrade
0.23 -> 1.3.1 -> 2.7.1
• Linear scalability

Nodes     1M + 4D    1M + 1SM + 16D
Disk      15 TB      698 TB
CPU       20 Core    224 Core
Memory    1024 GB    1.5 TB
Version   0.23       HDP 2.3.2

Industry-Specific Analysis

Competitor Comparison
E.g. shopping center comparison in Istanbul based on city, district, and demographic information (age, sex, income, job, etc.)

Industry-Specific Analysis

But how?
Perl?
Hive or Pig?
What else?

Industry-Specific Analysis

But how? Hive external partition
ALTER TABLE t1 ADD PARTITION (DAILY_CALENDAR_ID='20160101') LOCATION '/user/…/tlc/daily_calendar_id=20160101';
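With one partition per day, a statement like the one above has to be issued for every new day. A minimal sketch of generating those statements in Python follows. The function name and the base path `/data/example/tlc` used in the usage note are hypothetical; the real path is elided in the slide and is not reconstructed here. `IF NOT EXISTS` is added so that rerunning the script is harmless.

```python
from datetime import date, timedelta


def add_partition_statements(table, base_location, start, days):
    """Build one ALTER TABLE ... ADD PARTITION statement per day."""
    statements = []
    for offset in range(days):
        day_id = (start + timedelta(days=offset)).strftime("%Y%m%d")
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (daily_calendar_id='{day_id}') "
            f"LOCATION '{base_location}/daily_calendar_id={day_id}';"
        )
    return statements
```

For example, `add_partition_statements("t1", "/data/example/tlc", date(2016, 1, 1), 7)` returns seven statements that could be written to a file and passed to the Hive CLI.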

Movement Index

Signaling data is used to analyze subscribers' journeys between cities and the means of transport they use. Relevant for:
• Airline companies
• Bus companies
• Local government
• Survey companies

Movement Index

Simple Euclidean Distance

Equ: https://en.wikipedia.org/wiki/Euclidean_distance

Finding the change of location
First, find out the closeness of each cell using coordinates

Movement Index

Euclidean Distance Hive Query – Cross Join
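The Hive query itself is not included in the transcript. As an illustration of what a cross join over cells computes, the Python sketch below pairs every cell with every other cell and takes the Euclidean distance between their coordinates. The function name and the input shape (cell id mapped to planar `(x, y)` coordinates) are assumptions.

```python
from itertools import product
from math import hypot


def cell_distances(cells):
    """Return {(cell_a, cell_b): distance} for every ordered pair of distinct cells.

    cells maps a cell id to planar (x, y) coordinates.
    """
    return {
        (a, b): hypot(xa - xb, ya - yb)
        for (a, (xa, ya)), (b, (xb, yb)) in product(cells.items(), repeat=2)
        if a != b  # a cell is not compared with itself
    }
```

Like a SQL cross join, this produces on the order of n² pairs for n cells, so it is only practical when the cell table is small compared with the event data.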

Movement Index

Finally, all that needs to be done is another simple query
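The final query is not shown in the transcript. One plausible reading, sketched here in Python purely as an assumption, is that a change of location is flagged when two consecutive cells seen for a subscriber are farther apart than some threshold. The function name, the `min_distance` threshold, and the shape of the inputs are all hypothetical.

```python
def location_changes(visits, distances, min_distance):
    """Return (from_cell, to_cell) pairs where consecutive cells are far apart.

    visits: time-ordered list of cell ids for one subscriber.
    distances: {(cell_a, cell_b): distance}; pairs missing from the
    mapping are treated as unknown and skipped.
    """
    changes = []
    for prev, curr in zip(visits, visits[1:]):
        if prev == curr:
            continue  # still on the same cell
        dist = distances.get((prev, curr))
        if dist is not None and dist >= min_distance:
            changes.append((prev, curr))
    return changes
```

Hops between neighbouring cells fall below the threshold and are ignored, so only larger jumps are reported as movement.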

Img: http://bit.ly/1MKAiuT

Movement Index

But one problem!

java.io.IOException:java.lang.IllegalArgumentException: Column [daily_calendar_id] was not found in schema!
Are you kidding me!!

Img: http://bit.ly/1N3QKRQ

Movement Index

Just another bug: HIVE-11401
Workaround solution: hive.optimize.index.filter=false;
Permanent solution: Hive 2.0

Img: http://bit.ly/1W1USZM

Is it enough?

Img: http://bit.ly/1ZUEsm7

Ongoing Projects

Movement Prediction: Spark ML

Real-Time Location Analysis: Spark Streaming

SQL on Hadoop: Spark SQL, Impala, etc.

Img: http://bit.ly/25DxPsq

Acknowledgements

Special thanks to
Caner CANAK
Uğur Cumhur ÇELİK

Img: http://bit.ly/1RVnde7

Thank You!