How do you decide where your customer was?
Presented at DataWorks Summit / Hadoop Summit
Burak IŞIKLI Erkan HASPULAT
How Do You Decide Where Your Customer Was?
Who are we?
Burak IŞIKLI · Senior Software Engineer, Turkcell · burak.isikli at turkcell dot com.tr · @burakisikli · github.com/burakisikli
Erkan HASPULAT · Senior Software Engineer, Turkcell · erkan.haspulat at turkcell dot com.tr · @erkanhaspulat · github.com/ehaspulat
Turkcell
• It is the first and only Turkish company ever to be listed on the NYSE.
Our Hadoop Journey
2013–…
• 15 TB of data processed daily
• 6 TB of logs transferred daily
• 350 jobs running daily
Our Hadoop Journey
• ETL: CDR Analysis, Log Processing…
• Analytics: Fraud Analytics, Clickstream, Recommendation Engine
• Data Lake: Customer Journey
Architecture
[Diagram: RDBMS, logs, Tuna, and Unix/internal tools feed into HADOOP; outputs go to DB, dashboards, archive, ad-hoc analysis, mining, external systems, and alarm systems]
Issues
• Default Python is 2.6, but Spark's IPython needs Python 2.7+
• Security & auditing issues: copying data by masking, dynamic data masking, SOX compliance
Location Analysis
Find the subscriber’s location using cell information.
11 billion rows/day · 0.5 TB/day · 2.5 hours processing time
Hadoop Streaming w/ Perl · Sqoop
Recursive Process
1. Subscriber calls
2. XDR is generated
3. FTP/SCP process is started
4. Put into HDFS
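A minimal sketch of step 4 of the ingest loop above, assuming XDR files land in a local directory after the FTP/SCP step. The directory layout, file suffix, and HDFS target path are hypothetical, not Turkcell's actual setup:

```python
import subprocess
from pathlib import Path

def build_hdfs_put_commands(landing_dir, hdfs_dir):
    """Build one `hdfs dfs -put` command per delivered XDR file (step 4).

    landing_dir: local directory where FTP/SCP drops the files
    hdfs_dir:    target HDFS directory (illustrative, not the real path)
    """
    cmds = []
    # The *.xdr.gz suffix is an assumption about how delivered files are named.
    for f in sorted(Path(landing_dir).glob("*.xdr.gz")):
        cmds.append(["hdfs", "dfs", "-put", str(f), hdfs_dir])
    return cmds

def run_ingest(landing_dir, hdfs_dir):
    """Execute the puts; in production this would loop as new files arrive."""
    for cmd in build_hdfs_put_commands(landing_dir, hdfs_dir):
        subprocess.run(cmd, check=True)
```

Separating command construction from execution keeps the HDFS-put step easy to test without a cluster.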
Dispatcher
Why not Flume?
FTP, scp, rsync? Think about 6 TB/day: rsync and FTP work serially!
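The dispatcher's core idea, parallel transfers instead of serial rsync/FTP, can be sketched with a thread pool. `transfer` here is a stand-in for a real scp/FTP call, not the actual dispatcher code:

```python
from concurrent.futures import ThreadPoolExecutor

def transfer(path):
    # Stand-in for transferring one log file via scp or an FTP client;
    # a real dispatcher would invoke the external tool here.
    return f"transferred {path}"

def dispatch(paths, workers=8):
    """Transfer many files concurrently instead of one-by-one.

    pool.map preserves input order, so results line up with `paths`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transfer, paths))
```

With serial FTP the total time is the sum of all transfers; with a pool it approaches the slowest batch, which is what makes 6 TB/day feasible.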
Dispatcher Experiment – CDRs
Every 15 secs: up to 1 MB per file, 10 MB of gzipped files in total
java.lang.OutOfMemoryError
PIG_HEAPSIZE=8000
"Failed reallocation of scalar replaced objects" – JDK-8145996
Img: http://bit.ly/1QSRkGn
Location Analysis
Join? Perl?
Mapper > Header/Trailer Join: Column – fileName – Rest
Img: http://bit.ly/1TrUJhv
A tree in the forest
Img: http://go.nasa.gov/1SX3wGl
Volume
May 2014: 1 TB · Jul. 2014: 1.5 TB · Nov. 2014: 2 TB · Mar. 2016: 4.5 TB
Data size is growing too fast! What about LTE?
Adapt to the volume
Perl Job: 212 min → 104 min
Pig Job: 77 min → 18 min
Upgrade
• No space left on disk!
• Hadoop upgrade: 0.23 → 1.3.1 → 2.7.1
• Linear scalability
          Before     After
Nodes     1M + 4D    1M + 1SM + 16D
Disk      15 TB      698 TB
CPU       20 cores   224 cores
Memory    1024 GB    1.5 TB
Version   0.23       HDP 2.3.2
Industry-Specific Analysis
Competitor Comparison
E.g. shopping-center comparison in Istanbul based on city, district, and demographic information (age, sex, income, job, etc.)
Industry-Specific Analysis
But how?Perl?Hive or Pig?What else?
Industry-Specific Analysis
But how? Hive external partitions:
ALTER TABLE t1 ADD PARTITION (DAILY_CALENDAR_ID='20160101') LOCATION '/user/…/tlc/daily_calendar_id=20160101';
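Since one partition must be registered per day, the DDL is usually scripted. A sketch that generates one statement per daily folder; the table name `t1` comes from the slide, but the base path is a hypothetical stand-in (the slide's real path is truncated):

```python
from datetime import date, timedelta

def partition_ddl(table, base_path, day):
    """Generate one ALTER TABLE ... ADD PARTITION statement for a given day.

    base_path is a hypothetical HDFS prefix standing in for the
    truncated '/user/…/tlc' path on the slide.
    """
    d = day.strftime("%Y%m%d")
    return (f"ALTER TABLE {table} ADD PARTITION (DAILY_CALENDAR_ID='{d}') "
            f"LOCATION '{base_path}/daily_calendar_id={d}'")

def backfill(table, base_path, start, days):
    """One DDL statement per day, e.g. to register a month of daily folders."""
    return [partition_ddl(table, base_path, start + timedelta(n))
            for n in range(days)]
```

The generated statements would then be fed to `hive -e` or beeline; keeping the location layout `daily_calendar_id=YYYYMMDD` consistent lets Hive map folders to partitions cleanly.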
Movement Index
Subscribers' journeys between cities are analyzed from signaling data; useful for:
• Airline companies
• Bus companies
• Local government
• Survey companies
Movement Index
Simple Euclidean Distance: d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)²)
Equ: https://en.wikipedia.org/wiki/Euclidean_distance
Finding the change of location
First, find out the closeness of each pair of cells using their coordinates
Movement Index
Euclidean Distance Hive Query – Cross Join
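The same cross-join distance computation, sketched in plain Python rather than the actual Hive query; the cell records and distance threshold are illustrative:

```python
from itertools import combinations
from math import hypot

def close_cell_pairs(cells, max_dist):
    """Pairwise (cross-join style) Euclidean distance over all cells.

    cells: iterable of (cell_id, x, y) tuples in projected coordinates
           (hypothetical schema, mirroring what the Hive table might hold).
    Returns (id1, id2, distance) for pairs within max_dist, i.e. the
    'closeness of each pair of cells' from the previous slide.
    """
    pairs = []
    for (id1, x1, y1), (id2, x2, y2) in combinations(cells, 2):
        d = hypot(x1 - x2, y1 - y2)  # Euclidean distance
        if d <= max_dist:
            pairs.append((id1, id2, d))
    return pairs
```

Like the Hive cross join, this is O(n²) in the number of cells, which is why the result is precomputed once and then joined against, rather than recomputed per query.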
Movement Index
Finally, all that needs to be done is one more simple query.
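That final query boils down to spotting, per subscriber, where consecutive signaling records disagree on location. A sketch under the assumption that events arrive ordered by time; the record shape is illustrative:

```python
def city_transitions(events):
    """Detect location changes per subscriber.

    events: time-ordered (subscriber, city) tuples derived from signaling
            data (a hypothetical simplification of the real records).
    Returns (subscriber, from_city, to_city) for each change of city,
    the raw material for the movement index.
    """
    last = {}    # subscriber -> last seen city
    moves = []
    for sub, city in events:
        prev = last.get(sub)
        if prev is not None and prev != city:
            moves.append((sub, prev, city))
        last[sub] = city
    return moves
```

In Hive the same comparison of consecutive rows per subscriber would typically use a window function such as LAG over a PARTITION BY subscriber ORDER BY time clause.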
Img: http://bit.ly/1MKAiuT
Movement Index
But one problem!
java.io.IOException: java.lang.IllegalArgumentException: Column [daily_calendar_id] was not found in schema!
Are you kidding me?!
Img: http://bit.ly/1N3QKRQ
Movement Index
Just another bug: HIVE-11401
Workaround: set hive.optimize.index.filter=false;
Permanent solution: Hive 2.0
Img: http://bit.ly/1W1USZM
Is it enough?
Img: http://bit.ly/1ZUEsm7
Ongoing Projects
• Movement Prediction – Spark ML
• Real-Time Location Analysis – Spark Streaming
• SQL on Hadoop: Spark SQL, Impala, etc.
Img: http://bit.ly/25DxPsq
Acknowledgements
Special thanks to
Caner CANAK
Uğur Cumhur ÇELİK
Img: http://bit.ly/1RVnde7
Thank You!