Bui Quang Duy @ Septeni Technology
Hanoi 2014/01
Outline
- Introduction
- Hadoop
- Hadoop Architecture
- HDFS
- PYXIS & Hadoop
What's Septeni Technology?
- Started in Vietnam in March 2013
- 45 employees in total
- Heading to become the No.1 ad technology center in Asia
What's MapReduce?
- A programming model for distributing a task across multiple nodes
- Used to develop solutions that process large amounts of data in parallel on clusters of computing nodes
- Features of MapReduce:
  - Fault tolerance
  - Status and monitoring tools
  - A clean abstraction for programmers
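The programming model above can be sketched in a few lines of Python. This is a single-process illustration of the map/shuffle/reduce phases, not Hadoop's actual API; the function names and the word-count job are chosen for illustration only:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user's map function to every input record,
    yielding intermediate key/value pairs."""
    for record in records:
        yield from map_fn(record)

def shuffle(pairs):
    """Group intermediate key/value pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the user's reduce function to each key group."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count, the canonical MapReduce example
def wc_map(line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

docs = ["the quick brown fox", "the lazy dog"]
result = reduce_phase(shuffle(map_phase(docs, wc_map)), wc_reduce)
print(result)
```

The programmer only writes the two small functions (`wc_map`, `wc_reduce`); the framework owns distribution, grouping, and fault handling, which is the "clean abstraction" the slide refers to.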
MapReduce Execution Overview (diagram): the user program forks a master and worker processes; the master assigns map tasks over the input splits and assigns reduce tasks; map workers write intermediate key/value pairs to local disk; reduce workers read them remotely, run the intermediate operations, and write the output files.
Hadoop
- Open-source implementation of MapReduce by the Apache Software Foundation
- Created by Doug Cutting
- Derived from Google's MapReduce and Google File System (GFS) papers
- A software framework that supports data-intensive distributed applications under a free license
- Enables applications to work with thousands of independent computers and petabytes of data
Hadoop Components
- HDFS (storage): self-healing, high-bandwidth clustered storage
- MapReduce (processing): fault-tolerant distributed processing
Hadoop Architecture (diagram): the Namenode (backed by a Secondary Namenode) and the JobTracker coordinate the cluster; each Data node runs a TaskTracker that executes Map and Reduce tasks.
Dataflow in Hadoop
- Map tasks write their output to local disk
  - Output becomes available after the map task completes
- Reduce tasks write their output to HDFS
  - Once a job is finished, the next job's map tasks can be scheduled and will read their input from HDFS
- Fault tolerance is therefore simple: re-run tasks on failure
  - No consumer sees partial operator output
HDFS Basics
- HDFS is a filesystem written in Java
- Sits on top of a native filesystem
- Provides redundant storage for massive amounts of data
- Runs on commodity hardware
HDFS Data
- Data is split into blocks and stored on multiple nodes in the cluster
- Each block is usually 64 MB or 128 MB
- Each block is replicated multiple times
- Replicas are stored on different data nodes
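The storage arithmetic behind block splitting and replication is easy to make concrete. A small sketch, assuming the common 128 MB block size and a replication factor of 3 (these are typical defaults, not universal settings):

```python
import math

def hdfs_storage(file_size_bytes, block_size=128 * 1024 * 1024, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster)
    for a file under HDFS-style block splitting and replication."""
    blocks = math.ceil(file_size_bytes / block_size)
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes

# A 1 GB file: 8 blocks of 128 MB, and 3 GB of raw cluster storage
blocks, raw = hdfs_storage(1024 ** 3)
print(blocks, raw)
```

Spreading those 8 blocks (and their replicas) across different data nodes is what lets HDFS survive node failures and serve one file from many disks in parallel.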
What's PYXIS?
PYXIS is a one-stop service for ad management, measurement, and optimization of online ads, specialized in Facebook. It is the only system in the world approved by Facebook as both a PMD (ads management) and an MMP (measurement) partner.
Specialized in mobile and LTV maximization.
Main Features
- Massive ad creation
- Graphical reporting
- Auto optimization
- Mobile measurement
- LTV maximization
- Auto bidding: automated optimization via auto bidding and budget reallocation
Data source:
- Ad information (targeting segment & ad creative)
- Delivery data (impressions, clicks, cost, …)
- Action data (likes, installs, billing, LTV)
Summarize & analyze: PYXIS collects this massive data every hour, then summarizes and integrates it into optimized units.
Tuning campaigns & ads: bid prices and budgets are changed based on the unit data.
The end!