Building a Big Data Analytics Service Framework for Mobile...

Page 1: Building a Big Data Analytics Service Framework for Mobile ...cis.csuohio.edu/~sschung/CIS601/MobileBased Spark Recommendat… · Building a Big Data Analytics Service Framework for

Building a Big Data Analytics Service Framework

for Mobile Advertising and Marketing

Lei Deng

School of Computer

Northwestern Polytechnical University

Xi'an, China

[email protected]

Jerry Gao and Chandrasekar Vuppalapati

Computer Engineering Department

San Jose State University

San Jose, United States

Corresponding mail: [email protected]

Abstract - The unprecedented growth in mobile device adoption and the rapid advancement of mobile technologies and wireless networks have created new opportunities in mobile marketing and advertising. The opportunities for mobile marketers and advertisers include engaging customers in real time, improving customer experience, building brand loyalty, increasing revenues, and driving customer satisfaction. The challenges, however, include how to analyze the troves of data that mobile devices emit and how to derive customer engagement insights from that data. This paper addresses these challenges by developing a Big Data mobile marketing analytics and advertising recommendation framework. The proposed framework supports both offline and online advertising operations, in which selected analytics techniques provide advertising recommendations based on Big Data collected on mobile users' profiles, access behaviors, and mobility patterns. The paper presents a prototype solution design as well as its application and some experimental results.

Keywords - big data analytics, big data application, big data analytics service, mobile advertising, mobile marketing, e-commerce

I. INTRODUCTION

Mobile advertising poses three distinctive opportunities and challenges to marketers and advertisers: 1) unprecedented adoption, 2) customer engagement challenges due to huge mobile datasets, and 3) the potential impact of mobility on digital marketing.

First, mobile device adoption is increasing at a rapid pace. As indicated in [1], “every day more than 1 million new Android devices are activated worldwide”. Similarly, Apple, in September 2014, announced that “it had sold over 10 million iPhone 6's in the first three days of it being available. This is only 1 million more than the over 9 million iPhone 5c's and 5s' that it sold in 2013 [2].” In [3], the McKinsey Global Institute predicted that the full potential of the mobile Internet is yet to be realized; over the coming decade, this technology could fuel significant transformation and disruption, not least from its potential to bring two billion to three billion more people into the connected world, mostly from developing economies. The McKinsey Global Institute estimates that the mobile Internet could generate an annual economic impact of $3.7 trillion to $10.8 trillion globally by 2025. This value would come from three main sources: a) improved delivery of services, b) productivity increases in select work categories, and c) the value from Internet use for the new Internet users who are likely to be added in 2025 [24].

Second, mobile devices come with different form factors, technologies, data points, and operating systems. The same is true of their users, who exhibit diverse demographics, personal preferences, behavior, social presence, and location usage. Cohen [4] stated that, according to Nielsen, 61% of US subscribers owned a smartphone. The demographics of mobile usage in the USA show that 81% of adults aged 25-34 have smartphones, almost 70% of US teens aged 13-17 use a smartphone, and 50% of US adults aged 55+ own a smartphone. This clearly shows the age variation associated with smartphone usage in the USA. As the complexity of mobile phones increases and as mobile users' demographics and personal preferences diversify, the size of the datasets associated with the devices and users will increase dramatically. In order to engage with customers in a meaningful way, one needs to analyze these huge and diverse datasets.

Third, a May 2014 Gartner market analysis [5] reveals that mobility increasingly defines digital marketing. According to the Gartner research, consumers are increasingly using their mobile phones as the remote control for their lives. The Gartner mobility market survey reveals that: a) 43% of respondents spend more time on tablets than on desktops, b) 80% of smartphone owners have used their device while shopping, c) 53% of searchers make a purchase as a result of a smartphone search, and d) 86% use their phones while consuming other media. The social and commerce activities that consumers engage in on their mobile phones are redefining mobile advertising.

It is clear that mobile devices present huge opportunities to marketers and advertisers. However, there are challenges too. The challenges are associated with the type of mobile data (structured and unstructured), the exclusivity of the data, the privacy and context information associated with the data, and the mining of advertising insights from mobile user data. Unless the data is thoroughly mined, there is a huge chasm between mobile users' expectations and the exclusive campaigns that marketers and advertisers run to target the mindshare and wallet share of mobile users. Analytics will help marketers and advertisers understand consumer behavior and how consumers expect to communicate with them. In essence, analytics will help to close the chasm.

This paper addresses the strong demand for mobile advertising with a big data analytics approach. Its major contribution is a big data analytics service framework that supports mobile advertising based on multi-dimensional big data analytics. In this approach, we focus on location-based mobile advertising by analyzing big data that includes mobile user profiles, mobile app usage patterns, location-based mobile user access patterns, and merchant-related data. The paper presents our big data analytics methods, service framework design, and case study results.

The structure of this paper is as follows. Section 2 discusses the basic concepts and methods of big data analytics for mobile advertising. Section 3 presents our big data analytics service system, focusing on its service framework. Section 4 discusses the related design and implementation decisions, and Section 5 shows a case study. The conclusion and future work are included in Section 6.

II. UNDERSTANDING BIG DATA ANALYTICS FOR MOBILE ADVERTISING

Big data analytics.

Big data computing environment towards analytics: Hadoop attracts much attention as one of the earliest widely adopted open source distributed computing environments. But other platforms have interesting advantages over the typical Hadoop implementation, especially in real-time analytics of dynamic information, where Hadoop does not meet the requirement. In contrast to Hadoop, a batch processing framework, Storm is a stream processing framework that focuses on continuous computation [6]. Storm was developed at Twitter to process the hundreds of millions of tweets generated every day and is now an open source big data analysis system. Spark is a scalable data analysis platform based on in-memory computing and has a performance advantage over Hadoop's cluster storage method [7]. Spark is written in Scala and offers a single data processing environment. Spark supports iterative tasks on distributed datasets.

Big data analytics techniques: There are many Big Data techniques, including association rule learning, data mining, cluster analysis, crowdsourcing, machine learning, text analytics, classification, data fusion, network analysis, optimization, predictive modeling, regression, spatial analysis, time series analysis, and others. Which ones are used depends on the type of data being analyzed, the technology available, and the research questions one is trying to answer.

Marketing analytics and advertising recommendation

Decision Tree: The decision tree algorithm is used to classify data by its attributes and decide the outcome of the class attribute. Constructing a decision tree requires both the class attribute and the item attributes. A decision tree is a tree-like structure in which the intermediate nodes represent attributes of the data, the leaf nodes represent the outcome, and the branches hold the attribute values. Decision trees are widely used in classification because no domain knowledge is needed to construct them. Figure 1 shows simple decision trees.

Figure 1 Decision Tree Examples

The first step in the decision tree algorithm is to identify the root node for the given set of data. Multiple methods exist for choosing the root node; information gain and Gini impurity are the primary ones. The root node plays an important role in deciding which side of the decision tree the data falls into. Like other classification methods, decision trees are constructed using training data and tested with test data.

Information Gain: Information gain is used to select the root node and the branch nodes in the decision tree. Information gain is calculated using entropy and information. Entropy is calculated using the following formula [8].

Information of the attribute is calculated using the following formula.
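
As a sketch of the standard definitions, which we assume match those intended in [8]: let D be the training set with m classes, let p_i be the fraction of tuples in D that belong to class i, and let attribute A split D into subsets D_1, ..., D_v. The symbols D, m, p_i, v, and D_j are notational assumptions introduced here for illustration.

```latex
\begin{align}
\mathrm{Entropy}(D) &= -\sum_{i=1}^{m} p_i \log_2 p_i \\
\mathrm{Info}_A(D) &= \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Entropy}(D_j) \\
\mathrm{Gain}(A) &= \mathrm{Entropy}(D) - \mathrm{Info}_A(D)
\end{align}
```

For example, a set with two equally likely classes has entropy 1 bit, and an attribute that separates the two classes perfectly has Info_A(D) = 0 and therefore a gain of 1 bit.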

The information gain of an attribute is the difference between the entropy and the information of that attribute. The attribute with the highest information gain becomes the root node, and the next-level nodes are identified using the attributes with the next highest information gain. The algorithm is listed below.

Algorithm:
Step 1: Calculate the information gain for all the attributes.
Step 2: Select as the root node the attribute with the highest information gain.
Step 3: For each value of the root node:
  Step 4: Create a node for the attribute with the next highest information gain.
  Step 5: For each value of the node:
    Step 6: Create a subset of the training data for this node.
    Step 7: If all the values of the class attribute are the same, create a leaf node and stop.
    Step 8: Else go to step 5 and continue.
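
The steps above can be sketched in Python as follows. This is a minimal ID3-style illustration rather than the implementation used in the prototype: it re-computes the information gain on each subset instead of ranking the attributes once, and the toy data, the attribute names (`age`, `near_shop`, `click`), and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(rows, target):
    """Shannon entropy (base 2) of the class attribute over the rows."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, attr, target):
    """Class entropy minus the weighted entropy after splitting on attr."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attrs, target):
    """Return a leaf label, or a nested dict {attr: {value: subtree}}."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:      # all class values are the same: leaf node
        return labels[0]
    if not attrs:                  # no attributes left: fall back to majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = build_tree(subset, remaining, target)
    return tree

# Hypothetical toy data: does a mobile user click a coffee-shop ad?
training = [
    {"age": "young", "near_shop": "yes", "click": "yes"},
    {"age": "young", "near_shop": "no", "click": "no"},
    {"age": "adult", "near_shop": "yes", "click": "yes"},
    {"age": "adult", "near_shop": "no", "click": "no"},
]
tree = build_tree(training, ["age", "near_shop"], "click")
print(tree)
```

In this toy set, `near_shop` separates the classes perfectly, so it is chosen as the root and each of its branches becomes a leaf.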

K Means Clustering: Clustering is identifying and classifying items into similar groups. K-means clustering classifies the items into k clusters based on their similarity. K is the number of clusters, which must be decided before starting the clustering process. The whole solution depends on the value of K, so it is very important to choose it well. A data point is grouped into a cluster based on the Euclidean distance between the point and the centroid of the cluster. Initial clustering can be done in one of three ways.
1. Dynamically Chosen: In this method, we choose the first K items and assign them to the K clusters.
2. Randomly Chosen: In this method, we randomly select the values and assign them to the K clusters.
3. Choosing from Upper and Lower Boundaries: In this method, we choose values that are very distant from each other and use them as the initial values for each cluster.

Clustering Algorithm: (see Figure 2)

Figure 2 Clustering Algorithm

Step 1: Choose the initial values using one of the above three methods.
Step 2: For each additional value:
  Step 3: Calculate the Euclidean distance between this point and the centroid of each cluster.
  Step 4: Move the value to the nearest cluster.
  Step 5: Calculate the new centroid for the cluster.
Step 6: Repeat steps 3 to 5.
Step 7: Calculate the centroid of each cluster.
Step 8: For each value:
  Step 9: Calculate the Euclidean distance between this value and the centroid of every cluster.
  Step 10: Move the value to the nearest cluster.
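
A minimal Python sketch of this procedure is given below. It uses the batch (Lloyd-style) variant, which reassigns all points and then recomputes every centroid in each pass, rather than the incremental first pass of steps 2 to 6, and it uses the "first K items" initialisation. The function names, the sample points, and the `max_iter` cap are illustrative assumptions.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def nearest(point, centers):
    """Index of the center closest to point by Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def k_means(points, k, max_iter=100):
    centers = [tuple(p) for p in points[:k]]   # first K items as initial centers
    assignment = None
    for _ in range(max_iter):
        new_assignment = [nearest(p, centers) for p in points]
        if new_assignment == assignment:       # no point moved: converged
            break
        assignment = new_assignment
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:                        # keep the old center if a cluster is empty
                centers[i] = centroid(members)
    return centers, assignment

# Hypothetical two-dimensional user features.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centers, assignment = k_means(points, k=2)
print(centers, assignment)
```

Because the result depends on the initial centers, running the sketch with a different initialisation method from the list above may produce different clusters on less well-separated data.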

K- Nearest Neighbour: The k-nearest-neighbor method was

first described in the early 1950s. The method is labor

intensive when given large training sets, and did not gain

popularity until the 1960s when increased computing power

became available. It has since been widely used in the area of

pattern recognition [8].

Nearest-neighbor classifiers are based on learning by

analogy, that is, by comparing a given test tuple with training

tuples that are similar to it. The training tuples are described

by n attributes. Each tuple represents a point in an n-

dimensional space. In this way, all of the training tuples are

stored in an n-dimensional pattern space. When given an

unknown tuple, a k-nearest-neighbor classifier searches the

pattern space for the k training tuples that are closest to the

unknown tuple. These k training tuples are the k “nearest

neighbors” of the unknown tuple.

When the 'k' closest points are obtained, the unknown

sample is then assigned to the most common class among

those k-points. In case of k=1, the unknown sample is

assigned to the closest point in the pattern space. The

closeness is measured using the distance between the two

points. The following table defines some of the approaches to

find distances between two points.

Among the above-mentioned distances, the most used

similarity/distance metric is Euclidean distance followed by

Manhattan. kNN features the following properties.


1. It is instance-based

2. It is a Lazy Learning algorithm

Eager learners construct a generalization model before classification begins, i.e., before receiving any unknown samples to classify. Back propagation and decision tree induction are examples of eager learners. In contrast, lazy learners such as k-nearest neighbors do not construct a classifier until they receive an unknown sample; they simply store the training samples until then. Lazy learners may therefore incur additional costs in computation and in the memory needed to store all the training samples, especially when the number of training samples to be compared with the unknown sample is large. Lazy learner algorithms thus need efficient storage and indexing techniques. Since computation is deferred until an unknown sample is presented, classification is slower. Despite these disadvantages, lazy learners are fast to train, since training amounts to storing the samples. Unlike back propagation or decision trees, kNN applies equal weights to all the attributes. This may cause confusion when the data contain many irrelevant attributes [8].

An instance can be considered as a point in the n-dimensional pattern space, where each feature of the instance corresponds to one dimension. Closeness is defined using the distance measures mentioned in the table. A suitable metric yields a large distance between points that lie in different classes and a small distance between analogous points that lie in the same class.
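
The nearest-neighbor classification described above can be sketched as follows. This is a brute-force illustration with no indexing, using the Euclidean and Manhattan distances named in the text; the function names and sample points are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_classify(training, query, k=3, distance=euclidean):
    """training is a list of (point, label); return the majority label of the k nearest."""
    neighbours = sorted(training, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Lazy learning: the "model" is just the stored training tuples.
training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 8), "B")]
print(knn_classify(training, (2, 2), k=3))
print(knn_classify(training, (7, 9), k=1, distance=manhattan))
```

All the work happens at query time, which illustrates why classification is slow for large training sets and why efficient indexing matters.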

Recommendation Systems: Recommender systems fall into two groups: content-based and collaborative filtering. Content-based systems examine properties of the items recommended. For instance, if a Netflix user has watched many cowboy movies, the system recommends a movie classified in the database as having the “cowboy” genre. Collaborative filtering systems recommend items based on similarity measures between users and/or items. The items recommended to a user are those preferred by similar users. Recommendation systems apply knowledge discovery techniques to provide personalized recommendations. The number of people accessing the web today has grown tremendously, and dealing with the resulting sparse data set is a big challenge for a recommender system. In traditional recommendation systems, which use collaborative filtering algorithms, the amount of work increases as the number of users increases. A new generation of recommender systems is designed to quickly produce highly accurate recommendations for users on the web. Two types of collaborative filtering techniques are generally used to provide recommendations: user-based collaborative filtering and item-based collaborative filtering. The high-level architecture of a recommendation system is shown below [9].

Figure 3 One Recommendation Process [9]

The recommendation process is performed in three steps, each

of which is handled by a separate component:

Content Analyzer: When information has no structure (e.g.

Geo-Location details, user preferences, social media posts),

some kind of pre-processing step is needed to extract

structured relevant information. The responsibility of the

component is to represent the content of items (e.g. documents,

Web pages, GeoLocation, product descriptions, etc.) coming

from information sources in a form suitable for the next

processing steps. Data items are analyzed by feature extraction

techniques in order to shift item representation from the

original information space to the target one. This

representation is the input to the PROFILE LEARNER and

FILTERING COMPONENT;

Profile Learner: This module collects data representative

of the user preferences and attempts to generalize this

data, in order to frame the user profile.

Filtering Component: This module exploits the user

profile to suggest relevant items by matching the profile

representation against that of items to be recommended.

III. THE SYSTEM OVERVIEW

To support system scalability, we use a holistic approach for location-based ad recommendations. Our system leverages the latest open source technologies to create a big data processing platform. The core recommendation engine trains a predictive model on a training set by using machine learning algorithms such as collaborative filtering, clustering, and classification.

System Architecture

As shown in Figure 4, we designed and developed a domain-specific big data service platform for mobile advertising and marketing. The system enables location-based advertising to engage with its target customers by studying their profiles and dynamic behavior patterns. Unlike other data analytics engines, our system provides a holistic advertising recommendation approach for mobile users by providing a real-time, big data based solution for precise marketing and analysis. The current system uses state-of-the-art big data technologies, such as MongoDB and Spark, over a cloud infrastructure. The outcome of this research project consists of three parts: a) precise market advertising and analysis algorithms; b) recommendation analytics algorithms; and c) a prototype system that implements the proposed algorithms and solutions for location-based advertising. Figure 4 shows the detailed layered architecture.

Figure 4 A Big Data Ad Recommendation Service System Architecture

The System Functions

Figure 4 shows the following function components.

1) Device Location: If the user has enabled location sharing on the device, the user's location can be tracked by the GeoID associated with the current location. The GeoID can be resolved into the latitude and longitude corresponding to the zip code of the location associated with that user.

2) Supporting (latitude, longitude) --> (ZipCode, Country, State, Street #): Latitude and longitude can be mapped to a zip code, country, state, and street number by looking up the database. The accuracy may vary depending on the GPS signal and connection, or on the accuracy of the Wi-Fi router.

3) User Profile and Interests: The profile of the user to whom the ads are going to be served also needs to be tracked. Various features would be tracked, such as gender, age, address, profession, and interests. Based on the profile of the user as well as the GeoID, the platform would provide modeled recommendations to the customer in his or her mobile applications.

4) Ad Publisher Products Item Sets: As with the mobile user data, the advertiser product data items are very important for generating the most appropriate ads for the mobile user. Our system design shall correlate the advertiser product items with the user profile or preferences based on pre-defined system mapping and recommendation output. At minimum, our system assumes that an advertiser will store all the product item attributes that enable correlation of mobile users to the products (see Table 1 below).

Table 1: The Product Attribute Set

Product Item Set Attributes

Product Details – include name, description, availability

Product Category – includes type

Product Attributes such as color, size, mode

Product dimensions

Model Number

Customer reviews (if any)

Product ratings

Product Cost and Manufacturing Details

Product & Location specifics

5) Identifying relevant advertisements: The recommendation engine would develop a predictive model, based on which relevant advertisements would be provided to the end user. The model would be created by applying machine learning techniques such as collaborative filtering, clustering, and classification to the training dataset. A key benefit of the collaborative filtering approach is that it does not depend on machine-analyzable content and hence is capable of accurately recommending products without prior knowledge of the item itself.

6) Customer-Oriented Requirements: The customer-oriented requirements for our system generally involve providing qualified ad recommendations based on user preferences, behavior, and insights from the social profile. The system would track the geolocation of the user and determine the location ID and location category. The location category could be an industrial district, university, hospital, etc. Based on the location category associated with the user location and the user profile, the recommendation engine would push ads to the end user. Table 2 below lists the related attributes.

Table 2: Mobile User Attributes

Mobile User or Customer Attributes Summary (expandable based on the Use Cases)

Mobile User Demographics

Mobile Location & Location Preferences

Social Profile Items preferred over the internet or posting of items on Twitter

Mobile Application Access

Mobile Commerce Data & Item Click Through Data

Customer Purchase Transaction Data

Key Technologies and Solutions

We have used a number of technologies and developed several solutions to support our recommendation system.

1) Real-time analytics based on Spark

We have developed a recommendation engine that addresses

both off-line and real-time requirements.

Online processing generally involves direct interaction with

one or more systems and the most current slice of data for data

profiling to pick outliers, real-time customer engagement, real-

time advertisement, etc [10]. This processing usually involves

relatively low throughput since significant time may be spent

waiting for user input. Offline processing is typically batch-

oriented and often involves large volumes of data being

processed with little or no intervention (usually a system

operator rather than an end user) [11].

Apache Spark Streaming is an open source big data processing system intended for distributed, real-time stream processing. Spark Streaming implements a data flow model in which data (time series facts) flow continuously through a topology (a network of transformation entities). The slice of data being analyzed at any moment in an aggregate function is specified by a sliding window, a concept from Complex Event Processing (CEP). A sliding window may be as short as the "last minute" or as long as the "last 60 minutes", and it shifts constantly over time. Data can be fed to Spark Streaming from sources such as Flume, Twitter, ZeroMQ, Kinesis, or plain old TCP sockets [12].

Spark Streaming Architecture:
- Receive data streams from input sources, process them in a cluster, and push results out to databases/dashboards
- Scalable, fault-tolerant, second-scale latencies [10]
- Chop up data streams into batches of a few seconds
- Spark treats each batch of data as RDDs (Resilient Distributed Datasets) and processes them using RDD operations
- Processed results are pushed out in batches [11]

Discretized Stream (DStream):
- Represents a stream of data
- Implemented as a sequence of RDDs
- MapReduce is performed on each batch for aggregation [10]
- The size of each batch is defined as the window length, which is in seconds. The sliding interval is the time difference between two consecutive windows. [10]
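
To make the windowing idea concrete, the sketch below simulates it in plain Python. It does not use the Spark API: the stream is modeled as a list of micro-batches, and `window_length` and `slide_interval` are counted in batches as a stand-in for multiples of the batch duration. The function name and the event categories are hypothetical.

```python
from collections import Counter

def windowed_counts(batches, window_length, slide_interval):
    """Yield (end_index, Counter) for each window over a list of micro-batches."""
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        counts = Counter()
        for batch in window:       # aggregate each batch, then merge into the window
            counts.update(batch)
        yield end, counts

# Hypothetical stream of ad-category click events, already chopped into batches.
batches = [
    ["food", "food", "retail"],
    ["retail"],
    ["food", "travel"],
    ["travel", "travel"],
]
results = list(windowed_counts(batches, window_length=2, slide_interval=1))
for end, counts in results:
    print(end, dict(counts))
```

With a window of two batches sliding by one batch, consecutive windows overlap by one batch, so each event contributes to up to two window results.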

The recommendation engine contains an offline training system that produces pre-aggregated results for ad recommendation to end users. It also works in real time on top of Spark: the real-time recommendation system loads in-stream data as training datasets and can, moreover, leverage the pre-aggregated results produced by machine learning models trained in offline batch mode.

2) GEO information integration with profile datasets.


To support location-based mobile advertising, our system needs several fundamental profile datasets, such as geography information describing several aspects of a location. Other information is also important for targeting and ad mapping, such as application profiles, merchant profiles (i.e., shop profiles), and user profiles. The ad-related information is stored in MongoDB, a persistent data repository that is constantly updated. On the client side, the system is built with pure browser-based HTML5 technologies.
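As an illustration of how the profile datasets could be combined for location-based targeting, the sketch below selects candidate ads whose merchant is in the same location as the user. The record layout and the field names (`geo_id`, `merchant_id`, `ad_id`) are assumptions made for this example; the actual MongoDB schema is not given in the paper, and plain Python dictionaries stand in for database documents.

```python
def candidate_ads(user, merchants, ads):
    """Return the ads whose merchant shares the user's GEO ID."""
    local_merchants = {m["merchant_id"] for m in merchants
                       if m["geo_id"] == user["geo_id"]}
    return [ad for ad in ads if ad["merchant_id"] in local_merchants]

# Hypothetical profile records.
user = {"user_id": "u1", "geo_id": "85001"}
merchants = [
    {"merchant_id": "m1", "geo_id": "85001"},
    {"merchant_id": "m2", "geo_id": "85281"},
]
ads = [
    {"ad_id": "a1", "merchant_id": "m1"},
    {"ad_id": "a2", "merchant_id": "m2"},
    {"ad_id": "a3", "merchant_id": "m1"},
]
matched = [ad["ad_id"] for ad in candidate_ads(user, merchants, ads)]
```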

3) Decision-based ML algorithms

The system provides ad recommendations as a service for ad and marketing service users, using both online and offline solutions. Our goal in designing the system is to support datasets from different business verticals and various business users. The solutions should require minimal business domain knowledge. Hence, we have selected the Decision Tree algorithm, since it is a supervised learning approach that requires minimal domain knowledge.
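The paper does not state which split criterion its decision tree uses. As a minimal sketch, assuming information gain (as in ID3), the following code picks the attribute that best separates users who clicked an ad from those who did not. The attribute names and the four training rows are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on one attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_split(rows, attributes, target):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical training rows: did the user click the ad?
rows = [
    {"age_group": "18-24", "near_store": "yes", "clicked": "yes"},
    {"age_group": "18-24", "near_store": "no", "clicked": "no"},
    {"age_group": "25-34", "near_store": "yes", "clicked": "yes"},
    {"age_group": "25-34", "near_store": "no", "clicked": "no"},
]
root = best_split(rows, ["age_group", "near_store"], "clicked")
```

In this toy data the proximity attribute separates the classes perfectly while the age group carries no information, so the tree would split on proximity first without any hand-written business rule.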

The system also uses the K-means clustering approach. Figure 5 shows the procedure below.

Figure 5 The Procedure of K-Means

4) Clustering

We choose the K-means algorithm, which takes as input the number k of desired clusters. To start, the algorithm takes the first k items as the centers of k unique clusters. During each iteration, each remaining item is compared against the centers and assigned to the cluster with the closest center. In the next iteration, the cluster centers are re-computed from the clusters formed in the previous pass, and the cluster membership is re-evaluated. The algorithm is run on these items, which in our case are mobile user profile data.
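The procedure described above can be sketched in a few lines of plain Python. This is an illustrative version, not the system's implementation: squared Euclidean distance, the stopping rule, and the two-feature user profiles (age, daily app minutes) are assumptions made for the example.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iterations=20):
    """K-means over equal-length numeric tuples.

    The first k points seed the cluster centers, as in the text.
    Returns (centers, assignment), where assignment[n] is the cluster
    index of points[n].
    """
    centers = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # empty cluster keeps its center
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, assignment

# Hypothetical user profiles: (age, daily app minutes).
profiles = [(20, 30), (45, 5), (22, 35), (50, 10), (21, 28), (48, 7)]
centers, assignment = kmeans(profiles, k=2)
```

Seeding with the first k items makes the result depend on input order; shuffling the items or trying several seeds is a common way to reduce that sensitivity.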

The system uses the adjusted cosine measure, given by the following formula, which looks at the angle between two vectors of ratings, where a smaller angle is regarded as implying greater similarity. We normalize our data to remove noise. In this version of the cosine measure, the difference in scale is taken into account.

where R_{i,c} is the rating of item c by mobile user i, A_c is the average rating of user i for all the co-rated items, and I_{i,j} is the set of items rated by both user i and user j.
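A mean-centered cosine similarity of this kind can be sketched as follows. The sketch assumes that each user's ratings are centered on that user's own average over the co-rated items, which is one reading of the definitions above; under that assumption the measure coincides with the Pearson correlation computed over the co-rated items. The function name and the sample ratings are hypothetical.

```python
import math

def adjusted_cosine(ratings_i, ratings_j):
    """Mean-centered cosine similarity of two users over co-rated items.

    ratings_i, ratings_j -- dicts mapping item id -> rating
    """
    co_rated = sorted(set(ratings_i) & set(ratings_j))  # items rated by both
    if not co_rated:
        return 0.0
    # Assumption: center each user's ratings on that user's own average
    # over the co-rated items.
    mean_i = sum(ratings_i[c] for c in co_rated) / len(co_rated)
    mean_j = sum(ratings_j[c] for c in co_rated) / len(co_rated)
    num = sum((ratings_i[c] - mean_i) * (ratings_j[c] - mean_j)
              for c in co_rated)
    den_i = math.sqrt(sum((ratings_i[c] - mean_i) ** 2 for c in co_rated))
    den_j = math.sqrt(sum((ratings_j[c] - mean_j) ** 2 for c in co_rated))
    if den_i == 0 or den_j == 0:
        return 0.0  # a user with constant ratings carries no direction
    return num / (den_i * den_j)

# Hypothetical ratings: bob uses a 10-point scale, alice a 5-point scale.
alice = {"ad1": 5, "ad2": 3, "ad3": 1}
bob = {"ad1": 10, "ad2": 6, "ad3": 2, "ad4": 7}
carol = {"ad1": 1, "ad2": 3, "ad3": 5}
sim_same_taste = adjusted_cosine(alice, bob)
sim_opposite = adjusted_cosine(alice, carol)
```

Because the ratings are centered before the angle is computed, alice and bob come out as near-identical in taste even though they rate on different scales, while alice and carol come out as opposites.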

5) Similarity Analysis

A small but important step in this system is the similarity analysis. It takes the product item information and the customer information and tries to find similarities between them, using association rule mining to identify customer patterns. As input, the system takes the customer's profile, interests, purchase history related to the advertiser's or marketer's products, and other profile-based information. With association rule mining, the system tries to find associations between the products to be suggested and the customer's buying history, location, preferences, and patterns. The figure below explains the methods used to calculate the association and to predict the confidence that the customer will buy or switch to the product.
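The two standard quantities behind association rule mining, support and confidence, can be computed directly from purchase histories. The sketch below uses their usual definitions; the function names and the four sample transactions are made up for illustration and are not taken from the system.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base

# Hypothetical purchase histories, one set of products per customer visit.
history = [
    {"coffee", "bagel"},
    {"coffee", "bagel", "juice"},
    {"coffee"},
    {"juice"},
]
conf_coffee_to_bagel = confidence(history, {"coffee"}, {"bagel"})
conf_bagel_to_coffee = confidence(history, {"bagel"}, {"coffee"})
```

The rule is directional: every bagel buyer here also bought coffee, but only two of the three coffee buyers bought a bagel, so a bagel ad shown to a coffee buyer would be suggested with lower confidence than the reverse.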

The system constructs a utility matrix using the utility function to capture the users' preferences and ratings for previously recommended or purchased products, and uses them to recommend a new product to someone else. The utility function and the matrix generated from the function look like this. Here, the utility matrix gives a clear picture of the missing preferences for the items. The aim of this system is to fill the gaps by recommending products whose attributes fit the customer's preferences.
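A utility matrix of this kind can be illustrated with a small sketch that turns rating records into a user-by-item table and lists the gaps a recommender would try to fill. The triple format, the function names, and the sample ratings are assumptions for the example; missing preferences are represented as `None`.

```python
def build_utility_matrix(ratings):
    """Turn (user, item, rating) triples into a user x item matrix.

    Missing preferences are stored as None.
    """
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    matrix = {u: {i: None for i in items} for u in users}
    for user, item, rating in ratings:
        matrix[user][item] = rating
    return matrix

def missing_entries(matrix):
    """(user, item) pairs with no known preference yet."""
    return [(u, i) for u, row in matrix.items()
            for i, r in row.items() if r is None]

# Hypothetical ratings of previously recommended ads.
ratings = [
    ("u1", "adA", 4), ("u1", "adB", 2),
    ("u2", "adA", 5),
    ("u3", "adB", 1), ("u3", "adC", 3),
]
utility = build_utility_matrix(ratings)
gaps = missing_entries(utility)
```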

6) Machine learning with high performance

An in-memory cluster computing platform is used, which can increase performance by up to 100x compared with a traditional Hadoop deployment. It allows our platform to load data into a cluster's memory and query it repeatedly, making it suitable for different machine learning algorithms. Data is therefore processed faster, which helps in scaling the application.

Figure 6 System Component Architecture

SYSTEM DESIGN AND IMPLEMENTATION

This section discusses three items: a) system component architecture design, b) system interface and connectivity design, and c) recommendation engine workflow.

1) Fundamental data components (Profiles and GEO Info): This location-based system needs several fundamental profile datasets, such as geography information describing several aspects of a location. Other information is also important for targeting and ad mapping: application profiles, merchant profiles (shop profiles), and user profiles. When all the data for a location (GEO ID) match each other, we have good ads, or at least good candidates, for the user's ad space. These candidates are then prioritized according to business strategies. All this information is stored in MongoDB and is kept refreshed.

2) Other data components: A few other data components are also important for this system. They are the recommendations and results used for pushing ads to end users, and the ad and profile analyses for customer review. These are stored in MongoDB.

Recommendations and click-through rate (CTR) history are used to improve the recommendation machine learning models.

Output files come from the off-line batch-mode recommendation system based on Hadoop. There are also several training datasets in HDFS.

Load Balancer: To support incremental and spike workloads, our system shall support load balancing.

Analytics/Recommendation Engine: We provide a real-time ad recommendation engine. The engine involves an off-line training system that produces pre-aggregated results for recommending ads to end users. It also runs on the real-time Spark processing system.

The real-time recommendation system loads in-stream data to refresh its training datasets. It can also leverage the pre-aggregated results produced by machine learning models trained in off-line batch mode. The recommendation engine tackles issues such as the cold-start problem by using various machine learning techniques to improve performance for new users accessing our system. It handles both offline and real-time training on datasets to create recommendation sets for each user. The most recent shopping history is also used to understand user behavior.

3) Sampling Engine: To obtain a smaller but representative dataset, this engine works in the pre-processing stage, improving performance and reducing the pressure that the huge data scale places on the production system.

4) Security: Our system shall support role-based access control (RBAC).

5) Web services and APIs: We will build web services and APIs for retrieving results from the service layers. This component resides in the communication layer.

6) User Interface: The system serves two groups of users. The first group is the system customers: the merchant users or ad publishers who own businesses and purchase the advertisement services provided by the system. They need ad publishing, an ad details UI, ad content management, ad analysis, strategies, and profit analysis. The second group is the system end users: the advertisement receivers on mobile devices.
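The Sampling Engine listed among the components above is described only by its goal, a smaller but representative dataset. One way such an engine could work is stratified sampling, sketched below as an assumption rather than as the system's actual method: the same fraction is drawn from every group so that the sample keeps the group proportions. The function name, the `city` key, and the generated records are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from every stratum.

    At least one record is kept per stratum so small groups stay
    represented in the sample.
    """
    rng = random.Random(seed)  # fixed seed for repeatable samples
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)
    sample = []
    for group in strata.values():
        size = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, size))
    return sample

# Hypothetical review records: 80 from one city, 20 from another.
records = ([{"city": "Phoenix", "id": n} for n in range(80)] +
           [{"city": "Tempe", "id": n} for n in range(20)])
sample = stratified_sample(records, key="city", fraction=0.1)
```

Sampling per stratum rather than over the whole dataset keeps the 80/20 split between the two cities in the reduced dataset, which a plain random sample only achieves on average.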

b) System Interface and Connectivity Design

In this section we discuss the system interface and connectivity design of the analytics service platform. The end user or customer (merchant) logs into the system from a device over the Internet and then browses the web or clicks an app. The service request is then sent to the Web server, and the Web server communicates with the recommendation engine. The recommendation engine uses the data and algorithms stored in the database to perform real-time recommendation for the end users and customers.

c) Recommendation Engine Process Flow

The whole process flow can be divided into three key steps: pre-processing, machine learning processing, and post-processing. In each of these three steps, there are several innovations.

Figure 8. Personalized Recommendation for the End-user

A CASE STUDY

We can simulate obtaining a user's GEO information by clicking on the Google Map, as shown in Figure 7 and Figure 8. Based on the returned latitude and longitude, a highly personalized recommendation list is displayed for a specific user. We also look at a bubble chart based on zip codes. In Figure 9, bubbles with the same color belong to the same city but to different zip codes. The larger the bubble, the larger the number of merchants in that zip code. Similar location-related distributions are given in Figure 10. Here, we used the training dataset to observe the user reviews by different cities around Phoenix, AZ.

CONCLUSION AND FUTURE WORK

This paper presented a novel approach to supporting a location-based ad recommendation system using current state-of-the-art technologies. The project provides a decision-based approach to handle various use cases associated with pushing relevant ads to end users. The objective of the project is to go through the whole process of complete testing and benchmarking, which would enable us to put forward a scalable big data ad processing platform in the current market. The project also provides a pilot data analytics approach for merchants to view their end users. Once this approach has been beta-tested and reviewed by the merchants, we will improve the data analytics component of our system. The scope of the current project is very large. Many features can be added to the system to achieve high scalability in support of real-time processing and recommendation. Currently our system handles only offline modeling and training. In the future, we plan to provide online modeling and training of datasets to enrich the mobile user experience.

Figure 7.

Figure 9. Bubble chart of the number of merchants by zip code

Figure 10. User reviews by different cities around Phoenix, AZ

References

[1] Google, “Android, the world's most popular mobile platform”, [Online], 11 Jan 2015, Available: http://developer.android.com/about/index.html

[2] Chuck Jones, “Why 10 Million iPhones Means A Lot More Than 9 Million”, Sep 2014, [Online], Available: http://www.forbes.com/sites/chuckjones/2014/09/23/why-10-million-iphones-means-a-lot-more-than-9-million/

[3] James Manyika, et al., “Disruptive technologies: Advances that will transform life, business, and the global economy”, May 2013, http://www.mckinsey.com/insights/business_technology/disruptive_technologies

[4] H. Cohen, “How Your Audience Uses Mobile Now”, [Online], Available: http://heidicohen.com/67-mobile-facts-from-2013-research-charts/

[5] Martin Kihn and Mike McGuire, “Gartner Webinars, Mobile Marketing and Data-Driven Marketing”, Research VP, 14 May 2014, http://www.gartner.com/webinar/2689618

[6] Alex Giamas, “Spark, Storm and Real Time Analytics”, 2014, Available: http://www.infoq.com/news/2014/01/Spark-Storm-Real-Time-Analytics

[7] M. T. Jones, “Spark, an alternative for fast data analytics”, 2012, [Online], http://www.ibm.com/developerworks/library/os-spark/

[8] Jiawei Han, Micheline Kamber, and Jian Pei, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, Elsevier Inc., 2011

[9] Pasquale Lops, et al., “Content-based Recommender Systems: State of the Art and Trends”, [Online], http://www.ics.uci.edu/~welling/teaching/CS77Bwinter12/handbook/ContentBasedRS.pdf

[10] Real-Time, Online and Offline Complex Event Processing [Online]. Available: http://www.thecepblog.com/2009/02/08/real-time-online-and-offline-complex-event-processing (Accessed: Nov. 5, 2014).

[11] X. Liu. (2013, Sep. 19). Understanding Big Data Processing and Analytics [Online]. Available: http://www.developer.com/db/understanding-big-data-processing-and-analytics.html (Accessed: Nov. 5, 2014).

[12] T. Das. Spark Streaming [Online]. Available: http://stanford.edu/~rezab/sparkclass/slides/td_streaming.pdf
