Data Analysis - Making Big Data Work
![Page 1: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/1.jpg)
Data Analysis Making Big Data Work
David Chiu
2014/11/24
![Page 2: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/2.jpg)
About Me
Founder of LargitData
Ex-Trend Micro Engineer
ywchiu.com
![Page 3: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/3.jpg)
Big Data & Data Science
![Page 4: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/4.jpg)
US Election Prediction
![Page 5: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/5.jpg)
World Cup Prediction
![Page 6: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/6.jpg)
Hurricane Prediction
![Page 7: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/7.jpg)
Data Science
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
![Page 8: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/8.jpg)
![Page 9: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/9.jpg)
To Be a Data Scientist, You Need to Know That Much? Seriously?
![Page 10: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/10.jpg)
Statistics
Single Variable, Multi Variable, ANOVA
Data Munging
Data Extraction, Transformation, Loading
Data Visualization
Figure, Business Intelligence
Required Skills
![Page 11: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/11.jpg)
What You Probably Need Is A Team
Business Analyst: Knowing how to use different tools under different circumstances
Statistician: How to process big data?
DBA: How to deal with unstructured data
Software Engineer: Knowing how to use statistics
![Page 12: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/12.jpg)
Four Dimensions
Single Machine: Memory, R, Local File
Cloud Distributed: Hadoop, HDFS
Statistics: Analysis, Linear Algebra
Architect: Management, Standard
Concept: MapReduce, Linear Algebra, Logistic Regression
Tool: Hadoop, PostgreSQL, R
Analyst: How to use these tools
Hackers: R, Python, Java
![Page 13: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/13.jpg)
“80% are doing summing and averaging”
Content
1. Data Munging
2. Data Analysis
3. Interpret Result
What Do Data Scientists Do?
![Page 14: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/14.jpg)
Application of Data Analysis
Text Mining
Classify Spam Mail
Build Index
Data Search Engine
Social Network Analysis
Finding Opinion Leaders
Recommendation System
What do users like?
Opinion Mining
Positive/Negative Opinion
Fraud Analysis
Credit Card Fraud
![Page 15: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/15.jpg)
Feed Data to the Computer
Make the Computer Do the Analysis
![Page 16: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/16.jpg)
Let the Computer Predict for You
![Page 17: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/17.jpg)
Predictive Analysis
Learn from experience (data) to predict future behavior
What to predict?
e.g. Who is likely to click on that ad?
For what?
e.g. Use the click probability and the revenue to decide which ad to show.
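One way to read "use the click probability and the revenue to decide which ad to show" is as an expected-value comparison. The sketch below assumes that reading; the ad names, click probabilities, and revenue figures are made up for illustration and do not come from the slides.

```python
# Hypothetical ads: every number here is an illustrative assumption.
ads = {
    "cosmetics": {"click_prob": 0.15, "revenue_per_click": 0.40},
    "sneakers": {"click_prob": 0.05, "revenue_per_click": 2.00},
    "coffee": {"click_prob": 0.10, "revenue_per_click": 0.50},
}


def expected_revenue(ad):
    """Expected revenue of one impression: P(click) * revenue per click."""
    return ad["click_prob"] * ad["revenue_per_click"]


def pick_ad(candidates):
    """Return the name of the ad with the highest expected revenue."""
    return max(candidates, key=lambda name: expected_revenue(candidates[name]))


best = pick_ad(ads)
print(best, round(expected_revenue(ads[best]), 4))
```

Note that the ad with the highest click probability is not necessarily the one shown: a rarely clicked ad can still win if each click is worth more.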
![Page 18: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/18.jpg)
Do customers who buy beer also buy Pampers?
People who browse telephone rate plans are likely to switch vendors
People in the same group tend to use the same telecom vendor
Surprising Conclusion
![Page 19: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/19.jpg)
Based on personal behavior, a predictive model uses personal
characteristics to generate a probabilistic score: the higher the score,
the more likely the behavior.
Predictive Model
![Page 20: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/20.jpg)
Linear Model
e.g. For a cosmetics ad, we can give a 90% weight to female customers
and a 10% weight to male customers. Multiplying by the click
probability (15%) gives the possibility score (or probability):
Female 13.5%, Male 1.5%
Rule Model
e.g.
If the user is female
And her income is over 30k
And she hasn't seen the ad yet
Then the click rate is 11%
Simple Predictive Model
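The two simple models above can be sketched in a few lines. The weights (90% / 10%), the 15% click probability, and the 11% rule outcome come from the slide; the function names and the choice to return `None` when the rule does not apply are assumptions made for this example.

```python
def linear_score(gender, base_click_prob=0.15):
    """Linear model from the slide: weight per gender times the click probability."""
    weights = {"female": 0.90, "male": 0.10}
    return weights[gender] * base_click_prob


def rule_score(gender, income, has_seen_ad):
    """Rule model from the slide: 11% click rate when all three conditions hold.

    The slide gives no rate for users outside the rule, so None is returned
    there (an assumption of this sketch).
    """
    if gender == "female" and income > 30_000 and not has_seen_ad:
        return 0.11
    return None


print(round(linear_score("female"), 3))  # 0.9 * 0.15
print(round(linear_score("male"), 3))    # 0.1 * 0.15
print(rule_score("female", 35_000, has_seen_ad=False))
```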
![Page 21: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/21.jpg)
Induction
From detail to general
A computer program is said to learn from experience E with respect to
some task T and some performance measure P, if its performance on T,
as measured by P, improves with experience E
-- Tom Mitchell (1998)
Discover an effective model
Start from a simple model
Update the model as new data is fed in
Keep on improving prediction power
Machine Learning
![Page 22: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/22.jpg)
Statistical Analysis
Regression Analysis
Clustering
Classification
Recommendation
Text Mining
Application
![Page 23: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/23.jpg)
Image recognition
![Page 24: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/24.jpg)
Decision Tree
Rate > 1,299/Month?
  Yes: Probability to switch vendor 15%
  No: Probability to switch vendor 3%
![Page 25: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/25.jpg)
Decision Tree
Rate > 1,299/Month?
  Yes: Income > 22,000?
    Yes: Probability to switch vendor 10%
    No: Probability to switch vendor 22%
  No: Probability to switch vendor 3%
![Page 26: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/26.jpg)
Decision Tree
Rate > 1,299/Month?
  Yes: Income > 22,000?
    Yes: Probability to switch vendor 10%
    No: Probability to switch vendor 22%
  No: Free for intranet?
    Yes: Probability to switch vendor 1%
    No: Probability to switch vendor 7%
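A decision tree of this size is equivalent to a few nested conditions. The sketch below encodes the final three-split tree, assuming that the left-hand probability on each slide belongs to the "Yes" branch; the function and parameter names are invented for this example.

```python
def switch_probability(rate_per_month, income, free_intranet_calls):
    """Probability that a customer switches vendor, read off the slide's tree.

    Assumption: on each slide the first listed probability is the "Yes" branch.
    """
    if rate_per_month > 1299:
        # High monthly rate: split on income.
        return 0.10 if income > 22000 else 0.22
    # Lower monthly rate: split on whether intranet calls are free.
    return 0.01 if free_intranet_calls else 0.07


print(switch_probability(1500, 20000, free_intranet_calls=False))
print(switch_probability(900, 50000, free_intranet_calls=True))
```

Each extra split refines one leaf of the previous tree: the 15% leaf becomes 10% and 22%, and the 3% leaf becomes 1% and 7%.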
![Page 27: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/27.jpg)
Supervised Learning
Regression
Classification
Unsupervised Learning
Dimension Reduction
Clustering
Machine Learning
![Page 28: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/28.jpg)
Supervised Learning
![Page 29: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/29.jpg)
Classification
e.g. Stock prediction on bull/bear market
Regression
e.g. Price prediction
Supervised Learning
![Page 30: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/30.jpg)
Dimension Reduction
e.g. Making a new index
Clustering
e.g. Customer Segmentation
Unsupervised Learning
![Page 31: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/31.jpg)
Lift
The better the lift, the greater the cost?
The more decision rules, the more campaigns?
Design a strategy for each persona?
The lift for 4 campaigns?
The lift for 20 campaigns?
![Page 32: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/32.jpg)
Can we use the production rate of butter to
predict stock market?
Overfitting
![Page 33: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/33.jpg)
Use noise as information
Over assumption
Over Interpretation
What an overfitted model learns is not the truth
Like memorizing all the answers to a single test.
Overfitting
![Page 34: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/34.jpg)
Testing Model
Use external data or a held-out part of the data as the testing dataset
![Page 35: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/35.jpg)
Traditional Analysis Tool
![Page 36: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/36.jpg)
Statistics On The Fly
Built-in Math and Graphic Function
Free and Open Source
http://cran.r-project.org/src/base/
R Language
![Page 37: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/37.jpg)
Functional Programming
Use Function Definition To Retrieve Answer
Interpreted Language
Statistics On the Fly
Object Oriented Language
S3 and S4 Method
R Language
![Page 38: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/38.jpg)
Most Used Analytic Language
The most popular languages are R, Python (39%), SQL (37%), and SAS (20%).
By Gregory Piatetsky, Aug 27, 2013.
![Page 40: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/40.jpg)
Data Scientists at Google and Apple Use R
What is your programming language of choice, R,
Python or something else?
“I use R, and occasionally matlab, for data analysis. There is
a large, active and extremely knowledgeable R community at
Google.” http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/
“Expert knowledge of SAS (With Enterprise Guide/Miner) required and candidates with strong knowledge of R will be preferred” http://www.kdnuggets.com/jobs/13/03-29-apple-sr-data-scientist.html?utm_source=twitterfeed&utm_medium=facebook&utm_campaign=tfb&utm_content=FaceBook&utm_term=analytics#.UVXibgXOpfc.facebook
![Page 41: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/41.jpg)
Discover which customers are likely to churn
Customer Churn Analysis
![Page 42: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/42.jpg)
Account Information
state
account length
area code
phone number
User Behavior
international plan
voice mail plan, number vmail messages
total day minutes, total day calls, total day charge
total eve minutes, total eve calls, total eve charge
total night minutes, total night calls, total night charge
total intl minutes, total intl calls, total intl charge
number customer service calls
Target
Churn (Yes/No)
Data Description
![Page 43: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/43.jpg)
> install.packages("C50")
> library(C50)
> data(churn)
> str(churnTrain)
> churnTrain = churnTrain[,! names(churnTrain) %in% c("state",
"area_code", "account_length") ]
> set.seed(2)
> ind <- sample(2, nrow(churnTrain), replace = TRUE, prob=c(0.7, 0.3))
> trainset = churnTrain[ind == 1,]
> testset = churnTrain[ind == 2,]
Split data into training and testing datasets
70% as training dataset, 30% as testing dataset
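The same 70/30 idea can be sketched in plain Python. Like the R call to `sample(2, ..., replace = TRUE, prob = c(0.7, 0.3))`, this assigns each row independently, so the split is only approximately 70/30. The function name is an assumption, and Python's generator will not reproduce the rows that R picks for the same seed.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=2):
    """Assign each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))  # stand-in for the rows of a data frame
trainset, testset = split_rows(rows)
print(len(trainset), len(testset))
```

Fixing the seed matters for the same reason `set.seed(2)` does in the R session: a model evaluated on a different random split each run cannot be compared across runs.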
![Page 44: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/44.jpg)
library(rpart)  # rpart() comes from the rpart package
churn.rp <- rpart(churn ~ ., data=trainset)
plot(churn.rp, margin= 0.1)
text(churn.rp, all=TRUE, use.n = TRUE)
Build Classifier
Classification
![Page 45: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/45.jpg)
> predictions <- predict(churn.rp, testset, type="class")
> table(testset$churn, predictions)
Prediction Result
pred no yes
no 859 18
yes 41 100
![Page 46: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/46.jpg)
> library(caret)  # confusionMatrix() comes from the caret package
> confusionMatrix(table(predictions, testset$churn))
Confusion Matrix and Statistics
predictions yes no
yes 100 18
no 41 859
Accuracy : 0.942
95% CI : (0.9259, 0.9556)
No Information Rate : 0.8615
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7393
Mcnemar's Test P-Value : 0.004181
Sensitivity : 0.70922
Specificity : 0.97948
Pos Pred Value : 0.84746
Neg Pred Value : 0.95444
Prevalence : 0.13851
Detection Rate : 0.09823
Detection Prevalence : 0.11591
Balanced Accuracy : 0.84435
'Positive' Class : yes
Use Confusion Matrix
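The statistics in this output follow directly from the four cell counts. As a cross-check, the sketch below recomputes several of them in Python from the counts printed above (100, 18, 41, 859), with "yes" as the positive class; the function name is an assumption.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Summary statistics for a 2x2 confusion matrix (positive class = churn 'yes')."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # share of actual churners that were caught
    specificity = tn / (tn + fp)  # share of non-churners correctly left alone
    # Agreement expected by chance, used by Cohen's kappa.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Counts from the slide: predicted yes/actual yes = 100, predicted yes/actual no = 18,
# predicted no/actual yes = 41, predicted no/actual no = 859.
metrics = confusion_metrics(tp=100, fn=41, fp=18, tn=859)
for name, value in metrics.items():
    print(f"{name}: {value:.5f}")
```

The gap between accuracy (about 94%) and sensitivity (about 71%) is the point of looking past accuracy: with only about 14% churners, a model can score well overall while still missing many of the customers it is meant to find.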
![Page 47: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/47.jpg)
Use Testing Data to Validate Result
library(ROCR)  # prediction() and performance() come from the ROCR package
predictions <- predict(churn.rp, testset, type="prob")
pred.to.roc <- predictions[, 1]
pred.rocr <- prediction(pred.to.roc, as.factor(testset[,(dim(testset)[[2]])]))
perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff")
perf.tpr.rocr <- performance(pred.rocr, "tpr","fpr")
plot(perf.tpr.rocr, colorize=T,main=paste("AUC:",(perf.rocr@y.values)))
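The AUC shown in the plot title has a simple interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it that way for a tiny made-up set of scores; it illustrates the definition and is not how ROCR computes the value internally.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels are 1 (positive) or 0 (negative).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical churn scores: two churners (1) and two non-churners (0).
scores = [0.9, 0.4, 0.5, 0.1]
labels = [1, 1, 0, 0]
print(auc(scores, labels))
```

This pairwise version is quadratic in the number of rows, so it suits small illustrations only.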
![Page 48: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/48.jpg)
Finding Most Important Variable
library(rminer)  # fit(), Importance() and mgraph() come from the rminer package
model=fit(churn~.,trainset,model="svm")
VariableImportance=Importance(model,trainset,method="sensv")
L=list(runs=1,sen=t(VariableImportance$imp),sresponses=VariableImportance$sresponses)
mgraph(L,graph="IMP",leg=names(trainset),col="gray",Grid=10)
![Page 49: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/49.jpg)
Dynamic Language
Execution at runtime
Dynamic Type
Interpreted Language
See the result after execution
OOP
Python Language
![Page 50: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/50.jpg)
Cross Platform (Python VM)
Third-Party Resources
(Data Analysis, Graphics, Website Development)
Simple, and easy to learn
Benefit of Python
![Page 51: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/51.jpg)
Data Analysis
Scipy
Numpy
Scikit-learn
Pandas
![Page 52: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/52.jpg)
Companies That Use Python
![Page 53: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/53.jpg)
Use InfoLite Tool To Extract DOM
![Page 54: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/54.jpg)
Use Python To Build Up Dashboard
![Page 55: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/55.jpg)
Monitor Social Media and News
Monitor posts on social media
Configure keywords and alerts
Use line plots to show daily post statistics
Apple Daily, NOWnews, udn, Central News Agency, Storm Media, and other financial media
![Page 56: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/56.jpg)
Daily Statistics Report
![Page 57: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/57.jpg)
Examine Associated Articles
![Page 58: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/58.jpg)
Configure Alert and Keyword
![Page 59: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/59.jpg)
Configure Monitor Channel
![Page 60: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/60.jpg)
Track Specific Article
![Page 61: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/61.jpg)
Have You Learned Big Data?
![Page 62: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/62.jpg)
![Page 63: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/63.jpg)
The 3Vs of Big Data
![Page 64: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/64.jpg)
![Page 65: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/65.jpg)
Product Centric vs. Customer Centric
![Page 66: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/66.jpg)
Customer Centric?
http://goo.gl/iuy4lY
![Page 67: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/67.jpg)
Personal Recommendation
![Page 68: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/68.jpg)
Knowing Who You Are?
Personal recommendation
Customer relation management
Knowing What the Future Looks Like?
From the history, we can see the future
Predictive analysis
Knowing What is Hidden Beneath?
Correlation, Correlation, Correlation
So… What is Big Data?
![Page 69: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/69.jpg)
So… How To Analyze?
![Page 70: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/70.jpg)
Apache Project – From Yahoo
Features
Extensible
Cost Effective
Flexible
Highly Fault Tolerant
Hadoop
![Page 71: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/71.jpg)
Hadoop Ecosystem
HDFS
MR IMPALA HBASE
PIG HIVE
SQOOP FLUME
HUE, Oozie, Mahout
![Page 72: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/72.jpg)
Tools for different scale
Size | Classification | Tools
Lines | Sample Data | Analysis and Visualisation: Whiteboard, Bash, ...
KBs – low MBs | Prototype Data | Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...
MBs – low GBs | Online Data | Storage: MySQL (DBs), ...; Analysis: NumPy, SciPy, Pandas, Weka, ...; Visualisation: Flare, AmCharts, Raphael
GBs – TBs – PBs | Big Data | Storage: HDFS, HBase, Cassandra, ...; Analysis: Hive, Giraph, Hama, Mahout
![Page 73: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/73.jpg)
Amazon
![Page 74: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/74.jpg)
![Page 75: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/75.jpg)
Recommendation System
Javascript
Flume
HDFS
HBase Pig
Mahout
![Page 76: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/76.jpg)
Item-Based
![Page 77: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/77.jpg)
User-Based
![Page 78: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/78.jpg)
Monitor User Rating
![Page 79: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/79.jpg)
Send User Behavior to Backend
![Page 80: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/80.jpg)
Use Flume To Collect Streaming Data
From /tmp/postlog.txt To /user/cloudera/flume
![Page 81: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/81.jpg)
JSON sample data
{"food":"Tacos", "person":"Alice", "amount":3}
{"food":"Tomato Soup", "person":"Sarah", "amount":2}
{"food":"Grilled Cheese", "person":"Alex", "amount":5}
Demo Code
second_table = LOAD 'second_table.json'
    USING JsonLoader('food:chararray, person:chararray, amount:int');
Use Pig To Load JSON
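For comparison, the same three sample records can be read with Python's standard `json` module. This is a single-machine sketch of what the Pig `LOAD ... USING JsonLoader` statement does at cluster scale: one JSON object per line, with the fields `food`, `person`, and `amount`. The per-person total at the end is an illustrative addition, not part of the demo.

```python
import json
from collections import defaultdict

# The three sample records from the slide, one JSON object per line.
raw = """{"food":"Tacos", "person":"Alice", "amount":3}
{"food":"Tomato Soup", "person":"Sarah", "amount":2}
{"food":"Grilled Cheese", "person":"Alex", "amount":5}"""

# Parse each non-empty line into a dict, mirroring the declared schema
# (food: string, person: string, amount: int).
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

# Example aggregation: total amount ordered per person.
amount_by_person = defaultdict(int)
for record in records:
    amount_by_person[record["person"]] += int(record["amount"])

print(dict(amount_by_person))
```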
![Page 82: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/82.jpg)
Build Recommendation Model
![Page 83: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/83.jpg)
$ hbase shell
> create 'mydata', 'mycf'
Build Table In HBase
![Page 84: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/84.jpg)
Examine Data In HDFS
![Page 85: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/85.jpg)
Use Pig To Transfer Data Into HBase
![Page 86: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/86.jpg)
Examine Data In HBase
![Page 87: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/87.jpg)
Build API
![Page 88: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/88.jpg)
Recommendation System
![Page 89: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/89.jpg)
Focus on algorithm
Divide and Conquer, Trie, Collaborative Filtering
Be an expert in a single programming language
But know which tools and algorithms you can use to
solve your problem
Define your role
Statistician
Software engineer
What You Should Do
![Page 90: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/90.jpg)
Website:
largitdata.com
ywchiu.com
Email:
Contacts
![Page 91: Data Analysis - Making Big Data Work](https://reader033.fdocuments.us/reader033/viewer/2022052623/559b87d71a28ab67158b45b2/html5/thumbnails/91.jpg)