Data Scientist's Daily Life
Uploaded by li-wei-yang
Transcript of Data Scientist's Daily Life
![Page 1: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/1.jpg)
DATA SCIENTIST’S DAILY LIFE
![Page 2: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/2.jpg)
AGENDA
• Data scientist?
• Big data and data scientist
• Data scientist’s Toolbox
• Data is the biggest
![Page 3: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/3.jpg)
Derive Knowledge
from Big Data
![Page 4: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/4.jpg)
Efficiently
and
Intelligently
![Page 5: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/5.jpg)
FROM BACKEND TO FRONTEND
https://doubleclix.wordpress.com/2012/12/15/what-or-who-is-a-data-scientist/
![Page 6: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/6.jpg)
WHAT IS BIG DATA?
![Page 7: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/7.jpg)
WHERE DOES THE DATA COME FROM?
• Web Log data
• Machine data
• Transactional data
• Social media data
• …
![Page 8: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/8.jpg)
https://plus.google.com/+DigitalStrategyIE
![Page 9: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/9.jpg)
![Page 10: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/10.jpg)
A WEB SERVICE RECEIVES MORE THAN 50G OF LOG DATA PER DAY
TOTAL SPACE USED IN THE LAST THREE MONTHS: 4,500G
TOTAL SPACE USED IN THE LAST YEAR: 18,000G (17.6T)
![Page 11: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/11.jpg)
• Data storage / backup
• 2T per HDD
• How do we store MORE than 2T of data?
• $0.3 USD per gigabyte
• Pay 900 USD just for KEEPING the data, without doing anything else with it.
• Read/write speed
• Read: 131.6 MB/s / Write: 131.4 MB/s
• 393 s (about 6.5 min) just to read ONE day of data.
• Large numbers of transactions arriving at once
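The storage and read-time figures above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes 1 GB = 1024 MB and exactly 50 GB of logs per day (the slides say "more than 50G"), and the function and constant names are illustrative, not from the talk.

```python
# Back-of-the-envelope check of the storage and read-time figures.
# Assumptions (not stated on the slides): 1 GB = 1024 MB, 1 TB = 1024 GB,
# and exactly 50 GB of log data per day.
DAILY_LOG_GB = 50            # "more than 50G per day"
READ_SPEED_MB_S = 131.6      # sequential read speed of one HDD
HDD_CAPACITY_GB = 2 * 1024   # one 2 TB disk


def storage_gb(days: int) -> int:
    """Total space needed to keep `days` days of logs."""
    return DAILY_LOG_GB * days


def read_seconds(size_gb: float) -> float:
    """Time to read `size_gb` sequentially from a single disk."""
    return size_gb * 1024 / READ_SPEED_MB_S


three_months_gb = storage_gb(90)                  # 4500 GB
days_per_disk = HDD_CAPACITY_GB / DAILY_LOG_GB    # days until one disk is full
one_day_read_s = read_seconds(DAILY_LOG_GB)       # seconds to read one day

print(f"90 days of logs: {three_months_gb} GB")
print(f"One 2 TB disk fills up after about {days_per_disk:.0f} days")
print(f"Reading one day of logs: {one_day_read_s:.0f} s "
      f"({one_day_read_s / 60:.1f} min)")
```

With exactly 50 GB the read takes roughly 389 s; the 393 s on the slide corresponds to a slightly larger daily volume.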
![Page 12: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/12.jpg)
HADOOP AND MAPREDUCE
![Page 13: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/13.jpg)
HADOOP AND HDFS
http://www.fraudtechwire.com/f-level-guide-to-hadoop-hdfs/
![Page 14: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/14.jpg)
![Page 15: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/15.jpg)
![Page 16: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/16.jpg)
“The world will change when data is distributed.”
– DISTRIBUTED ALGORITHM
![Page 17: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/17.jpg)
MAPREDUCE
http://www.milanor.net/blog/?p=853
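To make the map, shuffle, and reduce steps concrete, here is a minimal single-process sketch of a word count written in the MapReduce style. It does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample log lines are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical log lines, for illustration only.
counts = word_count(["GET /index", "GET /about", "POST /index"])
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here everything runs in one process so the three phases are easy to follow.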
![Page 18: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/18.jpg)
https://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/
![Page 19: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/19.jpg)
http://blog.agro-know.com/?p=3810
![Page 20: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/20.jpg)
PERFORMANCE OF HADOOP?
• Not great, but at least it runs.
• Counting one day of data (86,389,084 rows) takes 39 sec (10 nodes, each with 64G RAM and 2 × 8-core E5 CPUs).
• How about 39 sec × 30 days?
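The question above can be answered with simple arithmetic. The sketch below assumes the counting time grows linearly with the number of days, which the slides do not state; the variable names are illustrative.

```python
# Rough extrapolation of the Hadoop row-count timing.
# Assumption: run time scales linearly with the number of days processed.
ROWS_PER_DAY = 86_389_084
SECONDS_PER_DAY_OF_DATA = 39

rows_per_second = ROWS_PER_DAY / SECONDS_PER_DAY_OF_DATA
seconds_for_30_days = SECONDS_PER_DAY_OF_DATA * 30
minutes_for_30_days = seconds_for_30_days / 60

print(f"Throughput: about {rows_per_second:,.0f} rows per second")
print(f"30 days: {seconds_for_30_days} s ({minutes_for_30_days} min)")
```

Under that assumption, a simple count over one month of data takes close to 20 minutes.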
![Page 21: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/21.jpg)
BEFORE ANALYTICS …
![Page 22: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/22.jpg)
EXTRACT TRANSFORM LOAD
http://www.wisdomjobs.com/e-university/data-warehouse-etl-toolkit-tutorial-201/surrounding-the-requirements-1319/architecture-8029.html
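As a small illustration of the three ETL stages, the sketch below extracts rows from CSV text, transforms them by casting types and dropping malformed rows, and loads the result into an in-memory SQLite table. The log format, the table name `access_log`, and the column names are assumptions made for this example and do not come from the talk.

```python
import csv
import io
import sqlite3

# Hypothetical raw log data; the last row is deliberately malformed.
RAW_LOG = """timestamp,username,status,bytes
2014-10-01T00:00:01,alice,200,512
2014-10-01T00:00:02,bob,404,0
2014-10-01T00:00:03,alice,200,not_a_number
"""


def extract(text):
    """Extract: parse the raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast numeric fields and drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            clean.append((row["timestamp"], row["username"],
                          int(row["status"]), int(row["bytes"])))
        except ValueError:
            continue  # skip malformed rows
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a table and return the row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS access_log "
                 "(ts TEXT, username TEXT, status INTEGER, bytes INTEGER)")
    conn.executemany("INSERT INTO access_log VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM access_log").fetchone()[0]


conn = sqlite3.connect(":memory:")
loaded = load(transform(extract(RAW_LOG)), conn)
print(f"Loaded {loaded} clean rows")
```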
![Page 23: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/23.jpg)
http://www.slideshare.net/capgemini/emc-world-2014-breakout-move-to-the-business-data-lake-not-as-hard-as-it-sounds
![Page 24: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/24.jpg)
http://www.slideshare.net/hortonworks/modern-data-architecture-for-a-data-lake-with-informatica-and-hortonworks-data-platform
![Page 25: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/25.jpg)
DATA SCIENTIST’S TOOLBOX
![Page 26: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/26.jpg)
LINUX
• The best choice for servers
• Free of charge and free as in freedom
• Easy to control the system
• Easy data processing
• Hadoop is mainly deployed on Linux
![Page 27: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/27.jpg)
![Page 28: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/28.jpg)
POWERFUL SHELL SCRIPT
![Page 29: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/29.jpg)
SQL DATABASE
• MySQL, PostgreSQL, Hive, MongoDB (NoSQL)
• Standard SQL language
• Store and manage data
![Page 30: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/30.jpg)
RELATIONAL DATABASE
![Page 31: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/31.jpg)
TABLE RELATION
https://cloudant.com/blog/foundbites-data-model-relational-db-vs-nosql-on-cloudant/
![Page 32: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/32.jpg)
http://ghtorrent.org/relational.html
![Page 33: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/33.jpg)
SQL SYNTAX
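The slide itself is an image, so the following is only a sketch of typical SQL syntax, run through Python's built-in `sqlite3` module. The `users` and `orders` tables and their contents are hypothetical; the query shows the common SELECT, JOIN, WHERE, GROUP BY, and ORDER BY clauses on two related tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical related tables: each order belongs to one user.
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    amount REAL
);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 20.0), (3, 2, 5.0);
""")

# Number of orders and total amount per user, largest total first.
query = """
SELECT u.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
FROM users AS u
JOIN orders AS o ON o.user_id = u.id
WHERE o.amount > 1
GROUP BY u.name
ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for name, n_orders, total in rows:
    print(name, n_orders, total)
```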
![Page 34: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/34.jpg)
R & PYTHON
• Basic Analysis Tools
• Easy to Learn
• Many Packages
![Page 35: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/35.jpg)
![Page 36: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/36.jpg)
![Page 37: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/37.jpg)
• Example
• http://bryannotes.blogspot.tw/2014/08/r-ptt-wantedsocial-network-analysis.html
• http://bryannotes.blogspot.tw/2014/10/python-k-means-script.html
![Page 38: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/38.jpg)
ETC …
• Excel
• Google Analytics
• Visualisation tools (Tableau)
• Web crawler
• Version control management (Git)
• ETL and job scheduling tools (Jenkins)
• …
![Page 39: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/39.jpg)
DATA IS THE BIGGEST
![Page 40: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/40.jpg)
“Person who is better at statistics than any software engineer and better at software engineering than any statistician.”
– JOSH WILLS
![Page 41: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/41.jpg)
STATISTICS
![Page 42: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/42.jpg)
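WHY DO WE NEED MACHINE LEARNING?
• Clustering: how many groups can these people be divided into?
• Classification: which group does each person belong to?
• Regression: what is the probability that an event occurs or that a person belongs to a given group?
• Dimensionality reduction: describe the data with fewer dimensions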
![Page 43: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/43.jpg)
CLUSTERING
http://simplystatistics.org/2014/02/18/k-means-clustering-in-a-gif/
source http://humble-developer.blogspot.tw/2011/01/kmeans-clustering-algorithm-part-1.html
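The linked pages illustrate k-means clustering. Below is a minimal pure-Python sketch of the usual k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). The sample points and the fixed starting centroids are made up for this example; real implementations usually pick starting centroids at random.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def closest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def mean(points: Sequence[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def kmeans(points: Sequence[Point], centroids: Sequence[Point],
           max_iter: int = 100) -> Tuple[List[Point], List[int]]:
    """Run k-means starting from the given initial centroids."""
    centroids = list(centroids)
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: label every point with its nearest centroid.
        labels = [closest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            new_centroids.append(mean(members) if members else old)
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical data: two obvious groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = kmeans(points, centroids=[(1, 1), (9, 8)])
print(labels)
print(centroids)
```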
![Page 44: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/44.jpg)
CLASSIFICATION
http://letsmakerobots.com/content/tcs3200-color-sensor-with-k-nearest-neighbor-classification-algorithm
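The link above uses k-nearest-neighbour classification on colour sensor readings. A minimal sketch of that idea follows; the RGB training values and labels are invented for illustration and are not taken from the linked article.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled RGB readings.
train = [
    ((255, 0, 0), "red"), ((250, 10, 5), "red"),
    ((0, 255, 0), "green"), ((10, 240, 20), "green"),
    ((0, 0, 255), "blue"), ((5, 15, 250), "blue"),
]

print(knn_predict(train, (240, 20, 10)))
print(knn_predict(train, (0, 10, 245)))
```

The choice of k matters: a small k follows the training data closely, while a larger k smooths the decision at the cost of blurring small classes.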
![Page 45: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/45.jpg)
http://www.astroml.org/sklearn_tutorial/
![Page 46: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/46.jpg)
LOGISTIC REGRESSION
https://www.coursera.org/instructor/andrewng
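Logistic regression turns a weighted sum of the inputs into a probability with the sigmoid function. The sketch below shows that step only; the weights, bias, and feature values are arbitrary numbers chosen for illustration, not a trained model.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function 1 / (1 + e^-z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights: Sequence[float], bias: float,
                  features: Sequence[float]) -> float:
    """Probability of the positive class for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical model parameters and input.
p = predict_proba(weights=[0.8, -0.4], bias=-1.0, features=[2.0, 1.0])
print(f"P(y = 1) = {p:.3f}")
print("class:", 1 if p >= 0.5 else 0)
```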
![Page 47: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/47.jpg)
COST FUNCTION
https://www.coursera.org/instructor/andrewng
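For logistic regression the usual cost function is the average cross-entropy (log loss), minimised by gradient descent. The sketch below is a small one-feature version on made-up data; the learning rate and iteration count are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def cost(w0, w1, xs, ys):
    """Average cross-entropy of the model sigmoid(w0 + w1 * x)."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w0 + w1 * x), eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def gradient_step(w0, w1, xs, ys, lr):
    """One gradient descent update of both parameters."""
    n = len(xs)
    g0 = sum(sigmoid(w0 + w1 * x) - y for x, y in zip(xs, ys)) / n
    g1 = sum((sigmoid(w0 + w1 * x) - y) * x for x, y in zip(xs, ys)) / n
    return w0 - lr * g0, w1 - lr * g1


# Hypothetical data: small x values are class 0, large ones class 1.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

w0 = w1 = 0.0
initial_cost = cost(w0, w1, xs, ys)  # every prediction is 0.5 here
for _ in range(1000):
    w0, w1 = gradient_step(w0, w1, xs, ys, lr=0.1)
final_cost = cost(w0, w1, xs, ys)

print(f"cost before training: {initial_cost:.4f}")
print(f"cost after training:  {final_cost:.4f}")
```

With all parameters at zero every prediction is 0.5, so the starting cost equals ln 2; training should push it below that value.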
![Page 48: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/48.jpg)
OVERFITTING
https://www.coursera.org/instructor/andrewng
![Page 49: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/49.jpg)
OH MY GOD! HOW DO WE CHOOSE?
![Page 50: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/50.jpg)
MACHINE LEARNING ALGORITHMS
http://amueller.github.io/sklearn_tutorial/
![Page 51: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/51.jpg)
STATISTICS VS ML
• Statistics: focus on understanding data in terms of models; interpretability, hypothesis testing
• Machine learning: focus on the analysis of learning algorithms; greater focus on prediction
![Page 52: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/52.jpg)
SYSTEMATICS AND AUTOMATION
http://www.slideshare.net/CetasAnalytics/cetas-e-baymeetupprezofinal
![Page 53: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/53.jpg)
http://mlg.postech.ac.kr/projects/
![Page 54: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/54.jpg)
SHOW YOUR DATA AND FINDINGS
![Page 55: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/55.jpg)
http://hortonworks.com/wp-content/uploads/2012/06/Tableau2.png
![Page 56: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/56.jpg)
http://www.tableau.com
![Page 57: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/57.jpg)
http://www.tableau.com
![Page 58: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/58.jpg)
http://www.tableau.com
![Page 59: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/59.jpg)
THE REAL CASE
![Page 60: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/60.jpg)
HOW TO START?
![Page 61: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/61.jpg)
• Codecademy http://www.codecademy.com/ Covers many programming languages, e.g. Python, JavaScript, and even shell scripting and SQL.
• Coursera https://www.coursera.org/ A well-known MOOC site for self-learning.
![Page 62: Data Scientist's Daily Life](https://reader035.fdocuments.us/reader035/viewer/2022062313/55d2b063bb61eb6d3a8b46b1/html5/thumbnails/62.jpg)
http://nirvacana.com/thoughts/becoming-a-data-scientist/