Data Scientist's Daily Life

Post on 18-Aug-2015

DATA SCIENTIST’S DAILY LIFE

AGENDA

• Data scientist?

• Big data and data scientist

• Data scientist’s Toolbox

• Data is the biggest

Derive knowledge from big data, efficiently and intelligently

FROM BACKEND TO FRONTEND

https://doubleclix.wordpress.com/2012/12/15/what-or-who-is-a-data-scientist/

WHAT IS BIG DATA?

WHERE DOES THE DATA COME FROM?

• Web Log data

• Machine data

• Transactional data

• Social media data

• …

https://plus.google.com/+DigitalStrategyIE

A WEB SERVICE RECEIVES MORE THAN 50 GB OF LOG DATA PER DAY

TOTAL SPACE USED IN THE LAST THREE MONTHS: 4,500 GB

TOTAL SPACE USED IN THE LAST YEAR: 18,000 GB (17.6 TB)

• Data storage / backup

• 2 TB per HDD

• How do we store MORE than 2 TB of data?

• 0.3 USD per gigabyte

• Pay 900 USD just for KEEPING the data, without doing anything else with it.

• Read/write speed

• Read: 131.6 MB/s / Write: 131.4 MB/s

• It takes 393 s (about 6 min) just to read ONE day of data.

• A large number of transactions arriving at the same time
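
The storage and read-time figures above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it assumes exactly 50 GB of logs per day, 1 GB = 1024 MB, and a 360-day year (which is what the 18,000 GB total implies). The function names are hypothetical. With exactly 50 GB the read time comes out slightly under the quoted 393 s, presumably because the real daily volume is a little over 50 GB.

```python
DAILY_LOG_GB = 50          # "more than 50 GB per day": lower bound taken from the slide
READ_SPEED_MB_S = 131.6    # sequential read speed quoted on the slide


def total_storage_gb(days, daily_gb=DAILY_LOG_GB):
    """Space needed to keep `days` days of logs."""
    return days * daily_gb


def read_time_seconds(size_gb, speed_mb_s=READ_SPEED_MB_S):
    """Time to read `size_gb` from a single disk (assumes 1 GB = 1024 MB)."""
    return size_gb * 1024 / speed_mb_s


three_months = total_storage_gb(90)    # 90 days of logs
one_year = total_storage_gb(360)       # assumption: 360-day year, as the slide total implies
one_day_read = read_time_seconds(DAILY_LOG_GB)

print(three_months, "GB in three months")
print(one_year, "GB in one year =", round(one_year / 1024, 1), "TB")
print(round(one_day_read), "seconds to read one day of logs")
```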

HADOOP AND MAPREDUCE

HADOOP AND HDFS

http://www.fraudtechwire.com/f-level-guide-to-hadoop-hdfs/

– DISTRIBUTED ALGORITHM

“The world will change when data is distributed.”

MAPREDUCE

http://www.milanor.net/blog/?p=853

https://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/

http://blog.agro-know.com/?p=3810
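
The slides illustrate MapReduce only through the diagrams linked above. As a rough sketch of the idea, the single-process Python example below walks through the map, shuffle and reduce steps for a word count. It does not use the Hadoop API. The function names and sample input are hypothetical, and a real cluster would run the map and reduce steps in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big insight", "data beats opinion"])
print(counts)
```

Each phase only sees its own slice of the data (one line for map, one key for reduce), which is what makes the work easy to spread across machines.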

PERFORMANCE OF HADOOP?

• Not good, but at least it runs.

• Counting 86,389,084 rows (one day of data) takes 39 s (10 nodes, each with 64 GB RAM and 2 × 8-core E5 CPUs).

• How about 39 s × 30 days?
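
To put the question above into numbers, here is a minimal sketch. It assumes the counting time grows linearly with the number of days and that the days are processed one after another; the function name is hypothetical.

```python
ROWS_PER_DAY = 86_389_084        # rows in one day of data, from the slide
SECONDS_PER_DAY_OF_DATA = 39     # measured time to count one day, from the slide


def scan_time_seconds(days, seconds_per_day=SECONDS_PER_DAY_OF_DATA):
    """Estimated counting time, assuming linear scaling with the number of days."""
    return days * seconds_per_day


month_seconds = scan_time_seconds(30)
rows_per_second = ROWS_PER_DAY / SECONDS_PER_DAY_OF_DATA

print(month_seconds, "s =", month_seconds / 60, "min for 30 days")
print(round(rows_per_second), "rows counted per second")
```

Under these assumptions a simple count over one month already takes close to twenty minutes, which is too slow for interactive analysis.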

BEFORE ANALYTICS…

EXTRACT TRANSFORM LOAD

http://www.wisdomjobs.com/e-university/data-warehouse-etl-toolkit-tutorial-201/surrounding-the-requirements-1319/architecture-8029.html

http://www.slideshare.net/capgemini/emc-world-2014-breakout-move-to-the-business-data-lake-not-as-hard-as-it-sounds

http://www.slideshare.net/hortonworks/modern-data-architecture-for-a-data-lake-with-informatica-and-hortonworks-data-platform
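
As a rough illustration of the extract, transform and load steps, the sketch below parses a few web log lines, drops malformed ones, and loads the rest into an in-memory SQLite table. The log format, the table name `access_log` and all function names are assumptions made for this example; they do not come from the slides.

```python
import sqlite3

# Hypothetical raw input; a real pipeline would read these lines from log files.
RAW_LOG_LINES = [
    "2015-08-18 12:00:01 GET /index.html 200",
    "2015-08-18 12:00:02 GET /missing 404",
    "malformed line",
    "2015-08-18 12:00:05 POST /login 200",
]


def extract(lines):
    """Extract: yield raw lines from the source (here an in-memory list)."""
    for line in lines:
        yield line.strip()


def transform(lines):
    """Transform: parse each line into typed fields and skip malformed rows."""
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue
        date, time, method, path, status = parts
        yield (f"{date} {time}", method, path, int(status))


def load(rows, connection):
    """Load: write the cleaned rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS access_log "
        "(ts TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    connection.executemany("INSERT INTO access_log VALUES (?, ?, ?, ?)", rows)
    connection.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_LOG_LINES)), conn)

total_rows = conn.execute("SELECT COUNT(*) FROM access_log").fetchone()[0]
ok_requests = conn.execute(
    "SELECT COUNT(*) FROM access_log WHERE status = 200"
).fetchone()[0]
print(total_rows, "rows loaded,", ok_requests, "with status 200")
```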

DATA SCIENTIST’S TOOLBOX

LINUX

• The best server choice

• Free of charge, with the freedom to modify

• Easy to control the system

• Easy data processing

• Hadoop runs primarily on Linux

POWERFUL SHELL SCRIPTS

SQL DATABASE

• MySQL, PostgreSQL, Hive, MongoDB (NoSQL)

• Standard SQL language

• Store and manage data

RELATIONAL DATABASE

TABLE RELATION

https://cloudant.com/blog/foundbites-data-model-relational-db-vs-nosql-on-cloudant/

http://ghtorrent.org/relational.html

SQL SYNTAX
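
The original slide content for this heading is not in the transcript. As a small sketch of typical SQL syntax on related tables, the example below uses Python's built-in `sqlite3` module. The `users` and `orders` tables and their rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical tables linked by orders.user_id -> users.id.
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER REFERENCES users(id),
                         amount REAL);
    INSERT INTO users  VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 20.0), (3, 2, 15.0);
""")

# SELECT ... FROM ... JOIN ... GROUP BY ... HAVING ... ORDER BY
query = """
    SELECT u.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
    FROM users AS u
    JOIN orders AS o ON o.user_id = u.id
    GROUP BY u.name
    HAVING SUM(o.amount) > 10
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for name, n_orders, total in rows:
    print(name, n_orders, total)
```

The same query text should run with little or no change on MySQL or PostgreSQL, which is the practical benefit of a standard query language.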

R & PYTHON

• Basic Analysis Tools

• Easy to Learn

• Many Packages

ETC…

• Excel

• Google Analytics

• Visualisation tools (Tableau)

• Web crawler

• Version control management (Git)

• ETL and job scheduling tools (Jenkins)

• …

DATA IS THE BIGGEST

– JOSH WILLS

“Person who is better at statistics than any software engineer and better at software engineering than any statistician.”

STATISTICS

WHY DO WE NEED MACHINE LEARNING?

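• Clustering: how many groups can these people be divided into?

• Classification: which group does each person belong to?

• Regression: what is the probability that an event occurs, or that a person belongs to a given group?

• Dimensionality reduction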

CLUSTERING

http://simplystatistics.org/2014/02/18/k-means-clustering-in-a-gif/

source http://humble-developer.blogspot.tw/2011/01/kmeans-clustering-algorithm-part-1.html
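
The links above illustrate k-means clustering. A minimal pure-Python sketch of the algorithm follows. It assumes a naive initialisation (the first k points become the starting centroids) and a fixed number of iterations; the sample points and function names are hypothetical.

```python
import math


def closest(point, centroids):
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, iterations=10):
    centroids = list(points[:k])  # assumption: naive initialisation with the first k points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    labels = [closest(p, centroids) for p in points]
    return centroids, labels


points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.8, 1.2), (9.2, 8.8)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Note that no labels are given to the algorithm: the two groups emerge from the distances alone.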

CLASSIFICATION

http://letsmakerobots.com/content/tcs3200-color-sensor-with-k-nearest-neighbor-classification-algorithm

http://www.astroml.org/sklearn_tutorial/
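
One of the links above uses k-nearest-neighbour classification. As a small sketch of that idea, the example below labels a new point by majority vote among its k closest labelled points. The training data, the class names "A" and "B", and the function name are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled examples: (features, class label).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]

near_a = knn_predict(train, (1.2, 1.4))
near_b = knn_predict(train, (6.4, 6.2))
print(near_a, near_b)
```

Unlike clustering, classification needs examples whose class is already known.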

LOGISTIC REGRESSION

https://www.coursera.org/instructor/andrewng

COST FUNCTION

https://www.coursera.org/instructor/andrewng
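
The figures for these slides are not part of the transcript. The sketch below assumes they show the usual logistic regression setup: a sigmoid hypothesis, the cross-entropy (log loss) cost, and batch gradient descent. The one-feature data set, the learning rate and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x):
    """Hypothesis h(x) = sigmoid(theta0 + theta1 * x): estimated probability of class 1."""
    return sigmoid(theta[0] + theta[1] * x)


def cost(theta, xs, ys):
    """Average cross-entropy (log loss) over the training set."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = predict(theta, x)
        total += -y * math.log(h) - (1 - y) * math.log(1 - h)
    return total / len(xs)


def gradient_step(theta, xs, ys, learning_rate):
    """One batch gradient-descent update of both parameters."""
    m = len(xs)
    errors = [predict(theta, x) - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    return (theta[0] - learning_rate * grad0, theta[1] - learning_rate * grad1)


xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]   # hypothetical single feature
ys = [0, 0, 0, 1, 1, 1]               # hypothetical class labels

theta = (0.0, 0.0)
initial_cost = cost(theta, xs, ys)
for _ in range(2000):
    theta = gradient_step(theta, xs, ys, learning_rate=0.1)
final_cost = cost(theta, xs, ys)

print("cost before:", initial_cost, "after:", final_cost)
print("P(class 1 | x=0.5):", predict(theta, 0.5))
print("P(class 1 | x=4.0):", predict(theta, 4.0))
```

Training drives the cost down, so the model assigns a low probability to the small x values and a high probability to the large ones.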

OVERFITTING

https://www.coursera.org/instructor/andrewng

OH MY GOD! HOW TO CHOOSE IT

MACHINE LEARNING ALGORITHM

http://amueller.github.io/sklearn_tutorial/

STATISTICS VS ML

STATISTICS

• Focus on understanding data in terms of models

• Interpretability, hypothesis testing

MACHINE LEARNING

• Focus on the analysis of learning algorithms

• Greater focus on prediction

SYSTEMATICS AND AUTOMATION

http://www.slideshare.net/CetasAnalytics/cetas-e-baymeetupprezofinal

http://mlg.postech.ac.kr/projects/

SHOW YOUR DATA AND FINDINGS

http://hortonworks.com/wp-content/uploads/2012/06/Tableau2.png

http://www.tableau.com

THE REAL CASE

HOW TO START?

• Codecademy http://www.codecademy.com/ Covers many programming languages, e.g. Python, JavaScript, and even shell script and SQL.

• Coursera https://www.coursera.org/ A well-known MOOC website for self-study.

http://nirvacana.com/thoughts/becoming-a-data-scientist/