Lecture 1 (01/23, 01/28): Introduction to Big Data
Decisions, Operations & Information Technologies
Robert H. Smith School of Business
Spring 2019
K. Zhang, BUDT 758
Business Value of Big Data and AI
http://www.salesforcehacker.com/2014/11/hadoop-and-pig-come-to-salesforce.html
“Airpal is a Web-based data-exploration and SQL query interface that runs on Presto, the in-memory SQL-on-Hadoop query technology that Facebook donated to Apache open source in late 2013. Airbnb invented Airpal because it needed a tool that would be more accessible to data analysts and even business users, not just the 23-person Airbnb data science team that handles Hive and Presto queries.”
---- Airbnb, 3/5/2015 11:35 AM
• “VC investment in the space remains vibrant and the first few weeks of 2016 saw a flurry of announcements of big funding rounds for late stage Big Data startups: DataDog ($94M), BloomReach ($56M), Qubole ($30M), PlaceIQ ($25M), etc. Big Data startups received $6.64B in venture capital investment in 2015, 11% of total tech VC.”
http://www.goldmansachs.com/our-thinking/pages/big-data.html
Big Data ecosystem
The open source community
• Yahoo!
  ❑ Hadoop, Pig
  ❑ Pig hides Java programming
• Facebook
  ❑ Hive: provides SQL-type functions for Hadoop files
• Netflix
  ❑ HBase: massages big data to be like a database
• UC Berkeley
  ❑ Spark: in-memory processing to avoid slow disk I/O
• Twitter
  ❑ Storm: near-real-time streaming data
Technology is still evolving rapidly
And the almighty AI!
• 2012 MATLAB
• 2013 Caffe
• 2014 Theano
• 2015 Torch
• 2016/7 TensorFlow
• 2018 ??? (PyTorch)
• CNN, RNN, GANs…
• Sergey Brin @ 2017 Davos World Economic Forum
  – https://www.youtube.com/watch?v=jYuCVcGxtNM
So, what’s going on?
• You need critical thinking to not get lost
How does data generate value?
Big data processes
• Load data
• Clean up data
• Transform data
• Query data
• Machine learning / deep learning
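The first four steps above can be sketched on a toy scale. This is only an illustration under stated assumptions: the CSV contents, the column names, and the `sales` table are hypothetical, and an in-memory SQLite database stands in for a real big data store such as Hive. The machine learning step is left out; it would consume the cleaned and transformed records.

```python
import csv
import io
import sqlite3

# Hypothetical raw data; a real pipeline would read from HDFS, S3, or a database.
RAW_CSV = """region,units,price
east, 10,2.5
west,4,3.0
east,,2.5
west,6,3.0
"""


def load(text):
    """Load: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean up: strip stray whitespace and drop rows with missing fields."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if all(row.values()):
            cleaned.append(row)
    return cleaned


def transform(rows):
    """Transform: cast types and derive a revenue column."""
    records = []
    for row in rows:
        units, price = int(row["units"]), float(row["price"])
        records.append((row["region"], units, price, units * price))
    return records


def query(records):
    """Query: total revenue per region, using an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (region TEXT, units INTEGER, price REAL, revenue REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", records)
    result = conn.execute(
        "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()
    return result


records = transform(clean(load(RAW_CSV)))
revenue_by_region = query(records)
print(revenue_by_region)
```

The row with a missing `units` value is dropped in the clean-up step, so three of the four input rows reach the query.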
Realizing the benefits of Big Data
• Setting up Hadoop is just the beginning!
  ❑ It only means that you are able to handle big data
  ❑ But it does not guarantee any benefit!
    – It might waste your money and divert your attention.
The easy ones
• Faster and cheaper
  ❑ In late 2007, the New York Times wanted to make available over the web its entire archive of articles, 11 million in all, dating back to 1851. The articles existed as a four-terabyte pile of images in TIFF format, and the Times needed to translate that pile of TIFFs into more web-friendly PDF files.
    – Not a particularly complicated computing chore, but a large one, requiring a whole lot of computer processing time.
• A software programmer at the Times, Derek Gottfrid,
  ❑ playing around with Amazon Web Services' Elastic Compute Cloud (EC2),
    – uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3)
• In less than 24 hours, he had 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site.
• The total cost for the computing job? $240
  ❑ 10 cents per computer-hour times 100 computers times 24 hours
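The cost figure can be checked with a few lines of arithmetic. The rate, machine count, and duration are the ones quoted on the slide; the variable and function names are only illustrative.

```python
from decimal import Decimal

# Figures quoted on the slide.
RATE_PER_COMPUTER_HOUR = Decimal("0.10")  # dollars per computer-hour
COMPUTERS = 100
HOURS = 24


def job_cost(rate, computers, hours):
    """Total cost = rate per computer-hour x number of computers x hours."""
    return rate * computers * hours


computer_hours = COMPUTERS * HOURS
total = job_cost(RATE_PER_COMPUTER_HOUR, COMPUTERS, HOURS)
print(f"{computer_hours} computer-hours, total ${total:.2f}")
```

`Decimal` is used so that the ten-cent rate is represented exactly rather than as a binary floating-point approximation.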
How to make data “actionable”
• D-D-P-P
  ❑ Descriptive: what happened?
  ❑ Diagnostic: why did it happen?
  ❑ Predictive: what is likely to happen?
  ❑ Prescriptive: what is the best course of action?
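A toy sketch of the four D-D-P-P questions, under loud assumptions: the ad-spend and sales numbers are invented, the straight-line model and the profit rule are hypothetical simplifications, and the fitted slope only hints at an association rather than explaining a cause.

```python
from statistics import mean

# Invented data for illustration: ad spend and resulting sales per period.
ad_spend = [1, 2, 3, 4]
sales = [12, 14, 16, 18]


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Descriptive: what happened? Summarize past sales.
average_sales = mean(sales)

# Diagnostic: why did it happen? Check how sales move with ad spend.
slope, intercept = fit_line(ad_spend, sales)


# Predictive: what is likely to happen at a new spend level?
def predict_sales(spend):
    return intercept + slope * spend


predicted = predict_sales(5)

# Prescriptive: which candidate budget gives the best predicted profit?
# Hypothetical rule: profit = predicted sales minus the budget itself.
candidate_budgets = [3, 5, 7]
best_budget = max(candidate_budgets, key=lambda b: predict_sales(b) - b)

print(average_sales, slope, predicted, best_budget)
```

With a purely linear model the prescriptive step simply picks an extreme budget; a realistic analysis would need diminishing returns or constraints before the recommendation means much.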
Traditional vs. Big Data Approach
(Courtesy of Cupid Chan)
A dynamic process
• What are the business goals and critical issues?
• What data do you have?
• What data can you potentially capture?
• What analytical tools could be applied?
[Diagram labels: Goals, Data]
Goal: find business questions that can harness the power of big data