Business of Big Data
![Page 1: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/1.jpg)
Big Data: the next frontier
Leonid Zhukov Professor Higher School of Economics
RVC Seminar, Moscow, 08/02/2013
![Page 2: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/2.jpg)
Big data
+ Graph of terms popularity
www.visibletechologies.com
![Page 4: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/4.jpg)
Headlines
Data driven business
Data democratization
Data scientists
![Page 5: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/5.jpg)
The White House
+ $200M initiative
+ NSF: core techniques
+ NIH: 1000 genomes
+ DOE: advanced computing
+ DOD: data to decisions
+ USGS: Earth system
www.whitehouse.gov
![Page 7: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/7.jpg)
Market Forecast
+ Venture money invested (Reuters):
+ 2009 - $1.1B
+ 2010 - $1.53B
+ 2011 - $2.47B
+ Market forecasts:
+ IDC: 2015 - $16.9B
+ Gartner: 2016 - $55B
www.wikibon.com
![Page 8: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/8.jpg)
Big Data Revenue 2012
+ Big Business:
+ IBM
+ HP
+ Oracle
+ Teradata
+ EMC
www.wikibon.com
![Page 9: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/9.jpg)
Big Data Vendors!
+ Hadoop:
+ Cloudera
+ MapR Technologies
+ HortonWorks
www.wikibon.com
![Page 11: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/11.jpg)
What is big data
+ Big data:
+ “Data you can’t process by traditional tools”
+ “A phenomenon defined by the rapid acceleration in the expanding volume of high velocity, complex and diverse types of data.”
+ “Refers to a collection of tools, techniques and technologies for working with data productively, at any scale.”
![Page 12: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/12.jpg)
What is Big data
+ 3V
+ Volume: petabytes (1000TB) to exabytes (1000PB)
+ Variety: structured, semi-structured, unstructured
+ Velocity: Tb/s data streams
+ Requires distributed processing
+ Big data = storage + processing
+ Big data = Hadoop (not only)
![Page 13: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/13.jpg)
Big data Glossary
+ Hadoop, MapReduce, Hive, Pig, Cascading, HBase, Hypertable, Cassandra, Flume, Sqoop, MongoDB, Voldemort, Storm, Kafka, Drill, Dremel, Impala, ZooKeeper, Ambari, Oozie, YARN, Redis, Riak, Pregel, Gremlin, Giraph, Solr, Lucene, R, Mahout, Weka
![Page 14: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/14.jpg)
How big is Big?
+ Google + 24 PB data processed daily
+ Twitter
+ 340 mln daily tweets
+ 1.6 bln search queries
+ 7 TB added daily
+ Facebook
+ 750 mln users
+ 12 TB daily content
+ 2.7 bln “likes” and comments daily
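To put the daily figures above on a more intuitive scale, they can be converted to per-second rates. The sketch below is plain arithmetic on the numbers quoted on the slide; it assumes decimal units (1 PB = 1,000,000 GB) and a perfectly even load over the day, which real traffic is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(amount_per_day):
    """Average rate per second for a quantity given per day."""
    return amount_per_day / SECONDS_PER_DAY


# Google: 24 PB processed daily, expressed in GB (decimal units assumed)
google_gb_per_s = per_second(24 * 1_000_000)

# Twitter: 340 mln tweets daily
tweets_per_s = per_second(340_000_000)

print(f"Google: about {google_gb_per_s:.0f} GB processed per second")
print(f"Twitter: about {tweets_per_s:.0f} tweets per second")
```

Averages like these hide peaks, so they are a lower bound on what the infrastructure has to sustain.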
![Page 16: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/16.jpg)
Supercomputing
+ National Labs, Universities, Military
+ Processing power, flops, MPI
+ Parallel computing:
+ Cray, IBM SP, SGI
+ Beowulf cluster (Linux commodity)
![Page 17: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/17.jpg)
New realities
+ Yahoo, AltaVista, Inktomi, Google
+ Consumer web companies:
+ web search (crawling, indexing)
+ advertising
+ email services
+ ecommerce
+ Commodity hardware
![Page 18: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/18.jpg)
Google
+ 2003: Google File System (GFS) paper
+ 2004: MapReduce paper
![Page 19: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/19.jpg)
GFS/HDFS
+ Distributed replicated data blocks (64 MB)
+ Master-slave architecture (Name Node, Data Nodes)
+ Not a general file system
+ Access via command line utils and API
+ Files can’t be modified after they are written
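The block and replica idea above can be sketched in a few lines. This is a toy model, not HDFS code: the replication factor of 3 and the round-robin placement are illustrative assumptions (real placement is rack-aware), and the data node names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide
REPLICATION = 3                # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Toy name-node table: block index -> data nodes holding a copy.

    Round-robin placement is an assumption made for illustration only.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    return {
        b: [data_nodes[(b + r) % len(data_nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
layout = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block_id, nodes in layout.items():
    print(block_id, blocks[block_id], nodes)
```

The name node only keeps the table; the bytes themselves live on the data nodes, so losing one node still leaves copies of every block elsewhere.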
![Page 20: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/20.jpg)
MapReduce
+ MapReduce programming model:
+ functional programming
+ like UNIX pipeline
+ Master-slave architecture
+ Master: divide, schedule, monitor work
+ Slave: actual processing
+ Scalable:
+ no file IO
+ no networking
+ no synchronization
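The programming model can be illustrated with the classic word count, written here as a single-process sketch in plain Python rather than with the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative; in a real cluster the framework performs the shuffle and runs the map and reduce functions on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big clusters process big data"])
print(counts)
```

The user code contains only pure functions over records, much like stages in a UNIX pipeline, which is what lets the framework split and schedule the work.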
![Page 21: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/21.jpg)
Data movement
www.cloudera.com
+ store and process data on the same nodes
+ bring code to data, data “locality”
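A minimal sketch of what "bring code to data" means for a scheduler: prefer a node that already stores a replica of the block, and fall back to any free node otherwise. The greedy strategy, the `schedule` function and the node names are assumptions for illustration, not the actual Hadoop scheduler.

```python
def schedule(block_locations, free_slots):
    """Assign each block's task to a node, preferring nodes that hold the block.

    block_locations: block id -> list of nodes storing a replica
    free_slots:      node -> number of tasks it can still accept
    Returns block id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy, leave the caller's dict intact
    assignment = {}
    for block, nodes in block_locations.items():
        node = next((n for n in nodes if slots.get(n, 0) > 0), None)
        is_local = node is not None
        if node is None:
            # No replica holder is free: data must travel over the network.
            node = next((n for n, s in slots.items() if s > 0), None)
            if node is None:
                raise RuntimeError("no free slots left")
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


plan = schedule(
    {"b0": ["dn1", "dn2"], "b1": ["dn1"], "b2": ["dn1"]},
    {"dn1": 1, "dn2": 1, "dn3": 1},
)
print(plan)
```

In this run only `b0` is processed locally; the other two tasks land on nodes without the data, which is exactly the network traffic that locality-aware scheduling tries to minimize.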
![Page 22: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/22.jpg)
Hadoop
+ Doug Cutting
+ Search indexer - Lucene
+ Web crawler - Nutch
+ Hadoop
+ HDFS
+ MapReduce
![Page 23: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/23.jpg)
Yahoo!
+ 40,000 servers
+ 170PB storage
+ 1000+ active users
+ 5M+ monthly jobs
+ email spam filters
+ categorization, personalization
+ computational advertising
![Page 24: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/24.jpg)
Databases: NoSQL Revolution
+ Needed:
+ fast read/write time
+ high concurrency
+ easy horizontal scalability
+ Flat data structure
+ Sacrificed:
+ DB Schema
+ SQL
+ Transactions
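The trade-off above can be made concrete with a toy key-value store that spreads keys over several shards by hash. `ShardedKV` is a hypothetical class written for this illustration; the shards are in-memory dictionaries standing in for separate servers.

```python
import hashlib


class ShardedKV:
    """Toy hash-partitioned key-value store: no schema, no SQL, no transactions."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash decides which shard owns the key.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


store = ShardedKV(num_shards=4)
store.put("user:42", {"name": "Ann", "likes": 7})
store.put("user:43", {"name": "Bob"})  # records need not share a schema
print(store.get("user:42"))
```

Each read or write touches exactly one shard, which is what makes this design fast and easy to scale out. Plain modulo hashing remaps most keys when a shard is added, so production systems typically use consistent hashing instead.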
![Page 25: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/25.jpg)
NoSQL World
+ Key-value: Dynamo, Voldemort, Redis, Riak
+ Column (tabular): HBase, Hypertable, Cassandra
+ Document store: CouchDB, MongoDB
+ Graph: Neo4J, FlockDB
+ 120+ products (2012)
![Page 27: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/27.jpg)
Hadoop tools
+ Pig
+ high level scripting language (PigLatin)
+ converts to MapReduce jobs
+ Hive
+ SQL-like queries on data in HDFS
+ converts to MapReduce jobs
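To see why a declarative query can be turned into a MapReduce job, consider a hypothetical query `SELECT country, SUM(amount) FROM sales GROUP BY country`. The sketch below shows the shape of the job such a query corresponds to: the GROUP BY column becomes the map output key and the aggregate becomes the reducer. The table, column names and functions are invented for illustration; this is not what Hive or Pig actually generate.

```python
from collections import defaultdict

# Hypothetical "sales" table
rows = [
    {"country": "RU", "amount": 10},
    {"country": "US", "amount": 5},
    {"country": "RU", "amount": 7},
]


def mapper(row):
    """GROUP BY column becomes the key, the aggregated column the value."""
    return row["country"], row["amount"]


def reducer(key, values):
    """SUM(amount) is applied to all values that share a key."""
    return key, sum(values)


def run_group_by(table):
    groups = defaultdict(list)
    for key, value in map(mapper, table):
        groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())


totals = run_group_by(rows)
print(totals)
```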
![Page 29: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/29.jpg)
Typical Hadoop usage
+ Text mining
+ Pattern recognition
+ Recommendation systems (collaborative filtering)
+ Prediction models
+ Risk assessment
+ Sentiment analysis
+ Customer churn prediction
+ Customer segmentation
+ Point of Sale Transaction analysis
+ Data “sandbox”
![Page 30: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/30.jpg)
Application fields
+ Science: sensors, genome, weather, satellite, imaging
+ Engineering: log analytics, status feeds, network messages, spam filters
+ Product: financial, pharmaceutical, insurance, energy, retail, ecommerce, healthcare, telecom
+ Business: analytics, BI
![Page 31: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/31.jpg)
Business analytics
+ Analytic
+ Operational
www.datasciencecentral.com
Capture, analyze, learn from data
![Page 33: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/33.jpg)
Why Hadoop?
www.thinkbiganalytics.com
![Page 34: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/34.jpg)
Cloudera
+ Enterprise support for Apache Hadoop
+ Founded 2008, funding $141 M
+ Employees: 230
+ Products:
+ CDH 4 (Cloudera distribution of Hadoop)
+ Impala
+ Consulting and training
www.cloudera.com
![Page 35: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/35.jpg)
MapR
+ Founded 2009, funding $20M
+ MapR Technologies is engineering game-changing Map/Reduce related technologies
+ Products:
+ M3, M5, M7
+ NFS, no single point of failure
+ NOT open source!
www.mapr.com
![Page 36: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/36.jpg)
HortonWorks
+ Founded 2011
+ Yahoo spin-off
+ Products:
+ HDP distribution
+ tools
www.hortonworks.com
![Page 38: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/38.jpg)
Big Data Landscape
www.bigdatalandscape.com
![Page 39: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/39.jpg)
Splunk
+ Founded 2003, raised $230M, IPO 2012, market cap $3.35B
+ Machine logs analysis, operational intelligence
+ Collecting, searching, monitoring
www.splunk.com
![Page 40: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/40.jpg)
Datameer
+ Founded 2009, funding $17.8M
+ Big data:
+ Data integration
+ Data Analytics
+ Data Visualization
www.datameer.com
![Page 41: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/41.jpg)
Datasift
+ Founded 2010, funding $29.7M
+ Data platform for social web
+ Aggregate and filter data
www.datasift.com
![Page 42: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/42.jpg)
Infochimps
+ Founded 2009, funding $5.5M
+ Transitioned from data marketplace to big data platform
+ End-to-end big data solution, real time
www.infochimps.com
![Page 43: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/43.jpg)
Tableau Software
+ Founded 2003, funding $15M
+ Big data analytics
+ Big data visualization
www.tableau.com
![Page 44: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/44.jpg)
Big data Startups 2012
+ Platfora, in memory BI on Hadoop
+ Sumologic, log file analysis
+ Hadapt, Hadoop + RDBMS
+ Metamarkets, patterns in data flow
+ DataStax, consulting, training
+ Karmasphere, BI, analytics on Hadoop
![Page 45: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/45.jpg)
Big data startups 2013!
+ 10gen, MongoDB
+ ClearStory, big data aggregation + analytics
+ Continuuity, Hadoop API
+ Parstream, database analytics
+ Zoomdata, data visualization
+ Climate corporation, predictive analytics
![Page 47: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/47.jpg)
Big data Processing
| | Batch processing | Interactive | Stream |
|---|---|---|---|
| Query time | minutes to hours | milliseconds to seconds | continuous |
| Data volume | TB to PB | GB to PB | continuous |
| Programming model | MapReduce | Queries | DAG |
| Users | Developers | Analysts | Developers |
| Open source | Hadoop MapReduce | Drill, Impala | Storm, Kafka |
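The difference between the batch and stream columns can be shown on the same computation. In this plain-Python sketch (no Hadoop or Storm involved; the event names are made up), the batch version sees the whole dataset and answers once, while the stream version emits an updated answer after every event.

```python
from collections import Counter


def batch_count(events):
    """Batch: read the complete dataset, then produce one final result."""
    return Counter(events)


def stream_count(events):
    """Stream: keep running state and emit an update per incoming event."""
    running = Counter()
    for event in events:
        running[event] += 1
        yield event, running[event]


log = ["login", "click", "click", "login", "click"]

batch_result = batch_count(log)
stream_updates = list(stream_count(log))

print(batch_result)
print(stream_updates)
```

Both approaches end with the same totals; what changes is latency (one answer at the end versus a fresh answer per event) and the need to maintain state between events.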
![Page 48: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/48.jpg)
New technologies
+ Real-time querying
+ Drill (based on Google Dremel)
+ Impala (Cloudera)
+ Data stream processing
+ Storm (Twitter), real time analytics
+ Kafka (LinkedIn), messaging system
![Page 49: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/49.jpg)
Machine learning
+ Predictive analytics
+ Pattern discovery
+ Data mining
+ Tools:
+ Mahout
+ R
![Page 50: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/50.jpg)
Big data revolution
+ Google: GFS, MapReduce, BigTable
+ Yahoo: Hadoop
+ Amazon: DynamoDB
+ Facebook: Cassandra, HBase
+ Twitter: FlockDB, Storm
+ LinkedIn: Voldemort, Kafka
![Page 51: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/51.jpg)
Observations
+ Game changing technologies come from big companies
+ Open Source (!)
+ Start-up ecosystem
+ Less general, more specialized
+ Next step: big data analytics and visualization
![Page 52: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/52.jpg)
Data scientist
+ Machine Learning
+ Data Mining
+ Statistics
+ Software Engineering
+ Hadoop/MapReduce/HBase/Hive/Pig
+ Java, Python, C/C++, SQL
“By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
![Page 53: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/53.jpg)
Big Data Products MindMap
www.garycrawford.co.uk
![Page 54: Business of Big Data](https://reader038.fdocuments.us/reader038/viewer/2022102700/54c6a34e4a79598f1c8b456a/html5/thumbnails/54.jpg)
Contacts
+ Leonid Zhukov, Ph.D.
+ School of Applied Mathematics and Information Science Higher School of Economics, NRU-HSE
+ www.leonidzhukov.ru