Uploaded by dorai-thodla
Big Data – A Brief Overview
Petabytes, Hadoop, Analytics, Collaborative business intelligence, Data scientists, In-Memory Databases, NoSQL platforms
Big Data
• What is it?
• Where does it come from?
• How do we process it?
• What do we do with it?
• Who are the players?
• What are the opportunities?
What Is Big Data?
Like the term Cloud, it is a bit Nebulous
Attributes of Big Data
• Volume
• Velocity (streaming)
• Variety
Where Does It Come From?
It Depends
Key Drivers
Spread of cloud computing, mobile computing and social media technologies, financial transactions
Sources of Big Data
• Chatter from social networks
• Web server logs
• Traffic flow sensors
• Satellite imagery
• Broadcast audio streams
• Banking transactions
• MP3s of rock music
• The content of web pages
• Scans of government documents
• GPS trails
• Telemetry from automobiles
• Financial market data
• …
How Do We Process It?
Source: http://radar.oreilly.com
Process Pipeline
Hadoop
A distributed processing framework based on Map/Reduce
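To make the Map/Reduce idea concrete, here is a minimal single-process sketch of the classic word count. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on an in-memory list of records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each map call looks at one record and each reduce call looks at one key, both steps can be spread over many workers without coordination, which is what makes the model suit very large inputs.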
Pig
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
Mahout
A machine learning library with algorithms for clustering, classification and batch-based collaborative filtering that are implemented on top of Apache Hadoop.
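As an illustration of batch collaborative filtering, the sketch below recommends items by counting how often they co-occur in users' histories. It does not use Mahout's API; the helper names (`cooccurrence`, `recommend`) and the sample data are assumptions made for this example only.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(user_items):
    """Count, for every pair of items, how many users have both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user, user_items, counts, top_n=3):
    """Score unseen items by their co-occurrence with the user's own items."""
    seen = set(user_items[user])
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Hypothetical usage histories, one list of items per user.
history = {
    "alice": ["hadoop", "hive", "pig"],
    "bob": ["hadoop", "hive"],
    "carol": ["hadoop", "mahout"],
    "dave": ["hive", "pig"],
}
counts = cooccurrence(history)
suggestions = recommend("bob", history, counts)
print(suggestions)
```

The co-occurrence counts are computed once over the whole data set, which is the "batch" part; recommendations are then cheap lookups against that precomputed table.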
Hive
Data warehouse software built on top of Apache Hadoop that facilitates querying and managing large datasets residing in distributed storage.
Pegasus
A peta-scale graph mining system that runs in a parallel, distributed manner on top of Hadoop.
Sqoop
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
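The bulk-transfer idea can be sketched as exporting a relational table into several delimited text chunks, one per key range, so that each chunk could be written by a separate worker. This is only loosely modeled on that approach and is not Sqoop itself: the in-memory SQLite table `orders`, the function `export_in_splits` and its parameters are all hypothetical.

```python
import csv
import io
import sqlite3


def export_in_splits(conn, table, key, num_splits):
    """Dump `table` as CSV text, one chunk per contiguous range of `key`.

    `table` and `key` are interpolated into SQL, so they must be trusted
    identifiers; the table is assumed to be non-empty with an integer key.
    """
    lo, hi = conn.execute(
        f"SELECT MIN({key}), MAX({key}) FROM {table}"
    ).fetchone()
    step = (hi - lo) // num_splits + 1  # width of each key range
    chunks = []
    for start in range(lo, hi + 1, step):
        rows = conn.execute(
            f"SELECT * FROM {table} WHERE {key} >= ? AND {key} < ? ORDER BY {key}",
            (start, start + step),
        ).fetchall()
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(rows)
        chunks.append(buf.getvalue())
    return chunks


# Hypothetical source table standing in for a production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(i, i * 1.5) for i in range(1, 7)]
)
chunks = export_in_splits(conn, "orders", "id", num_splits=2)
for n, chunk in enumerate(chunks):
    print(f"--- split {n} ---")
    print(chunk, end="")
```

Splitting on a key range keeps the chunks independent, so in a real deployment each one could be read from the database and written to distributed storage in parallel.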
Flume
A distributed service for collecting, aggregating, and moving large amounts of log data to HDFS.
Yahoo S4
S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
Twitter Storm
Storm can be used to process a stream of new data and update databases in real time.
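The stream-and-update pattern can be sketched in plain Python: events arrive one at a time and a store is updated on each event, rather than after a batch job finishes. This is not Storm's API; `hashtag_stream` and `CountStore` are hypothetical stand-ins for a stream source and a database.

```python
from collections import Counter


def hashtag_stream(tweets):
    """Stream source: yield hashtags one at a time as tweets arrive."""
    for tweet in tweets:
        for token in tweet.split():
            if token.startswith("#"):
                yield token.lower()


class CountStore:
    """Stand-in for a database that is updated on every incoming event."""

    def __init__(self):
        self.counts = Counter()

    def update(self, key):
        """Increment the running count for `key` and return the new value."""
        self.counts[key] += 1
        return self.counts[key]


store = CountStore()
for tag in hashtag_stream(["Loving #BigData", "#bigdata meets #Hadoop"]):
    store.update(tag)  # the store is current after each single event

print(dict(store.counts))
```

The generator never needs the whole input in memory, so the same loop works for an unbounded stream; the counts are always up to date with the last event processed.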
Trends
Funding, Companies, Applications, Jobs, IPOs
Funding & IPO
• Cloudera (commercial Hadoop) has raised more than $75 million
• MapR (Cloudera competitor) has raised more than $25 million
• 10Gen (maker of MongoDB) has raised $32 million
• DataStax (products based on Apache Cassandra) has raised $11 million
• Splunk raised about $230 million through its IPO
Big Data Application Domains
• Healthcare
• The public sector
• Retail
• Manufacturing
• Personal-location data
• Finance
A Few Examples
PayPal Tracking Architecture
Market and Market Segments
Research Data and Predictions
http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues
The market for big data tools is predicted to rise from $9 billion to $86 billion in 2020
Future of Big Data
• More Powerful and Expressive Tools for Analysis
• Streaming Data Processing (Storm from Twitter and S4 from Yahoo)
• Rise of Data Market Places (InfoChimps, Azure Marketplace)
• Development of Data Science Workflows and Tools (Chorus, The Guardian, New York Times)
• Increased Understanding of Analysis and Visualization
http://www.evolven.com/blog/big-data-predictions.html
Opportunities
Skills Gap
• Statistics
• Operations Research
• Math
• Programming
• So-called "Data Hacking"