Big data Intro - Presentation to OCHackerz Meetup Group
Introduction to Big Data
Sri Kanajan
Big Data
• When data is too VVV (volume, variety, velocity) to manage with traditional RDBMS, then you enter BIG DATA!
• Data Storage and Manipulation, at Scale
  – MapReduce, Hadoop, relationship to databases (Framework)
  – Key-value stores and NoSQL; tradeoffs of SQL and NoSQL (Database type)
  – Entity resolution, record linkage, data cleaning (data integration)
• Analytics (Machine Learning)
  – Basic statistical modeling, experiment design, overfitting
  – Supervised learning: overview, simple nearest neighbor, decision trees/forests, regression
  – Unsupervised learning: k-means, multi-dimensional scaling
  – Graph Analytics: PageRank, community detection, recursive queries, iterative processing
  – Text Analytics: latent semantic analysis
  – Collaborative Filtering: slope-one
• Communicating Results
  – Visualization, data products, visual data analytics
Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
  – Hadoop, MapReduce
  – Storage, Processing
  – Machine Learning
  – Analytics
  – Visualization
Big Data Everywhere!
• Lots of data is being collected and warehoused
  – Web data, e-commerce
  – Purchases at department/grocery stores
  – Bank/Credit Card transactions
  – Social Network
Unknown Hidden Relationships within this Data!
How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
640K ought to be enough for anybody.
Types of Data
• Relational Data (Tables/Transaction/Legacy Data)
• Unstructured Text Data
  – Log data, comments, user-generated text
• Semi-structured Data (XML)
• Graph Data
  – Social Network, Semantic Web (RDF)
• Real-time Data
  – You can only scan the data once and need to do analytics quickly
What does Big Data Give You?
• Without Big Data
  – Many separate data warehouses on non-distributed architectures
  – Had to modify data structures and write custom programs to merge databases together
  – Scaling database size is a continual problem
  – Any large-scale analytics took days or weeks and a large coordination effort within IT to get database access
  – Data analysis is a large effort, and lots of data tends to remain unanalyzed or, even worse, not stored
• With Big Data
  – Hadoop provides a single view of all databases that can be distributed
  – Database size is a non-issue
  – Ability to perform advanced statistical analysis on very large datasets very quickly
  – Data analysis is the competitive edge for many companies, since barriers to entry are continually dropping through the development of platforms
Examples
• Norwegian Food Safety Authority
  – accumulates data on all farm animals
  – birth, death, movements, medication, samples, ...
• Hafslund
  – time series from hydroelectric dams, power prices, meters of individual customers, ...
• Social Security Administration
  – data on individual cases, actions taken, outcomes...
• Statoil
  – massive amounts of data from oil exploration, operations, logistics, engineering, ...
• Retailers
  – see Target example above
  – also, connection between what people buy, weather forecast, logistics, ...
Big Data
Power of Distribution
45 Minutes! 4.5 Minutes!
Hadoop
• A framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model (i.e., MapReduce)
  – Distributed data processing
  – Works with structured and unstructured data
  – Open source
  – Master-slave architecture
  – Fault tolerant using commodity hardware
MapReduce
• Programming model used to write jobs that run on Hadoop
• Basic concept is to provide a programming model that immediately supports parallel processing (SQL, on the other hand, does not natively encourage parallel processing)
• Pig is a framework with its own language (Pig Latin) for developing MapReduce jobs
• Note
  – MapReduce is great for extremely large data sets with simple relations. SQL is great for medium-size data sets with complex relationships
  – I.e., you have to choose the right technology for your problem space
A Simple Example
• Counting words in a large set of documents

map(string key, string value)
    // key: document name
    // value: document contents
    for each word w in value
        EmitIntermediate(w, "1");

reduce(string key, iterator values)
    // key: word
    // values: list of counts
    int result = 0;
    for each v in values
        result += ParseInt(v);
    Emit(AsString(result));
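The word-count pseudocode can be tried on a single machine. The sketch below is only an in-memory imitation of the map, shuffle, and reduce phases, not Hadoop code; the function names (map_words, shuffle, reduce_counts, word_count) and the two sample documents are made up for illustration.

```python
from collections import defaultdict

def map_words(doc_name, contents):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in contents.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group intermediate values by key, the job a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)

def word_count(documents):
    """Run map over every document, then shuffle, then reduce each key."""
    pairs = [pair
             for name, text in documents.items()
             for pair in map_words(name, text)]
    return dict(reduce_counts(word, counts)
                for word, counts in shuffle(pairs).items())

# Hypothetical sample input: document name -> document contents
docs = {"a.txt": "big data is big", "b.txt": "data at scale"}
counts = word_count(docs)
print(counts)
```

In a real cluster each map call could run on a different machine, because no map call depends on another; only the shuffle step needs to move data between machines.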
MapReduce
Machine Learning
• Essentially ways to analyze data to extract valuable information with or without training data
  – Prediction
    • predicting a variable from data
  – Classification
    • assigning records to predefined groups
  – Clustering
    • splitting records into groups based on similarity
  – Association learning
    • seeing what often appears together with what
  – And many others...
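As a small illustration of association learning ("what often appears together with what"), the sketch below counts how often each pair of items shows up in the same basket. It is a bare co-occurrence count rather than a full association-rule algorithm, and the basket data and the name pair_counts are invented for the example.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical shopping baskets
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]
counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```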
Now you have an optimization metric by which you can automate the exploration of all possible hypotheses! Problems with this approach?
Two kinds of learning
• Supervised
  – we have training data with correct answers
  – use training data to prepare the algorithm
  – then apply it to data without a correct answer
• Unsupervised
  – no training data
  – throw data into the algorithm, hope it makes some kind of sense out of the data
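To make the contrast concrete, here is a minimal sketch of one algorithm of each kind, using the two named in the course overview: nearest neighbor (supervised) and k-means (unsupervised). The data points, the labels "small" and "large", and the choice of starting centers are all assumptions made for the example.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbor(train, point):
    """Supervised: return the label of the closest labeled training example."""
    _, label = min(train, key=lambda example: dist2(example[0], point))
    return label

def k_means(points, centers, iterations=10):
    """Unsupervised: group unlabeled points around the given starting centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            clusters[idx].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical training data: (features, correct answer)
train = [((1.0, 1.0), "small"), ((1.5, 2.0), "small"),
         ((8.0, 8.0), "large"), ((9.0, 9.5), "large")]
predicted = nearest_neighbor(train, (2.0, 1.5))

# The same points with the labels thrown away
points = [features for features, _ in train]
centers, clusters = k_means(points, centers=[points[0], points[2]])
print(predicted, centers)
```

The supervised function needs the labels to answer at all; the unsupervised one never sees them and only reports which points ended up grouped together.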
Example: Collaborative Filtering
• Goal: predict what movies/books/… a person may be interested in, on the basis of
  – Past preferences of the person
  – Other people with similar past preferences
  – The preferences of such people for a new movie/book/…
• One approach based on repeated clustering
  – Cluster people on the basis of preferences for movies
  – Then cluster movies on the basis of being liked by the same clusters of people
  – Again cluster people based on their preferences for (the newly created clusters of) movies
  – Repeat above till equilibrium
• Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
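The sketch below predicts a missing rating with weighted Slope One, the collaborative filtering method named in the course overview. Note that this is a different technique from the repeated-clustering approach on this slide; it is shown because it fits in a few lines. The users, items, and ratings are invented, and slope_one_predict is a hypothetical helper name.

```python
def slope_one_predict(ratings, user, target):
    """Predict `user`'s rating of `target` with weighted Slope One.

    ratings maps each user to a dict of {item: rating}.
    Returns None when no other item overlaps with the target.
    """
    numerator = 0.0
    denominator = 0
    for item, user_rating in ratings[user].items():
        if item == target:
            continue
        # Rating differences (target minus item) from everyone who rated both
        diffs = [r[target] - r[item]
                 for r in ratings.values()
                 if target in r and item in r]
        if diffs:
            avg_diff = sum(diffs) / len(diffs)
            # Weight each item's estimate by how many users support it
            numerator += (user_rating + avg_diff) * len(diffs)
            denominator += len(diffs)
    return numerator / denominator if denominator else None

# Hypothetical ratings on a 1-5 scale
ratings = {
    "alice": {"A": 5, "B": 3, "C": 2},
    "bob":   {"A": 3, "B": 4},
    "carol": {"B": 2, "C": 5},
}
prediction = slope_one_predict(ratings, "bob", "C")
print(round(prediction, 2))
```

Here the estimate for bob combines his own past ratings with how other people rated item C relative to the items he has already rated, which mirrors the three inputs listed under the goal above.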
Is this an effective visual representation?
Better Mapping? Why?
Diagrams Showing O-Ring Damage That Were Used to Decide to Launch Challenger in 1986
Representation of the Same Data
Strategies to Increase the Information Encoded by Spatial Position
• Composition
  – Orthogonal placement of axes
  – Creates a 2D metric space
Strategies to Increase the Information Encoded by Spatial Position
• Alignment
Folding
• Continuation of the Axes
Recursion
Overloading
Conclusion
• Big Data is a huge field that combines expertise from different domains in order to find interesting information from data
• Extracting interesting information from data is the next competitive edge for many companies as information becomes available instantly, anywhere