SCALABILITY AND DATA ANALYTICS MATTER
HCB (@boosc)
Agenda
• Buzzword bingo
• Data
• Analytics
• Scalability
• Distributed and parallel concepts
• Technology and tools
• Senzari and big data
Buzzword Bingo
Big Data
Data Engineer
H-Space
Hadoop
Cassandra
HBase
PIG
redis.io
Eucalyptus
Machine Learning
Support Vector Machines
Gaussian Processes
Swarm Intelligence
Genetic Algorithms
Agents/Bots
R+Natural Language Processing
Clustering
Core Dataset
NoStats
Data, lots of it
One iPhone has 79 times more CPU power than was used in the Apollo missions
What we can do
Data
Knowledge pyramid
Data Processing (1950s-1960s): Data
Data:
Unfiltered, Research, Creation, Gathering
Information Management (1970s-1980s): Information
Information:
Organized Data, Patterns, Presentation
Knowledge Management (1990s): Knowledge
Knowledge:
Useful Patterns, Predictability, Conversation
Knowledge Ecology (2000s): Intelligence
Intelligence:
Choice, Understanding, Decision
Systems Thinking (2010s): Wisdom
Wisdom:
Evaluation, Interpretation, Retrospective
Why you need big data
Data Processing (1950s-1960s): Data
Information Management (1970s-1980s): Information
Knowledge Management (1990s): Knowledge
Knowledge Ecology (2000s): Intelligence
Systems Thinking (2010s): Wisdom
Yield
You are here!
Analytics
Even on simple datasets, common statistics (average, min, max, distribution) fail
Finding clusters, evaluating outliers and interpreting white noise
Two tips for looking at data:
1. Plot it
2. Remove all labels
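As a rough illustration of why plotting matters, the sketch below builds two made-up datasets that share the same average, minimum and maximum but have clearly different shapes. The data values and the helper names `summary` and `text_histogram` are assumptions for this example and do not come from the talk. The plot is a crude text histogram, so nothing beyond the Python standard library is needed.

```python
from collections import Counter
from statistics import mean


def summary(values):
    """Return the 'common statistics': (average, minimum, maximum)."""
    return (mean(values), min(values), max(values))


def text_histogram(values):
    """Draw one row per integer value, with one '#' per occurrence."""
    counts = Counter(values)
    rows = []
    for value in range(min(values), max(values) + 1):
        rows.append(f"{value:>3} | " + "#" * counts[value])
    return "\n".join(rows)


# Two hypothetical datasets: one clustered in the middle, one with two peaks.
clustered = [0, 5, 5, 5, 5, 5, 10]
two_peaks = [0, 0, 0, 5, 10, 10, 10]

if __name__ == "__main__":
    for name, data in (("clustered", clustered), ("two_peaks", two_peaks)):
        print(name, summary(data))
        print(text_histogram(data))
        print()
```

Both summaries come out identical, while the two histograms look nothing alike, which is the point of tip 1.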
Scalability
Cloud Computing Is
When the IT guys are finally able to explain to business people what they were talking about 20 years ago!
= Computation on demand + pay as you go
BASE (Basically Available, Soft state, Eventual consistency)
not
ACID (Atomicity, Consistency, Isolation, Durability)
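To make "eventual consistency" a little more concrete, here is a minimal sketch of two replicas that accept writes independently and only converge once they exchange state. The `Replica` class, the `sync` function and the last-write-wins rule based on a caller-supplied timestamp are assumptions chosen for this example; real stores such as Cassandra use considerably more involved mechanisms.

```python
class Replica:
    """A toy key-value replica that keeps (timestamp, value) per key."""

    def __init__(self):
        self.store = {}

    def write(self, key, value, timestamp):
        # Last write wins: only accept the value if it is newer than ours.
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry is not None else None


def sync(a, b):
    """Exchange state in both directions so both replicas converge."""
    for source, target in ((a, b), (b, a)):
        for key, (timestamp, value) in list(source.store.items()):
            target.write(key, value, timestamp)


if __name__ == "__main__":
    east, west = Replica(), Replica()
    east.write("likes", 1, timestamp=1)
    print(west.read("likes"))  # stale read: west has not seen the write yet
    sync(east, west)
    print(west.read("likes"))  # after syncing, both replicas agree
```

Between the write and the sync, the system stays available but may return stale data; that is the trade-off BASE accepts and ACID does not.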
How to scale (AWS Example)
• Do not allocate instances manually
• Each component needs to be independent
• Plan for failure
• Actively provoke failure
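The last two points can be sketched without any cloud API at all. The example below is a simplified, in-memory model: `Instance`, `call_with_failover` and `provoke_failure` are hypothetical names invented here, and no real AWS calls are made. It only shows the shape of the idea: callers never depend on one specific instance, and a test deliberately kills one to prove that.

```python
import random


class InstanceDown(Exception):
    """Raised when a request reaches an instance that is not running."""


class Instance:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise InstanceDown(self.name)
        return f"{self.name} handled {request}"


def call_with_failover(instances, request):
    """Plan for failure: try each independent instance until one answers."""
    for instance in instances:
        try:
            return instance.handle(request)
        except InstanceDown:
            continue
    raise RuntimeError("all instances are down")


def provoke_failure(instances, rng):
    """Actively provoke failure: take down one randomly chosen instance."""
    victim = rng.choice(instances)
    victim.alive = False
    return victim


if __name__ == "__main__":
    pool = [Instance(f"web-{i}") for i in range(3)]
    victim = provoke_failure(pool, random.Random())
    print("killed", victim.name)
    print(call_with_failover(pool, "ping"))
```

Running the failure injection regularly, rather than waiting for a real outage, is what turns "plan for failure" from a hope into something tested.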
Human Software
• Click Workers and Mechanical Turks are not just cheap labour
• They let programmers hand tasks that cannot be handled algorithmically over to humans
• Use them to:
• Do things too complicated for machine learning
• Pre-populate machine learning spaces
Distributed and parallel concepts
Imperative Programming
• Step-by-step explanation of what to do
• Explaining WHAT to do rather than the RESULTS you want
• Always necessary for basic algorithms
Functional Programming I
• Combine results to become a program
• Allows dynamic distribution
• Map-Reduce is only one way of doing it!
Functional Programming II
F(G(H(A, B), C), D)
getMusicLikes(getFriends(facebookID))
Instead of
for i in getFriends(facebookID): getMusicLikes(i)
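The contrast between the two styles can be sketched in Python. Everything below is illustrative: `get_friends` and `get_music_likes` are stand-ins backed by a small made-up dictionary, not a real Facebook API. The functional version expresses the result as a composition of map and reduce, so the mapping step can be swapped for a parallel one without touching the rest.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical in-memory data standing in for remote services.
FRIENDS = {"alice": ["bob", "carol"]}
LIKES = {"bob": {"Daft Punk", "Moby"}, "carol": {"Moby", "Portishead"}}


def get_friends(user_id):
    return FRIENDS.get(user_id, [])


def get_music_likes(user_id):
    return LIKES.get(user_id, set())


def likes_imperative(user_id):
    """Step by step: loop over friends and accumulate their likes."""
    result = set()
    for friend in get_friends(user_id):
        result |= get_music_likes(friend)
    return result


def likes_functional(user_id, map_fn=map):
    """Compose results: map over friends, then reduce with set union."""
    return reduce(set.union, map_fn(get_music_likes, get_friends(user_id)), set())


def likes_parallel(user_id):
    """Same composition, with the map step distributed over a thread pool."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return likes_functional(user_id, map_fn=pool.map)


if __name__ == "__main__":
    print(sorted(likes_imperative("alice")))
    print(sorted(likes_parallel("alice")))
```

Because `likes_functional` never says in which order or on which worker each friend is processed, the runtime is free to distribute the work; a thread pool is just one possible choice here, as Map-Reduce is in a cluster.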
Technology and tools
Data Storage
• Cassandra - for write performance
• HBase - for read performance
• Redis.io - for predictable operation time
Other Data Storage
• MongoDB - NoSQL for beginners (close to SQL, but scaling is very manual)
• sones - Graph DB (Windows based)
• CouchDB, etc. - nice concepts, lots of great ideas, but communities too small
Distributed Computing
• Hadoop
• Zookeeper as DLS
Languages
• ERLANG
• HASKELL
• SCALA
• Lisp
• Prolog
• Mathematica
Do You Have to Learn ERLANG? No, Use Hadoop Streaming With Python
[Diagram: Program 1 emits lines on STDOUT, which are passed to Program 2]
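A minimal sketch of what two such programs could look like for a word count, assuming the usual Hadoop Streaming convention of tab-separated key/value lines on standard input and output. The function names and the single-file `map`/`reduce` switch are choices made for this example, not something prescribed by Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Program 1: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Program 2: sum the counts of consecutive lines with the same key.

    The input must be sorted by key, which the framework (or `sort`) does.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Pick the step from the command line: "map" or "reduce".
    if len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
        step = map_lines if sys.argv[1] == "map" else reduce_lines
        for output_line in step(sys.stdin):
            print(output_line)
```

Saved under a hypothetical name such as `wordcount.py`, the pipeline can be tried locally, without a cluster, as `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`; the `sort` in the middle plays the role of the shuffle phase.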
Check out my tool list: http://www.hcboos.net/100-links/
Senzari and big data
The AMP3 Platform: Adaptable Music Parallel Processing Platform
Behind AMP
Technologies
• AWS: EC2, S3, EBS, SNS, ELB
• Cassandra + Hadoop + Solandra
• Zookeeper
• Dynamic scaling server (Lich Lord)
• Asynchronous messaging system
• Modules built in Python
Effects
• Built on top of a Python platform
• Fully automated scaling
• Fully distributed data processing
• Message channels allow code decoupling
• Message channels allow replay
• Message channels allow outtasking
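The decoupling and replay effects can be illustrated with a very small in-process channel. This is only a sketch under simplifying assumptions: the `Channel` class is invented for this example, delivery is synchronous rather than asynchronous, and it says nothing about how the AMP3 messaging system is actually implemented.

```python
class Channel:
    """A toy message channel that logs every message so it can be replayed."""

    def __init__(self):
        self._log = []
        self._subscribers = []

    def publish(self, message):
        # The publisher does not know who consumes the message (decoupling).
        self._log.append(message)
        for handler in list(self._subscribers):
            handler(message)

    def subscribe(self, handler, replay=False):
        # A late subscriber can ask for everything published so far (replay).
        if replay:
            for message in self._log:
                handler(message)
        self._subscribers.append(handler)


if __name__ == "__main__":
    channel = Channel()
    channel.publish({"track": "song-1", "event": "played"})

    seen = []
    channel.subscribe(seen.append, replay=True)
    channel.publish({"track": "song-2", "event": "skipped"})
    print(seen)
```

A new module can be attached later and still process the full history, and the same hook could hand messages to an external worker instead of a local function.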
Thank You for Your Time
Credits
• "Big Data Just Beginning to Explode" by CSC, http://www.csc.com/insights/flxwd/78931-big_data_just_beginning_to_explode
• "Social media network connections among Twitter users" by Marc Smith, http://www.flickr.com/photos/marc_smith/
• Asteroid datasets by Bruce Gary, http://brucegary.net/POVENMIRE/x.htm