Information Retrieval, Applied Statistics and Mathematics on BigData - German Society of Physicists...
IBM Data Warehousing Portfolio Positioning
Information Retrieval, Applied Statistics and Mathematics on BigData
Romeo Kienzler
Data Scientist and Architect
IBM Innovation Center Zurich
Fault Tolerance / Commodity Hardware
AMD Turion II Neo N40L (2x 1,5 GHz / 2 MB / 15 W), 8 GB RAM, 3 TB SEAGATE Barracuda 7200.14, < 500 EURO
100 K => 200 X (2, 4, 3) => 400 Cores, 1,6 TB RAM, 200 TB HD
MTBF ~ 365 d > 1,5 d
Supercomputer in a Rack
Supercomputer before
Weather
Atom Bombs
Science
Crash Tests
Supercomputer in a Rack
18 TB Main Memory, 1008 CPU Cores, 113 TFLOPS (1st in TOP500 2013: 17590 TFLOPS; 2004: 71 TFLOPS)
Hadoop / BigInsights
Hadoop Distributed File System
Hadoop Job Scheduling
Aggregated Bandwidth between CPU, Main Memory and Hard Drive
1 TB (at 10 GByte/s)
- 1 Node - 100 sec
- 10 Nodes - 10 sec
- 100 Nodes - 1 sec
- 1000 Nodes - 100 msec
Watson
1 TB (at 45.5 GByte/s)
- 1 Core - 22 sec
- 10 Cores - 2.2 sec
- 100 Cores - 220 msec
- 1000 Cores - 22 msec
- 10000 Cores - 2.2 msec
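The scan times listed for Hadoop and Watson follow from dividing the data size by the aggregated bandwidth. A minimal sketch of that arithmetic, assuming the quoted bandwidth applies per node (or per core), that the data is split evenly, and that there is no coordination overhead; the function name `scan_seconds` is made up for this illustration:

```python
# Idealized scan-time model (assumption): every worker reads its share of
# the data in parallel at the same bandwidth, with no overhead.

def scan_seconds(data_gbyte: float, gbyte_per_sec: float, workers: int) -> float:
    """Time to read data_gbyte once when it is split evenly over `workers`."""
    return data_gbyte / (gbyte_per_sec * workers)

ONE_TB = 1000.0  # GByte, decimal units as on the slides

# Hadoop slide: 10 GByte/s per node
for nodes in (1, 10, 100, 1000):
    print(f"{nodes:>5} nodes: {scan_seconds(ONE_TB, 10.0, nodes):.4g} s")

# Watson slide: 45.5 GByte/s per core
for cores in (1, 10, 100, 1000, 10000):
    print(f"{cores:>5} cores: {scan_seconds(ONE_TB, 45.5, cores):.4g} s")
```

Real clusters fall short of this linear scaling because of scheduling, network transfer and stragglers, so the figures are best read as upper bounds on the achievable speed-up.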
Data Streaming
[Diagram: System S runtime. Hardware: X86 Box, X86 Blades, Cell Blade, FPGA Blade. Layers: Operating System, Transport, System S Data Fabric, Processing Element Containers]
Massively Parallel Data Warehousing
Why do we need to process so much data?
Data Growth
Data AVAILABLE to an organization
data an organization can PROCESS
Missed opportunity
100 million tweets are posted every day, 35 hours of video are uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net, 80 % of them spam and viruses. => Filtering is more and more important.
As much data has been produced between 2003 and now as in all the time up to 2003
Separate the Signal From the Noise
http://www.ibmsystemsmag.com/power/businessstrategy/BI-and-Analytics/signal_noise/
The Unreasonable Effectiveness of Data
"sometimes it's not who has the best algorithm that wins; it's who has the most data."
(C) Google Inc.
http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
Statistical Modeling of Physical Systems
From Unstructured Data to Structured Data - Feature Extraction
Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately
Source: Wikipedia
Dimension Reduction
Principal Component Analysis / Singular Value Decomposition
Linear Discriminant Analysis
Source: coursera.org
Data Parallelism
Calculate the empirical mean along each dimension m = 1, ..., M (step in Principal Component Analysis)
N-gram Models (NLP)
Ordinary Least-Squares Parameter Estimator for Linear Regression
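All three examples parallelize the same way: each data partition computes a small partial result independently, and the partial results are merged by addition. The following is a minimal single-machine sketch of that pattern, not the implementation used by any of the systems in this talk. The toy data, the helper names (`map_stats`, `combine`, `map_bigrams`) and the restriction to one predictor variable are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce

# Toy rows (x, y), split into partitions that stand in for blocks on
# different nodes. The values are invented for this example.
partitions = [
    [(1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0)],
    [(4.0, 9.0), (5.0, 11.0)],
]

def map_stats(rows):
    """Per-partition sufficient statistics: n, sum x, sum y, sum x*x, sum x*y."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Merge two partial results by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))

n, sx, sy, sxx, sxy = reduce(combine, map(map_stats, partitions))

# Empirical mean along each dimension.
mean_x, mean_y = sx / n, sy / n

# Ordinary least-squares estimate for y = intercept + slope * x,
# obtained from the normal equations using only the merged sums.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = mean_y - slope * mean_x

# N-gram (here: bigram) counts; each document is one partition, so
# n-grams that would span two documents are deliberately not counted.
docs = [["big", "data", "is", "big"], ["big", "data", "wins"]]

def map_bigrams(tokens):
    """Count adjacent token pairs within one partition."""
    return Counter(zip(tokens, tokens[1:]))

bigram_counts = reduce(lambda a, b: a + b, map(map_bigrams, docs), Counter())

print(mean_x, mean_y, slope, intercept)
print(bigram_counts.most_common(1))
```

Because the merge step is plain addition, it is associative and commutative, so the partial results can be combined in any order and on any node.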
BUT: Do I want to care about algorithm parallelization?
High-Level Languages
Source: Hadoopsphere.com
High-Level Languages (IBM SystemML)
Extensible Library
- Classification: Linear SVMs, Logistic Regression
- Clustering: K-means
- Regression: Linear Regression
- Matrix Factorizations: SGD solver, NMF
- Ranking: PageRank, HITS
Parser
High-Level Ops
Low-Level Ops
Runtime Ops
Optimizations
Hadoop
DML Scripts
Open Source Variant: Apache Mahout
- fewer algorithms
- no optimizer
High-Level Languages (RHadoop)
Source: http://www.revolutionanalytics.com
High-Level Languages (R on IBM PureData)
Source: http://www.revolutionanalytics.com
Push Back
Application
Algorithm
Compile
Engine Execution Language
Engine
Push Back
Source: coursera.org
Linear Discriminant Analysis
Outlook
Theory: With BigData the machines are thinking for us
Reality: Existing algorithms are now beginning to be applied on a large scale
Present: Every company thinks it urgently has to participate in BigData, but doesn't know how
Future: Every company will have access to BigData technologies and will use them
Hype: The whole world is doing BigData
Vision: BigData Analytics is usable for everybody at their fingertips
Questions?
Links
www.ibm.com/developerworks
www.ibm.com/ibm/university/academic
U6K8qm_HFas
Jqq66INlQ0U
2012 IBM Corporation
August 28, 2012