1
Scaling Up Classifiers to Cloud Computers
Christopher Moretti, Karsten Steinhaeuser, Douglas Thain, Nitesh V.
Chawla
University of Notre Dame
3
Distributed Data Mining
Data Mining on Clouds
Abstraction for Distributed Data Mining
Implementing the Abstraction
Evaluating the Abstraction
Take-aways
4
Distributed Data Mining
For training D, testing T, and classifier F:
Divide D into N partitions with partitioner P
Run N copies of F, one on each partition, generating a set of votes on T for each partition
Collect votes from all copies of F and combine into a final result R
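The three steps above can be sketched end to end in a few lines. The round-robin partitioner and majority-vote combiner below are illustrative stand-ins, not the paper's exact components, and the toy classifier exists only to make the sketch runnable:

```python
from collections import Counter

def partition(D, N):
    # Partitioner P: deal training instances into N partitions round-robin.
    parts = [[] for _ in range(N)]
    for i, instance in enumerate(D):
        parts[i % N].append(instance)
    return parts

def train_and_vote(F, part, T):
    # Run one copy of classifier F on a partition; vote on every test instance.
    model = F(part)
    return [model(t) for t in T]

def combine(all_votes):
    # Result R: majority vote across the N classifiers for each test instance.
    return [Counter(v).most_common(1)[0][0] for v in zip(*all_votes)]

def majority_classifier(part):
    # Toy classifier: always predicts its partition's most common label.
    label = Counter(lbl for _, lbl in part).most_common(1)[0][0]
    return lambda t: label

D = [(x, "pos" if x > 0 else "neg") for x in range(-50, 50)]
T = [1, 2, -3]
votes = [train_and_vote(majority_classifier, p, T) for p in partition(D, 4)]
R = combine(votes)  # one combined label per test instance
```

Because each of the N copies of F sees only its own partition, the partition step is the only part that touches all of D, and the vote matrix is all that must be collected centrally.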
5
Challenges in Distributed DM
When dealing with large amounts of data (MB to GB to TB), there are systems problems in addition to data mining problems.
Why should data miners have to be distributed systems experts too?
Scalable (in terms of data size and number of resources) distributed data mining architectures tend to be finely tailored to an application and algorithm.
6
Proposed Solution
An abstraction framework for distributed data mining. An abstraction allows users to declare a distributed workload based only on what they already know (sequential programs, data).

Why an abstraction? Abstractions hide many complexities from users. Unlike a specially tailored implementation, a conceptual abstraction provides a general-purpose solution to a problem, one that may be implemented in any of several ways depending on requirements.
7
Clusters versus Cloud Computers
| Clusters | Cloud Computers |
| --- | --- |
| Small (4-16 nodes) to very large | Large (~500 CPUs, ~300 disks at ND) |
| Shared filesystem, often centralized | Individual disks rather than a central FS |
| Dedicated resources, often assigned in large blocks | Resources assigned dynamically, with no guarantee of dedicated access |
| Often static and generally homogeneous | Commodity, dynamic, and heterogeneous |
| Managed by batch or grid engine | Managed by batch or grid engine |
8
Implementing the Abstraction
There are several factors to consider:
How many nodes to use for computation?
How many nodes to use for data?
How to connect the data and computation nodes?
9
Streaming
Each process is connected via a data stream.
Data exists only in buffers in memory, and stream writers block until stream readers have consumed the buffer.
Requires full end-to-end parallelism to complete: every process in the chain must be running at once.
Not robust to failure.
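The blocking-buffer behavior can be sketched with a bounded in-memory queue standing in for the data stream (buffer size and record values are illustrative):

```python
import queue
import threading

# A bounded queue models the in-memory stream buffer: put() blocks
# until the reader has drained space, mirroring the blocking writers
# described above.
stream = queue.Queue(maxsize=2)
consumed = []

def writer():
    for record in ["a", "b", "c", "d"]:
        stream.put(record)   # blocks while the buffer is full
    stream.put(None)         # end-of-stream marker

def reader():
    while True:
        record = stream.get()
        if record is None:
            break
        consumed.append(record)

threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If either end stops, the other blocks forever, which is the sketch-level analogue of why streaming needs every process alive simultaneously and is not robust to failure.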
10
Pull
Partitioning is done ahead of computation and partitions are stored on the source node.
Computation jobs pull in the proper partition from the source node.
Flexible and robust to failure, but not scalable to a large number of computation nodes.
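A minimal sketch of Pull, with dicts standing in for the source node's disk and `fetch_partition` standing in for the real network transfer (all names here are hypothetical):

```python
# The source node holds every pre-made partition; each computation job
# "pulls" only the partition it was assigned at job start.
source_node = {i: [f"instance-{i}-{j}" for j in range(3)] for i in range(4)}

def fetch_partition(source, part_id):
    # Stand-in for a remote transfer from the source node.
    return list(source[part_id])

def run_job(part_id):
    data = fetch_partition(source_node, part_id)
    return (part_id, len(data))

# Jobs may run anywhere and in any order; a failed job is simply re-run,
# which is what makes Pull flexible and robust. The single source node,
# however, becomes the bottleneck as computation nodes are added.
results = sorted(run_job(i) for i in range(4))
```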
11
Pull
(Diagram: partitions P1-P4 stored on the .data source node; computation nodes matched by the Condor Matchmaker pull their partitions from it.)
12
Push
Work assignments are done ahead of partitioning and partitioning distributes data to where it will be used.
Data are accessed locally where possible, or accessed in-place remotely.
This improves scalability to larger numbers of computation nodes, but can decrease flexibility and increase reliance on unreliable nodes.
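Push can be sketched the same way, with each node's disk modeled as a dict and the assignment table fixed before partitioning begins (all names hypothetical):

```python
# Work assignments are fixed first; the partitioner then pushes each
# partition onto the disk of the node that will use it, so jobs read
# their data locally.
nodes = {f"node{i}": {} for i in range(4)}
assignment = {part_id: f"node{part_id}" for part_id in range(4)}

def push_partitions(D, N):
    # Partitioning doubles as distribution: each instance lands directly
    # on the node assigned to its partition.
    for i, instance in enumerate(D):
        part_id = i % N
        nodes[assignment[part_id]].setdefault(part_id, []).append(instance)

def run_job(part_id):
    # Local read at job time; the trade-off is that the job now depends
    # on the one (possibly unreliable) node holding its partition.
    return len(nodes[assignment[part_id]][part_id])

push_partitions(list(range(12)), 4)
counts = [run_job(i) for i in range(4)]
```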
13
Push
(Diagram: the partitioner pushes partitions P1-P4 from the .data source directly onto the computation nodes matched by the Condor Matchmaker.)
14
Hybrid
Push to a well-known set of intermediate nodes.
Pull from those nodes. This combines the
advantages of Pull (flexibility, reliability) and Push (I/O performance)
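A sketch of the Hybrid scheme, with two hypothetical intermediate servers and a round-robin placement policy (both are illustrative assumptions):

```python
# Push every partition to a small, well-known set of intermediate
# servers, then let computation jobs pull from whichever server holds
# their partition. Multiple servers avoid Pull's single bottleneck;
# keeping the set small limits reliance on unreliable nodes.
servers = {"s0": {}, "s1": {}}

def push_to_servers(partitions):
    # Spread partitions across the intermediate servers round-robin.
    for part_id, data in enumerate(partitions):
        server = "s0" if part_id % 2 == 0 else "s1"
        servers[server][part_id] = data

def pull(part_id):
    # A job pulls its partition from the server that holds it.
    for disk in servers.values():
        if part_id in disk:
            return disk[part_id]
    raise KeyError(part_id)

push_to_servers([["a"], ["b", "c"], ["d"], ["e"]])
pulled = [pull(i) for i in range(4)]
```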
15
Hybrid
(Diagram: partitions P1-P4 pushed from the .data source to intermediate servers; computation nodes matched by the Condor Matchmaker pull from those servers.)
16
Implementing the Abstraction
The effectiveness of these possibilities hinges on the flexibility, reliability, and performance of their components.
An example of such a component is the partitioning algorithm.
18
Partitioning Algorithms
Shuffle: one instance at a time is taken from the training data and copied into a partition.
Chop: one partition at a time is built, copying all of its instances from the training data.
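Under one plausible reading of the two strategies (Shuffle dealing instances round-robin so all N outputs are open at once, Chop filling contiguous blocks one at a time; the exact layout is an assumption, since the slides do not pin it down), they might look like:

```python
def shuffle(D, N):
    # Shuffle: one pass over the training data, dealing each instance to
    # the next partition in turn; all N outputs are open simultaneously.
    parts = [[] for _ in range(N)]
    for i, instance in enumerate(D):
        parts[i % N].append(instance)
    return parts

def chop(D, N):
    # Chop: build one partition at a time from a contiguous block of the
    # training data; only one output is open at any moment.
    size = (len(D) + N - 1) // N
    return [D[k * size:(k + 1) * size] for k in range(N)]

D = list("ABCDEFGHIJKL")
parts_shuffle = shuffle(D, 3)
parts_chop = chop(D, 3)
```

The open-file behavior, not the arithmetic, is what matters: Shuffle streams once over D but holds N destinations open, while Chop holds one destination open at a time.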
19
Shuffle
(Diagram: training instances A-L copied one instance at a time into the partitions.)
20
Chop
(Diagram: training instances A-L copied one partition at a time.)
21
(Graph: partitioning a 5.4 GB file. "Local" runs use fgets/fprintf; "R16s" remote runs use fgets with chirp_stream_write within the sc0 cluster.)
24
Partitioning Conclusions
Remote partitioning is faster, but less reliable, than local partitioning
Shuffle is slower locally and to a small number of remote hosts but scales better to a large number of remote hosts
Shuffle is less robust than Chop for large data sets
25
Evaluating the Architectures
Evaluation is based on performance and scalability. Classifier algorithms were decision trees, K-nearest neighbors, and support vector machines.
26
Protein Data Set (3.3M instances, 170MB), Using Decision Trees
27
KDDCup Data Set (4.9M instances, 700MB), Using Decision Trees
28
Alpha Data Set (400K instances, 1.8GB), Using KNN
29
System Architectures

Push: fastest (remote partitioning, mostly local data access). Requires 1-to-1 matching of jobs to nodes, or a heavy preference; pure 1-to-1 matching is possible but more fragile.
Pull: slowest (local partitioning, data transfer at job start), but most robust (central data; "any" host can run jobs).
Hybrid: push to a subset of nodes, then pull. Faster than Pull (remote partitioning, multiple servers) and more robust than Push (only a small set of servers must stay up).
30
Future Work
Performance vs. accuracy for long-tail jobs: is there a viable tradeoff between turnaround time and degraded classification accuracy?
Efficient data management on multicores.
Hierarchical abstraction framework: submit jobs to clouds of subnets of multicores.
31
Conclusions
The Hybrid method is amenable both to cluster-like environments and to larger, more diverse clouds, and its use of intermediate data servers mitigates some of Shuffle's problems.
A fundamental limit of scalability is the available memory on each workstation. For our largest sets, even 16 nodes were not sufficient to run effectively.
32
Questions?
Data Analysis and Inference Laboratory
Karsten Steinhaeuser ([email protected])
Nitesh V. Chawla ([email protected])

Cooperative Computing Laboratory
Christopher Moretti ([email protected])
Douglas Thain ([email protected])
Acknowledgements: NSF CNS-06-43229, CCF-06-21434, CNS-07-20813