SOFTWARE SYSTEMS DEVELOPMENT
Map-Reduce, Hadoop, HBase
The problem
Batch (offline) processing of huge data sets using commodity hardware
Linear scalability
Need infrastructure that handles all the mechanics, so the developer can focus on the processing logic/algorithms
Data Sets
The New York Stock Exchange: 1 Terabyte of data per day
Facebook: 100 billion photos, 1 Petabyte (1,000 Terabytes)
Internet Archive: 2 Petabytes of data, growing by 20 Terabytes per month
Data this large can't be put on a single node; a distributed file system is needed to hold it
Batch processing
Single write/append, multiple reads (e.g. analyzing log files for the most frequent URL)
Each data entry is self-contained
At each step, each data entry can be treated individually
After aggregation, each aggregated data set can be treated individually
Grid Computing
Cluster of processing nodes attached to shared storage through fiber (typically a Storage Area Network)
Works well for computation-intensive tasks; with huge data sets the network becomes a bottleneck
Programming paradigm: low-level Message Passing Interface (MPI)
Hadoop
Open-source implementation of 2 key ideas: HDFS (Hadoop Distributed File System) and Map-Reduce (a programming model)
Built on Google's infrastructure designs (the GFS and Map-Reduce papers, published 2003/2004)
Java/Python/C interfaces; several projects are built on top of it
Approach
A limited but simple model that fits a broad range of applications
Communication, redundancy, and scheduling are handled by the infrastructure
Move computation to the data instead of moving data to the computation
Who is using Hadoop?
Distributed File System (HDFS)
Files are split into large blocks (128 MB or 64 MB); compare with a typical FS block of 512 bytes
Blocks are replicated among Data Nodes (DN): 3 copies by default
The Name Node (NN) keeps track of files and their pieces; it is a single master node
Stream-based I/O, sequential access
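The block and replication numbers above translate directly into storage math. A back-of-the-envelope sketch (plain Python, illustrative numbers only) of how many blocks and replicas a file occupies under common HDFS defaults:

```python
import math

def hdfs_blocks(file_size_mb, block_mb=128, replication=3):
    """Return (number of blocks, total replicas stored) for a file."""
    blocks = math.ceil(file_size_mb / block_mb)
    return blocks, blocks * replication

# A 1 GB file with 128 MB blocks and 3-way replication:
# 8 blocks, 24 block replicas stored across the Data Nodes.
print(hdfs_blocks(1024))
```

Large blocks keep the Name Node's per-file metadata small and favor the long sequential reads that batch jobs perform.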
HDFS: File Read
HDFS: File Write
HDFS: Data Node Distance
Map Reduce
A Programming Model
Decomposes a processing job into Map and Reduce stages
Developers need only provide code for the Map and Reduce functions and configure the job; Hadoop handles the rest
Map-Reduce Model
MAP function
Map each data entry into a <key, value> pair
Examples:
Map each log file entry into <URL, 1>
Map a day's stock trading record into <Stock, Price>
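The first example above can be sketched as a plain Python function (an illustration of the Map idea, not the Hadoop Mapper API; the log-line format here is an assumption):

```python
def map_log_entry(line):
    """Map one log line ("<date> <url> <status>") to a <URL, 1> pair."""
    url = line.split()[1]
    return (url, 1)

pairs = [map_log_entry(line) for line in [
    "2009-01-01 /index.html 200",
    "2009-01-01 /about.html 200",
    "2009-01-02 /index.html 404",
]]
# pairs == [("/index.html", 1), ("/about.html", 1), ("/index.html", 1)]
```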
Hadoop: Shuffle/Merge phase
Hadoop merges (shuffles) the output of the Map stage into <key, value1, value2, value3, ...>
Examples:
<URL, 1, 1, 1, 1, 1, 1>
<Stock, price on day 1, price on day 2, ...>
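In plain Python the shuffle/merge step is essentially a group-by-key over the Map output (a sketch of the idea, not Hadoop's actual distributed implementation):

```python
from collections import defaultdict

def shuffle(pairs):
    """Group <key, value> pairs into <key, [value1, value2, ...]>."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

# shuffle([("/index.html", 1), ("/about.html", 1), ("/index.html", 1)])
# groups the two /index.html entries under one key: {"/index.html": [1, 1], "/about.html": [1]}
```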
Reduce function
Reduce the entries produced by Hadoop's merge processing into a <key, value> pair
Examples:
Reduce <URL, 1, 1, 1> into <URL, 3>
Reduce <Stock, 3, 2, 10> into <Stock, 10>
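Both Reduce examples above are one-liners in plain Python (a sketch, not the Hadoop Reducer API):

```python
def reduce_url_count(url, counts):
    """<URL, 1, 1, 1> -> <URL, 3>: sum the occurrence counts."""
    return (url, sum(counts))

def reduce_stock_max(stock, values):
    """<Stock, 3, 2, 10> -> <Stock, 10>: keep the maximum value."""
    return (stock, max(values))
```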
Map-Reduce Flow
Hadoop Infrastructure
Replicate/distribute data among the nodes: input, output, Map/shuffle output
Schedule processing: partition data, assign processing nodes (PN)
Move code to the PN (e.g. send the Map/Reduce code)
Manage failures (block CRC, rerun Map/Reduce if necessary)
Example: Trading Data Processing
Input: historical stock data
Records are in a CSV (comma-separated values) text file
Each line: stock_symbol, low_price, high_price
1987-2009 data for all stocks, one record per stock per day
Output: maximum intraday delta (high_price − low_price) for each stock
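The whole job can be sketched end-to-end in plain Python: map each CSV record to <stock, high − low>, shuffle by stock symbol, then reduce with max. The sample records are made up for illustration; the real job would run as a Hadoop Map/Reduce program over HDFS:

```python
from collections import defaultdict

def max_delta_job(lines):
    """Simulate the trading job: max (high - low) per stock symbol."""
    grouped = defaultdict(list)
    for line in lines:                      # map + shuffle
        symbol, low, high = line.split(",")
        grouped[symbol].append(float(high) - float(low))
    return {s: max(deltas) for s, deltas in grouped.items()}   # reduce

records = [
    "IBM,10.0,12.5",
    "IBM,11.0,11.5",
    "GOOG,300.0,310.0",
]
# max_delta_job(records) -> {"IBM": 2.5, "GOOG": 10.0}
```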
Map Function: Part I
Map Function: Part II
Reduce Function
Running the Job: Part I
Running the Job: Part II
Inside Hadoop
Datastore: HBASE
A distributed, column-oriented database on top of HDFS
Modeled after Google's BigTable data store
Random reads/writes on top of the sequential, stream-oriented HDFS
Billions of rows × millions of columns × thousands of versions
HBASE: Logical View
| Row Key | Time Stamp | Column "contents:" | Column Family "anchor:" | Column "mime:" |
|---|---|---|---|---|
| "com.cnn.www" | t9 | | anchor:cnnsi.com = "cnn.com/1" | |
| "com.cnn.www" | t8 | | anchor:my.look.ca = "cnn.com/2" | |
| "com.cnn.www" | t6 | "<html>.." | | text/html |
| "com.cnn.www" | t5 | "<html>.." | | |
| "com.cnn.www" | t3 | "<html>.." | | |
Physical View
Column family "contents:":

| Row Key | Time Stamp | Column "contents:" |
|---|---|---|
| "com.cnn.www" | t6 | "<html>.." |
| "com.cnn.www" | t5 | "<html>.." |
| "com.cnn.www" | t3 | "<html>.." |

Column family "anchor:":

| Row Key | Time Stamp | Column Family "anchor:" |
|---|---|---|
| "com.cnn.www" | t9 | anchor:cnnsi.com = "cnn.com/1" |
| "com.cnn.www" | t8 | anchor:my.look.ca = "cnn.com/2" |

Column family "mime:":

| Row Key | Time Stamp | Column "mime:" |
|---|---|---|
| "com.cnn.www" | t6 | text/html |
HBASE: Region Servers
Tables are split into horizontal regions; each region comprises a subset of rows
The master/worker pattern recurs across the stack:
HDFS: Name Node, Data Node
MapReduce: JobTracker, TaskTracker
HBase: Master Server, Region Server
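Region assignment can be sketched in a few lines: because rows are kept sorted by row key, each region owns a contiguous key range, and a lookup is just a comparison against the split points (a toy model; the split points below are made up for illustration, and real HBase routes through region metadata):

```python
def region_for(row_key, split_points):
    """Return the index of the region whose key range contains row_key.

    split_points must be sorted; region i covers [split_points[i-1], split_points[i]).
    """
    for i, split in enumerate(split_points):
        if row_key < split:
            return i
    return len(split_points)

splits = ["g", "p"]  # region 0: keys < "g", region 1: ["g", "p"), region 2: keys >= "p"
# region_for("com.cnn.www", splits) -> 0
```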
HBASE Architecture
HBASE vs RDBMS
HBase tables are similar to RDBMS tables, with some differences:
Rows are sorted by a row key
Only cells are versioned
Columns can be added on the fly by the client, as long as the column family they belong to pre-exists
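These differences follow from HBase's logical layout, which can be modeled as nested maps: row key → "family:qualifier" → {timestamp: value}. The toy model below (plain Python, not the HBase API) shows why qualifiers are free-form while families must pre-exist, and why every cell carries its own versions:

```python
table = {}
FAMILIES = ("contents", "anchor", "mime")   # fixed at table-creation time

def put(row, column, ts, value):
    """Store one versioned cell; the qualifier part of `column` is arbitrary."""
    family = column.split(":")[0]
    assert family in FAMILIES, "column family must pre-exist"
    table.setdefault(row, {}).setdefault(column, {})[ts] = value

put("com.cnn.www", "anchor:cnnsi.com", 9, "cnn.com/1")  # new qualifier, no schema change
put("com.cnn.www", "contents:", 6, "<html>..")
put("com.cnn.www", "mime:", 6, "text/html")

# A read returns the cell version with the highest timestamp by default.
versions = table["com.cnn.www"]["contents:"]
latest = versions[max(versions)]
```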