Transcript of SELF-MANAGING TECHNIQUES FOR STORAGE RESOURCE MANAGEMENT (lass.cs.umass.edu/theses/vijay.pdf)
SELF-MANAGING TECHNIQUES FOR STORAGE RESOURCE MANAGEMENT
A Dissertation Presented
by
VIJAY SUNDARAM
Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment
of the requirements for the degree of
DOCTOR OF PHILOSOPHY
February 2006
Computer Science
© Copyright by Vijay Sundaram 2006
All Rights Reserved
SELF-MANAGING TECHNIQUES FOR STORAGE RESOURCE MANAGEMENT
A Dissertation Presented
by
VIJAY SUNDARAM
Approved as to style and content by:
Prashant Shenoy, Chair
Mark Corner, Member
C. Mani Krishna, Member
James Kurose, Member
W. Bruce Croft, Department Chair
Computer Science
Good design comes from experience. Experience comes from bad design.
ACKNOWLEDGMENTS
The years spent in Amherst doing my PhD have been a fulfilling and illuminating experience. Many people have contributed in significant ways and helped me see this through. First and foremost I would like to thank my advisor Professor Prashant Shenoy for his expert guidance over the years. I would like to thank the members of my thesis committee: Professors Mark Corner, Mani Krishna and Jim Kurose. I would also like to thank Dr. Sumit Roy and Dr. Pawan Goyal for collaborating with me in my research and for their excellent mentoring during my internships at HP Labs and IBM Almaden. Also, thanks are due to Sumit for helping me with valuable career advice.

Tyler Trafford has been most helpful with configuring the Linux cluster and the storage testbed in the context of my research. My heartfelt thanks go to Sharon Mallory, Pauline Hollister, Betty Hardy and Karren Sacco who made things simpler by helping eagerly with various administrative issues.
My friends, fellow students and colleagues have played a significant role in this journey. I would like to thank Atul Maharshi and Upendra Sharma, friends from IIT, who have kept me laughing over the years. I would like to thank Ramesh Nallapatti for his eager help whenever I was in need. Abhishek Chandra and Bhuvan Urgaonkar have been great friends and labmates. Purushottam Kulkarni and Peter Desnoyers have been very helpful with practice talks and comments on paper drafts. In Rahul Gupta, Subhrangshu Nandi and Pranesh Venugopal I found great housemates who made my stay in Amherst a pleasant one. My sincere thanks to my friends Harpal Singh Bassalli, Swati Birla, Yu Gu, Kishore Indukuri, Pallika Kanani, Anoop George Ninan, Hema Raghavan and Aparna Vemuri. Atul Sheel and Rashmi Sheel provided me with a home away from home in Amherst.
The constant encouragement, belief and support of my parents, Col. M.M. Sundaram and Harsha Sundaram, and my brother Ajay Sundaram, have been instrumental in my achievements. Last but not least, I thank my wife Kavita Jaswal for her support and confidence, egging me to go on and never cave in, no matter what.
ABSTRACT
SELF-MANAGING TECHNIQUES FOR STORAGE RESOURCE MANAGEMENT
FEBRUARY 2006
VIJAY SUNDARAM
B.Tech., INDIAN INSTITUTE OF TECHNOLOGY BOMBAY
M.S., UNIVERSITY OF MASSACHUSETTS AMHERST
Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST
Directed by: Professor Prashant Shenoy
The increasing reliance on online information in our daily lives has called for a rethinking of how people manage and maintain computer systems. As information has become more valuable and computing environments more complex, improved manageability has become key to ensuring availability. The sheer size of enterprise-scale storage systems, coupled with the diversity and variability of application workloads, makes their management non-trivial. Not surprisingly, numerous studies have shown that management costs have become a significant fraction of the total cost of ownership of large storage systems. Traditionally, storage management tasks have been performed manually by administrators using a combination of experience, rules of thumb and trial and error. This increases the chance of a misconfigured or sub-optimally configured system. The cost of such misconfigurations can be high since even a short downtime can result in substantial revenue losses.
So, although storage is cheap, storage management is costly and storage mismanagement costlier still. This argues for an automated, seamless and intelligent way to manage the storage resource.
In this thesis, I propose self-managing techniques, specifically for resource management, to improve the manageability of large-scale storage systems. I have focused on techniques for automating two common storage allocation tasks: storage bandwidth allocation and storage space allocation. Large-scale storage systems host data objects of multiple types which are accessed by applications with diverse service requirements. I have developed an online measurement-based technique, as well as one based on learning, to dynamically partition bandwidth between application classes. Storage allocation algorithms that determine object placement, and thus performance, are crucial to the success of a storage system. For a self-managing storage system, a suitable placement technique is one that has low management overhead and delivers agreeable performance. In this context, I empirically compare different placement techniques to determine their suitability for large-scale storage systems. Finally, I also present techniques to minimize the amount of data displaced when remapping objects to eliminate hotspots.
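The idea of measurement-based bandwidth partitioning can be illustrated with a small sketch. The function below is a hypothetical illustration, not the dissertation's actual algorithm: it splits a disk's bandwidth between two application classes in proportion to their recently measured demand, while guaranteeing each class a minimum reservation. All names and parameters here are assumptions made for the example.

```python
def partition_bandwidth(demand_a, demand_b, total=1.0, min_share=0.1):
    """Return (share_a, share_b) summing to `total`.

    demand_a, demand_b: recently measured bandwidth demand of each class,
    e.g., the fraction of disk utilization attributable to that class.
    min_share: minimum reservation guaranteed to each class.
    """
    if demand_a + demand_b == 0:
        # No measured load: split the bandwidth evenly.
        return total / 2, total / 2
    # Allocate in proportion to measured demand.
    share_a = total * demand_a / (demand_a + demand_b)
    # Clamp so neither class falls below its reservation.
    share_a = min(max(share_a, min_share), total - min_share)
    return share_a, total - share_a
```

With measured demands of 0.6 and 0.2, for instance, the classes receive 75% and 25% of the bandwidth; the clamp keeps a temporarily idle class from being starved of its reservation when load later returns.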
TABLE OF CONTENTS
ACKNOWLEDGMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES

CHAPTER

1. INTRODUCTION
   1.1 Motivation
   1.2 Automating Storage Resource Management
   1.3 Thesis Contributions
       1.3.1 Initial Storage System Configuration: Placement Techniques in a Self-managing Storage System
       1.3.2 Short-term Storage System Reconfiguration: Bandwidth Allocation
       1.3.3 Long-term Storage System Reconfiguration: Automated Object Remapping
   1.4 Dissertation Outline
2. PLACEMENT TECHNIQUES IN A SELF-MANAGING STORAGE SYSTEM
   2.1 Background and Problem Description
   2.2 Experimental Evaluation
       2.2.1 Experimental Methodology
       2.2.2 Ideal Narrow Striping versus Wide Striping
           2.2.2.1 Comparison using Homogeneous Workloads
           2.2.2.2 Comparison using Heterogeneous Workloads
           2.2.2.3 Summary
       2.2.3 Impact of Inter-Stream Interference
       2.2.4 Impact of Load Skews: Trace-driven Simulations
       2.2.5 Experiments on a Storage System Testbed
           2.2.5.1 Synthetic Workload
           2.2.5.2 TPC-H Workload
           2.2.5.3 TPC-C Workload
   2.3 Summary and Implications of our Experimental Results
   2.4 Related Work
   2.5 Concluding Remarks
3. SELF-MANAGING BANDWIDTH ALLOCATION IN A MULTIMEDIA FILE SERVER
   3.1 Self-Managing Bandwidth Allocation: Problem Definition
   3.2 Self-Managing Bandwidth Allocation in a Single Disk Server
       3.2.1 System Model
       3.2.2 Requirements
       3.2.3 Monitoring the Workload in the Two Classes
       3.2.4 Adapting the Allocation of Each Class
           3.2.4.1 Estimating Bandwidth Requirement based on Disk Utilizations
           3.2.4.2 Estimating Bandwidth Requirement based on the Arrival Rate
           3.2.4.3 Computing the Reservations of Each Class
   3.3 Self-Managing Bandwidth Allocation in a Multi-disk Server
   3.4 Experimental Methodology
       3.4.1 Simulation Environment
       3.4.2 Workload Characteristics
           3.4.2.1 Best-effort Text Clients
           3.4.2.2 Soft Real-time Video Clients
   3.5 Experimental Evaluation
       3.5.1 Ability to Adapt to Changing Workloads
       3.5.2 Bandwidth Allocation in a Single-disk Server
       3.5.3 Bandwidth Allocation in a Multi-disk Server
       3.5.4 Impact of Tunable Parameters
       3.5.5 Comparison with Static Allocation
   3.6 Related Work
   3.7 Concluding Remarks
4. LEARNING-BASED APPROACH FOR DYNAMIC BANDWIDTH ALLOCATION
   4.1 Problem Definition
       4.1.1 Background and System Model
       4.1.2 Key Requirements
       4.1.3 Problem Formulation
   4.2 A Learning-based Approach
       4.2.1 Reinforcement Learning Background
       4.2.2 System State
       4.2.3 Allocation Space
       4.2.4 Cost and State Action Values
       4.2.5 A Simple Learning-based Approach
       4.2.6 An Enhanced Learning-based Approach
   4.3 Implementation in Linux
   4.4 Experimental Evaluation
       4.4.1 Simulation Methodology and Workload
       4.4.2 Effectiveness of Dynamic Bandwidth Allocation
       4.4.3 Comparison with Alternative Approaches
       4.4.4 Effect of Tunable Parameters
       4.4.5 Implementation Experiments
       4.4.6 Implementation Overheads
   4.5 Related Work
   4.6 Concluding Remarks
5. AUTOMATED OBJECT REMAPPING FOR LOAD BALANCING LARGE SCALE STORAGE SYSTEMS
   5.1 Introduction
       5.1.1 Motivation
       5.1.2 Research Contributions
   5.2 Problem Definition
       5.2.1 System Model
       5.2.2 Problem Formulation
   5.3 Object Remapping Techniques
       5.3.1 Cost Oblivious Object Remapping
           5.3.1.1 Randomized Packing
           5.3.1.2 BSR-based Approach
       5.3.2 Cost-aware Object Remapping
           5.3.2.1 Randomized Object Reassignment
           5.3.2.2 Displace and Swap
   5.4 Measuring Bandwidth Requirements and Detecting Hotspots
   5.5 Implementation Considerations
   5.6 Experimental Evaluation
       5.6.1 Simulation Results
           5.6.1.1 Impact of System Size
           5.6.1.2 Impact of System Bandwidth Utilization
           5.6.1.3 Impact of System Space Utilization
           5.6.1.4 Impact of Optimizations
       5.6.2 Prototype Evaluation
           5.6.2.1 Uniform Object Size
           5.6.2.2 Variable Object Size
           5.6.2.3 Implementation Overheads
       5.6.3 Summary of Experimental Results
   5.7 Related Work
   5.8 Concluding Remarks
6. SUMMARY AND FUTURE WORK
   6.1 Thesis Contributions
       6.1.1 Initial Storage System Configuration: Placement Techniques in a Self-managing Storage System
       6.1.2 Short-term Storage System Reconfiguration: Bandwidth Allocation
       6.1.3 Long-term Storage System Reconfiguration: Automated Object Remapping
   6.2 Future Work

APPENDIX: COMPARISON USING HOMOGENEOUS WORKLOADS
BIBLIOGRAPHY
LIST OF TABLES

2.1 Characteristics of the Fujitsu Disk
2.2 Summary of the Traces. IOPS denotes the number of I/O operations per second.
2.3 TPC-C and Sequential Workload Throughput in Narrow and Wide Striping
3.1 Characteristics of the Auspex NFS trace
3.2 Characteristics of Video traces
LIST OF FIGURES

2.1 Narrow and wide striping in an enterprise storage system.
2.2 Effect of system size for homogeneous closed-loop workloads. System size of 1 depicts narrow striping.
2.3 Homogeneous Workload: Closed-loop Testbed Experiments
2.4 Effect of system size for heterogeneous Poisson workloads. System size of 1 depicts narrow striping.
2.5 Effect of varying the stripe unit size of large requests. System size of 1 depicts narrow striping.
2.6 Effect of the inter-arrival times of large requests. System size of 1 depicts narrow striping.
2.7 Effect of inter-arrival times of small requests. System size of 1 depicts narrow striping.
2.8 Effect of request size of large requests. System size of 1 depicts narrow striping.
2.9 Effect of percentage of large write requests. System size of 1 depicts narrow striping.
2.10 Effect of percentage of small write requests. System size of 1 depicts narrow striping.
2.11 Impact of inter-stream interference. System size of 1 depicts narrow striping.
2.12 Trace Driven Simulations
2.13 Trace Driven Simulations with Load Imbalance
2.14 Heterogeneous Workload: Closed-loop Testbed Experiments
2.15 Comparison using the TPC-H Benchmark
3.1 Three techniques for supporting multiple application classes at a file server.
3.2 A Moving Histogram
3.3 Parameters tracked by the monitoring module
3.4 Bursty nature of the NFS trace workload.
3.5 Adaptive allocation of disk bandwidth
3.6 Bandwidth allocation in a single-disk server.
3.7 Bandwidth allocation in a multi-disk server.
3.8 Effect of various tunable parameters on the granularity of bandwidth allocations.
3.9 Comparison with Static Partitioning
4.1 Relationship between application classes, logical volumes and logical units.
4.2 Discretizing the State Space.
4.3 Steps involved in learning
4.4 Algorithm flowchart
4.5 Behavior of the learning-based dynamic bandwidth allocation technique.
4.6 Comparison with Alternative Approaches
4.7 Impact of Tunable Parameters
4.8 Results from our prototype implementation.
4.9 Memory overheads of the bandwidth allocator.
5.1 System model.
5.2 Illustration of Displace and Swap.
5.3 Impact of system size.
5.4 Impact of bandwidth utilization.
5.5 Impact of space utilization.
5.6 Impact of optimizations.
5.7 Uniform object size
5.8 Variable object size; no spare storage space
5.9 Impact on application performance.
A.1 Homogeneous Workload: Effect of System Size
A.2 Homogeneous Workload: Effect of Stripe-unit Size
A.3 Homogeneous Workload: Effect of Utilization Level
A.4 Homogeneous Workload: Effect of Request Size
A.5 Homogeneous Workload: Effect of Percentage of Writes
CHAPTER 1
INTRODUCTION
1.1 Motivation
Enterprise-scale storage systems are complex systems consisting of tens or hundreds of storage devices. Due to the sheer size of these systems, coupled with the complexity of the application workloads that access them, storage systems are becoming increasingly difficult to design, configure and manage. Storage system management comprises a slew of administration tasks ranging from how much and what storage to buy, to how to map storage objects to disk arrays. Moreover, reconfiguration and tuning are required on a continual basis to deal with changes in workload or incremental growth. Not surprisingly, numerous studies have shown that management costs far outstrip equipment costs and have become the dominant fraction (75-90%) of the total cost of ownership of large computing systems [46, 6, 36]. Overprovisioning in such large-scale systems to alleviate the management complexity can be expensive and may not pay off even in a well-configured system, especially since diverse workloads and changing requirements undermine the notion of a single flawless configuration.
Traditionally, storage management tasks have been performed manually by administrators who use a combination of experience, rules of thumb and trial and error. This increases the chances of a misconfigured or sub-optimally configured system. In an age where information is increasingly available online, the cost of such misconfigurations can be high since even a short downtime can result in substantial revenue losses. So, although storage is cheap, storage management is costly and storage mismanagement costlier still. In fact, it has been argued that the problems of maintainability, availability and growth of computing
systems have overshadowed those of performance and that the traditional focus on performance is less important in today's environments [18, 30]. These arguments motivate the need for an automated, seamless and intelligent way to manage storage resources.
Although high-level planning decisions do require human involvement, tasks such as storage resource allocation are amenable to software automation, akin to a self-managing system that executes important operations without the need for human intervention. The primary research challenge is to ensure that the system provides performance comparable to a human-managed system, but at a lower cost.
How often a management task needs to be instantiated depends on a number of factors:

• The specifics of the task. What is the inherent nature of the task?

• The initial configuration. Is the system ill-configured or well-configured?

• Changing workload patterns. What is the time scale over which the workload changes?
Whereas some management tasks require attention in the short term, say over a period of hours to days, others need to be dealt with only over longer time periods, ranging from months to years. For example, backups could be carried out on a daily basis for critical data and may be required only once a week for less important data. Adding new and faster storage devices to the storage system may be required less often, over periods of months to years. So, the same management task may require attention over multiple time scales.
The initial configuration of the system may also play a role in how often administration is required. If the system has been configured with an eye to future growth and the workload requirements, as is necessary, it may ease the task of the administrator. However, an ill-configured system, where, for example, logical volumes run out of space at frequent intervals, or heavily accessed logical volumes have been collocated on a storage device resulting in hotspots and performance degradation, may require frequent reconfiguration.
Changing workloads may also effect an immediate reconfiguration. For example, if we see a sudden increase in workload for a class of applications that does not have sufficient bandwidth resources to absorb the burst, the system resources may need to be reallocated; the time period of this task again depends on the burstiness of the workload.
Finally, an interplay of these factors may also guide the time period of the management task. For example, changing workload patterns in an ill-configured system versus a well-configured system may require widely different amounts of reconfiguration effort.
A well-designed self-managing technique should take into account all of these factors,
together with the associated anomalies, and trigger the requisite reconfiguration as and
when necessary.
1.2 Automating Storage Resource Management
In the previous section, we argued that automating storage management is crucial and that a multitude of factors makes the task challenging. The impetus for automating storage management came from [15, 8]. Autonomic computing is another term often used to refer to the notion of self-managing computing systems. Such a system is self-configuring, self-optimizing, self-healing and self-securing. In the context of computing systems, the high-level goal is to improve system management and reliability. The eventual goal is a system that does not need anyone to manage and maintain it once it has been installed. In this thesis, we focus on the storage management component of autonomic computing.
Storage management tasks can broadly be classified into three categories: initial configuration, short-term reconfiguration and long-term reconfiguration. Initial configuration refers to tasks performed when the storage system is first set up; tasks such as object placement, RAID-level tagging, and configuring the network connectivity of the storage system fall into this category. Short-term reconfiguration refers to tasks that require attention on a continual basis, such as bandwidth allocation between application classes and extending logical
volumes, etc. Finally, long-term reconfiguration refers to tasks that need to be invoked when short-term reconfiguration is insufficient to ensure acceptable storage system performance. These include migrating the system to new devices, data migration to remove long-term workload hotspots, etc.
In this thesis, we consider problems in each category. In particular, the problems addressed are from the perspective of resource management. Resource management in a storage system aims at ensuring that the storage system delivers agreeable performance and that storage resources are used efficiently. The storage resource comprises two components: storage space and storage bandwidth. We have developed techniques for automating the allocation of both resources.¹
1.3 Thesis Contributions
In this section, we elaborate on the contributions of the thesis and discuss the challenges involved in automating storage resource management. We classify these contributions based on the time scale of the management task.
1.3.1 Initial Storage System Configuration: Placement Techniques in a Self-managing
Storage System
The first step in storage management is deciding on a mapping of storage objects to disk
arrays. Object placement decisions are integral in determining application performance and
thus are crucial to the success of a storage system. For a self-managing storage system, a suitable placement technique is one that has low management overhead and delivers agreeable performance.
Object placement techniques are based on striping—a technique that interleaves the placement of objects onto disks—and can be classified into two different categories: narrow and wide striping. From the perspective of management complexity, these two techniques have fundamentally different implications. Whereas wide striping stripes each object across all the disks in the system and needs very little workload information for making placement decisions, narrow striping techniques stripe an object across a subset of the disks and employ detailed workload information to optimize the placement.

¹Note that here, by storage space allocation, we refer to space allocation at the granularity of logical volumes. Space internal to a logical volume is managed by a file system or a database manager, as the case may be.
In this work, we perform a systematic study of the tradeoffs of narrow and wide striping to determine their suitability for large-scale storage systems. The work involved (i) simulations driven by OLTP traces and synthetic workloads, and (ii) experiments on a 40-disk storage system testbed.
The results show that an idealized narrow-striped system can outperform a comparable wide-striped system for small requests. However, wide striping outperforms narrow-striped systems in the presence of the workload skews that occur in real I/O workloads; the two systems perform comparably for a variety of other real-world scenarios. The experiments indicate that the additional workload information needed by narrow placement techniques may not necessarily translate to significantly better performance and, more specifically, does not outweigh the benefits of the management simplicity innate to a wide-striped system.
1.3.2 Short-term Storage System Reconfiguration: Bandwidth Allocation
In the context of dynamic bandwidth allocation, we develop two techniques: one a measurement-based inference technique and the other based on learning.
Self-managing Bandwidth Allocation in a Multimedia File Server:
Large-scale storage systems host data objects of multiple types, which are accessed by applications with diverse service requirements. For instance, a multimedia file server services a heterogeneous mix of soft real-time streaming media and traditional best-effort requests. To provide QoS to both application types, a reservation-based approach, where the storage space is shared but a certain fraction of the bandwidth is reserved for each class, has certain advantages. By sharing storage resources, the file server
can extract statistical multiplexing gains; by reserving bandwidth, it can prevent interference among classes and meet the performance guarantees of the soft real-time class. Thus, a reservation-based approach has inherent advantages and flexibility that make it suitable for a large-scale storage system.
Dynamic workload variations, as seen in modern file servers, may mean that no single set of reservations is suitable all the time. To address this limitation, in this thesis, we develop techniques for self-managing bandwidth allocation in a multimedia file server. In our scheme, we use online measurements to infer bandwidth requirements and guide allocation decisions. A workload monitoring module tracks several parameters representative of the load within each class using a moving histogram. It tracks various aspects of resource usage from the time a request arrives to the time it is serviced by the disk. Monitored parameters include request arrival rates, request waiting times and disk utilizations within each class.
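As an illustration, the monitoring scheme described above can be sketched in a few lines. This is a simplified, hypothetical rendering (names such as `MovingHistogram`, `ClassMonitor` and `infer_allocations` are ours, not the thesis implementation's), assuming a sliding-window histogram per class and an allocation proportional to a high percentile of recent per-class utilization:

```python
from collections import deque

class MovingHistogram:
    """Histogram over a sliding window of recent samples, so old
    load conditions age out of the bandwidth estimate."""
    def __init__(self, window=1000):
        self.samples = deque(maxlen=window)  # evicts oldest automatically

    def add(self, value):
        self.samples.append(value)

    def percentile(self, p):
        """Return the p-th percentile (0-100) of the windowed samples."""
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, int(p / 100.0 * len(ordered)))
        return ordered[idx]

class ClassMonitor:
    """Tracks waiting times and disk utilization for one class."""
    def __init__(self):
        self.waiting = MovingHistogram()
        self.util = MovingHistogram()

    def record_request(self, wait_time, disk_util):
        self.waiting.add(wait_time)
        self.util.add(disk_util)

def infer_allocations(monitors, pct=95):
    """Allocation hint: give each class a bandwidth share proportional
    to a high percentile of its recent utilization."""
    demand = {c: m.util.percentile(pct) for c, m in monitors.items()}
    total = sum(demand.values()) or 1.0
    return {c: d / total for c, d in demand.items()}
```

In the actual system, allocations are enforced by the disk scheduler rather than computed as normalized fractions; the sketch only shows how a moving histogram can turn raw measurements into allocation hints.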
Requests within the best-effort class desire low average response times, while those within the real-time class have associated deadlines that must be met. We instrument an existing disk scheduling algorithm to take these disparate performance requirements into account while enforcing allocations and making scheduling decisions.
A simulation study using NFS file-server traces as well as synthetic workloads demonstrates that our techniques (i) provide control over the time scale of allocation via tunable parameters, (ii) have stable behavior during overload, and (iii) provide significant advantages over static bandwidth allocation.
Learning-based Approach for Dynamic Bandwidth Allocation:
An alternative to a measurement-based inference technique for bandwidth allocation is reinforcement learning. An advantage of using reinforcement learning is that no prior training of the system is required; the technique allows the system to learn online. Moreover, a learning-based approach can also handle complex non-linearity in system behavior.
In this problem, we assume multiple application classes, each of which specifies its QoS requirement in the form of an average response time goal.
A simple learning approach is one that systematically tries out all possible allocations for each system state, computes a cost function, and stores these values to guide future allocations. Although such a scheme is simple to design and implement, it has prohibitive memory and search space requirements; this is because the number of possible allocations increases exponentially with the number of classes.
A key contribution of our work is the design of an enhanced learning-based approach that uses the semantics of the problem to overcome the drawbacks of the naive learning approach. The technique takes the current system state into account while making allocation decisions and thereby avoids allocations that are clearly inappropriate for a particular state; in other words, the optimized technique intelligently guides and restricts the allocation space explored. These design decisions result in a substantial reduction in memory and search space requirements, making a practical implementation feasible.
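To make the idea concrete, the following sketch shows one way state-aware pruning can shrink the allocation space. It is an illustrative simplification, not the thesis algorithm: the candidate-generation rule and names such as `neighbors` and `Learner` are our assumptions. Rather than enumerating every allocation vector, it only considers moves that shift bandwidth from a class meeting its response-time goal to a class violating it:

```python
import random
from collections import defaultdict

STEP = 0.1  # bandwidth moves in discretized 10% steps

def neighbors(alloc, violations):
    """State-aware pruning: only consider moving one STEP of bandwidth
    from a class meeting its goal to a class violating its goal, instead
    of enumerating every possible allocation vector."""
    moves = []
    for loser, share in alloc.items():
        if violations[loser] or share <= STEP:
            continue
        for gainer in alloc:
            if gainer != loser and violations[gainer]:
                new = dict(alloc)
                new[loser] = round(new[loser] - STEP, 2)
                new[gainer] = round(new[gainer] + STEP, 2)
                moves.append(new)
    return moves or [alloc]   # nothing to fix: keep current allocation

class Learner:
    def __init__(self, epsilon=0.1):
        self.q = defaultdict(float)   # (state, allocation) -> learned cost
        self.epsilon = epsilon

    def choose(self, state, alloc, violations):
        candidates = neighbors(alloc, violations)
        if random.random() < self.epsilon:      # occasional exploration
            return random.choice(candidates)
        key = lambda a: self.q[(state, tuple(sorted(a.items())))]
        return min(candidates, key=key)         # lowest learned cost

    def update(self, state, alloc, cost, lr=0.5):
        k = (state, tuple(sorted(alloc.items())))
        self.q[k] += lr * (cost - self.q[k])    # running cost estimate
```

Because only a handful of neighboring allocations is ever considered per state, the cost table grows with the states actually visited rather than with the full exponential allocation space.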
We implement these techniques in the Linux kernel and use the software RAID driver in Linux to configure the disk array. The results show that (i) the use of learning enables the storage system to reduce the impact of QoS violations by over a factor of two, and (ii) the implementation overheads of employing such techniques in operating system kernels are small.
1.3.3 Long-term Storage System Reconfiguration: Automated Object Remapping
Suitable initial placement obviates the need for frequent reconfiguration, and automated bandwidth allocation, which uses controlled request throttling, helps extract good performance from the system in the face of transient workload changes. Persistent workload changes, however, which stress the storage system and result in hotspots, make it necessary to tune the mapping of storage objects to arrays to ensure agreeable performance.
Moving the system to a new configuration involves executing a migration plan, which is a sequence of object moves. The reconfiguration itself could be carried out either online or offline. In both cases, the scale of the reconfiguration, i.e., the amount of data that needs to be displaced, is of consequence. For an offline reconfiguration, the scale of the reconfiguration determines its duration and hence the downtime; for an online reconfiguration, it determines the duration of the performance impact on foreground applications. Existing approaches do not optimize for the scale of the reconfiguration, possibly moving much more data than required to remove the hotspot.
To address this limitation, we develop algorithms to minimize the amount of data displaced during a reconfiguration to remove hotspots in large-scale storage systems. Rather than identifying a new configuration from scratch, which may entail significant data movement, our novel approach uses the current object configuration as a hint, the goal being to retain most of the objects in place and thus limit the scale of the reconfiguration. To minimize the amount of data that needs to be moved, we use a greedy approach guided by the bandwidth-to-space ratio (BSR): by greedily selecting high-BSR objects for reassignment, one can displace more bandwidth per unit of data moved. Finally, we use various optimizations, including searching for multiple solutions, to counter some of the pitfalls of a greedy approach.
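A minimal sketch of such BSR-guided greedy selection is shown below. The data layout and the target-selection policy are hypothetical, and the real algorithms add the optimizations mentioned above (e.g., searching for multiple solutions):

```python
def plan_moves(objects, overloaded, targets, excess_bw):
    """Greedy hotspot-removal sketch: move high-BSR objects off the
    overloaded array until its excess bandwidth is displaced.
    objects: list of dicts with 'name', 'array', 'bw', 'size'."""
    # Candidates live on the overloaded array; highest BSR first, so each
    # unit of data moved displaces the most bandwidth.
    cands = sorted((o for o in objects if o["array"] == overloaded),
                   key=lambda o: o["bw"] / o["size"], reverse=True)
    plan, displaced, moved_data = [], 0.0, 0.0
    for obj in cands:
        if displaced >= excess_bw:
            break
        # Hypothetical policy: pick the target array with the most spare
        # bandwidth, subject to space and bandwidth constraints.
        dest = max(targets, key=lambda t: t["spare_bw"])
        if dest["spare_bw"] < obj["bw"] or dest["spare_space"] < obj["size"]:
            continue
        dest["spare_bw"] -= obj["bw"]
        dest["spare_space"] -= obj["size"]
        plan.append((obj["name"], dest["name"]))
        displaced += obj["bw"]
        moved_data += obj["size"]
    return plan, moved_data
```

The sketch illustrates why BSR is the right greedy metric: sorting by bandwidth per unit size directly minimizes `moved_data` for a given amount of displaced bandwidth.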
We evaluate our techniques using a combination of simulation studies and an evaluation of an implementation in the Linux kernel. Results from the simulation study suggest that, for a variety of system configurations, our approach reduces the amount of data moved to remove the hotspot by a factor of two compared to other approaches. The gains increase with system size and the magnitude of overload. Experimental results from the prototype evaluation suggest that our measurement techniques correctly identify workload hotspots. For the simple overload configurations considered in the prototype, our approach identifies a load-balanced configuration that minimizes the amount of data
moved. Moreover, the kernel enhancements do not result in any noticeable degradation in
application performance.
1.4 Dissertation Outline
We now present a brief outline of the dissertation. In Chapter 2, we present an evaluation of object placement techniques in storage systems. In Chapter 3, we consider the problem of self-managing bandwidth allocation in the context of a multimedia file server. Chapter 4 discusses a learning-based approach for dynamic bandwidth allocation to meet the QoS requirements of multiple application classes. In Chapter 5, we address the problem of automated object remapping to load-balance large-scale storage systems. We conclude with a brief summary of the research contributions and future research directions in Chapter 6.
CHAPTER 2
PLACEMENT TECHNIQUES IN A SELF-MANAGING STORAGE SYSTEM
In Chapter 1, we argued that an ill-configured storage system can be detrimental to application performance and may increase the burden on the administrator. A well-configured system, on the other hand, obviates the need for frequent reconfiguration. The initial configuration of a storage system is therefore particularly significant.
In this chapter, we focus on the initial configuration task of object placement. Storage allocation algorithms determine object placement, and thus performance, and are crucial to the success of a storage system. For a self-managing storage system, a suitable placement technique is one that has low management overhead and delivers agreeable performance.
Object placement techniques for large storage systems have been extensively studied in the last decade, most notably in the context of disk arrays such as RAID [9, 20, 21, 26, 50]. Most of these approaches are based on striping—a technique that interleaves the placement of objects onto disks—and can be classified into two fundamentally different categories. Techniques in the first category require a priori knowledge of the workload and use either analytical or empirically derived models to determine an optimal placement of objects onto the storage system [9, 20, 50]. An optimal placement is one that balances the load across disks, minimizes the response time of individual requests and maximizes the throughput of the system. Since requests accessing independent stores can interfere with one another, these placement techniques often employ narrow striping—where each object is interleaved across a subset of the disks in the storage system—to minimize such interference and provide isolation. An alternate approach is to assume that detailed knowledge of the workload
is difficult to obtain a priori and to use wide striping—where all objects are interleaved across all disks in the storage system. The premise behind these techniques is that storage workloads vary at multiple time scales and often in an unpredictable fashion, making the task of characterizing these workloads complex. In the absence of precise knowledge, striping all objects across all disks yields good load-balancing properties. A potential limitation, though, is the interference between independent requests that access the same set of disks.
Although narrow striping is both advocated by the research literature and widely used in practice, at least one major database vendor has recently advocated the use of wide striping to simplify storage administration [1, 38]. However, no systematic study of the two techniques exists in the literature.
From the perspective of management complexity, these two techniques have fundamentally different implications. A storage system that employs narrow striping requires each allocation request to specify detailed workload parameters so that the system can determine an optimal placement for the allocated store. In contrast, systems employing wide striping require little, if any, knowledge about the workload for making storage allocation decisions. Thus, wide-striped systems are easier to design and use, while narrow-striped systems can potentially make better storage decisions. This results in a simplicity versus performance tradeoff—wide-striped systems gain simplicity by requiring less workload information, which can potentially result in worse performance; the opposite is true for a narrow-striped system. Narrow striping can extract performance gains only if the workload specification is precise. It is not a priori evident whether narrow striping can make better storage decisions when the workload specification is imprecise or incorrect (the accuracy of the workload information is not an issue in wide striping, since no such information is required for placement decisions). Although the placement of objects in large storage systems has been extensively studied [7, 9, 10, 50], surprisingly, no systematic
Figure 2.1. Narrow and wide striping in an enterprise storage system. (The figure shows Stores 1, 2, 3, ..., K placed on Arrays 1, 2, ..., N: under narrow striping each store resides on a single array, while under wide striping each store is spread across all arrays.)
study of these tradeoffs of wide and narrow striping exists in the literature. Our work seeks to address this issue by answering the following questions:

• Is narrow or wide striping better suited for large-scale storage systems? Specifically, does the additional workload information required by narrow striping translate into significant performance gains?

• From a performance standpoint, how do narrow and wide striping compare against one another? What is the impact of interference between requests accessing the same set of disks in wide striping? Similarly, what is the impact of imprecise workload knowledge and the resulting load skews in narrow striping?
2.1 Background and Problem Description
An enterprise storage system consists of a large number of disk arrays. A disk array is essentially a collection of physical disks that presents an abstraction of one or more logical disks to the rest of the system. Disk arrays map objects onto disks by interleaving data from each object (e.g., a file) onto successive disks in a round-robin manner—a process
referred to as striping. The unit of interleaving, referred to as a stripe unit, denotes the maximum amount of logically contiguous data stored on a single disk; the number of disks across which each data object is striped is referred to as its stripe width. As a result of striping, each read or write request potentially accesses multiple disks, which enables applications to extract I/O parallelism across disks and, to an extent, prevents hot spots by dispersing the application load across multiple disks. Disk arrays can also provide fault tolerance by guarding against data loss due to a disk failure. Depending on the exact fault tolerance technique employed, disk arrays are classified into different RAID levels [45]. A RAID level 0 (or simply, RAID-0) array is non-redundant and cannot tolerate disk failures; it does, however, employ striping to enhance I/O throughput. A RAID-1 array employs mirroring, where data on a disk is replicated on another disk for fault-tolerance purposes. A RAID-1+0 array combines mirroring with striping, essentially by mirroring an entire RAID-0 array. A RAID-5 array uses parity blocks for redundancy—each parity block guards a certain number of data blocks—and distributes parity blocks across disks in the array.
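The round-robin mapping implied by striping can be made concrete with a short helper. This is a generic illustration (our own function, ignoring parity placement such as RAID-5's rotated parity), mapping a logical block number to a (disk, offset) pair given the stripe unit and stripe width:

```python
def locate_block(logical_block, stripe_unit_blocks, stripe_width):
    """Round-robin striping sketch: map a logical block number to
    (disk index, block offset on that disk)."""
    stripe_unit = logical_block // stripe_unit_blocks   # which unit overall
    within = logical_block % stripe_unit_blocks         # offset inside unit
    disk = stripe_unit % stripe_width                   # round-robin disk
    stripe_row = stripe_unit // stripe_width            # full stripes before
    return disk, stripe_row * stripe_unit_blocks + within
```

For example, with a 4-block stripe unit on 5 disks, logical blocks 0-3 land on disk 0, blocks 4-7 on disk 1, and so on, wrapping back to disk 0 for the second stripe row.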
With the above background, consider a storage system that consists of a certain number of RAID arrays. In general, RAID arrays in large storage systems may be heterogeneous—they may consist of different numbers of disks and may be configured using different RAID levels. For simplicity, we assume that all arrays in the system are homogeneous. The primary goal in such systems is to allocate storage to applications such that their performance needs are met. The storage allocated on one or more arrays is referred to as a store [7]; the data on a store is collectively referred to as a data object (e.g., a tablespace, a file system). The sequence of requests accessing a store is referred to as a request stream. Thus, we are concerned with the storage allocation problem at the granularity of stores and data objects; we are less concerned with how each application manages its allocated store to map individual data items, such as files and database tables, to disks.
We need to make two decisions when allocating a store to a data object: (1) RAID level selection: The RAID level chosen for the store depends on the fault-tolerance needs of the application and the workload characteristics. From the workload perspective, RAID-1+0 (mirroring combined with striping) may be appropriate for workloads with small writes, while RAID-5 is appropriate for workloads with large writes.¹ (2) Mapping of stores onto arrays: One can map each store onto one or more disk arrays. If narrow striping is employed, each store is mapped onto a single array (and the data object is striped across disks in that array). Alternatively, one may construct a store by logically concatenating storage from multiple disk arrays and stripe the object across these arrays (a logical volume manager can be used to construct such a store). In the extreme case where wide striping is used, each store spans all arrays in the system and the corresponding data object is striped across all arrays (Figure 2.1 pictorially depicts narrow and wide striping). Since the RAID-level selection problem has been studied in the literature [11, 61], we focus only on the problem of mapping stores onto arrays.
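The difference between the two mapping strategies can be sketched as follows. Both functions are hypothetical illustrations: the narrow placer here greedily balances estimated bandwidth (the detailed workload information narrow striping depends on), while the wide placer needs no workload information at all:

```python
def map_stores_narrow(stores, arrays):
    """Narrow striping sketch: place each store entirely on one array,
    greedily choosing the least-loaded array, using each store's
    (estimated) bandwidth demand."""
    placement = {}
    for store in sorted(stores, key=lambda s: s["bw"], reverse=True):
        dest = min(arrays, key=lambda a: a["load"])  # least-loaded array
        dest["load"] += store["bw"]
        placement[store["name"]] = [dest["name"]]
    return placement

def map_stores_wide(stores, arrays):
    """Wide striping: every store spans every array; no workload
    information is needed."""
    names = [a["name"] for a in arrays]
    return {s["name"]: list(names) for s in stores}
```

The contrast captures the tradeoff discussed above: the narrow placer's quality depends entirely on how accurate the per-store bandwidth estimates are, whereas the wide placer has nothing to get wrong at allocation time.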
The choice of narrow or wide striping for mapping stores onto arrays results in different tradeoffs. Wide striping can result in interference when streams accessing different stores have correlated access patterns. Such interference occurs when a request arrives at the disk array and sees requests accessing other stores queued up at the array; this increases queuing delays and can affect store throughput. Observe that such interference is possible even in narrow striping when multiple stores are mapped onto a single array. However, one can reduce the impact of interference in narrow striping by mapping stores with anti-correlated access patterns onto a single array. The effectiveness of such optimizations depends on the degree to which the workload can be characterized precisely at storage allocation time, and the degree to which request streams are actually anti-correlated. No such optimizations can be performed in wide striping, since all stores are mapped onto all arrays. An orthogonal
¹Small writes in RAID-5 require a read-modify-write process, making them inefficient. In contrast, large (full-stripe) writes are efficient since no reads are necessary prior to a write.
issue is the inability of wide striping to exploit the sequentiality of I/O requests. In wide striping, sequential requests from an application get mapped to data blocks on consecutive arrays. Consequently, sequentiality at the application level is not preserved at the storage system level. In contrast, large sequential accesses in narrow-striped systems result in sequential block accesses at the disk level, enabling these arrays to reduce disk overhead and improve throughput.
Despite the above advantages, a potential limitation of narrow striping is its susceptibility to load imbalances. Recall that narrow striping requires a priori information about the application workload to map stores onto arrays such that the arrays are load-balanced. In the event that the actual workload deviates from the expected workload, load imbalances will result. Such load skews may require reorganization of data across arrays to re-balance the load, which can be expensive. In contrast, wide striping is more resilient to load imbalances, since all stores are striped across all arrays, causing load increases to be dispersed across the arrays in the system.
Finally, narrow and wide striping require varying amounts of information to be specified at storage allocation time. In particular, narrow striping requires detailed workload information for load-balancing purposes and to minimize interference from overlapping request streams. In contrast, wide striping requires only minimal workload information, to determine parameters such as the stripe unit size and the RAID level.
The objective of our study is to quantify the above tradeoffs and to determine the suitability of narrow and wide striping for large storage systems.
2.2 Experimental Evaluation
We evaluate the tradeoffs of narrow and wide striping using simulations and experiments on a storage system testbed.
Our storage system simulator simulates a system with multiple RAID-5 arrays; each RAID-5 array is assumed to consist of five disks (four data disks and a parity disk, referred to as
Minimum Seek 0.6 msAverage Seek 4.7 msMaximum Seek 11.0 msRotational Latency 5.98 msRotational speed 10,000 RPMMaximum Transfer Rate 39.16 MB/s
Table 2.1.Characteristics of the Fujitsu Disk
a 4+p configuration). The data layout in RAID-5 arrays is left-symmetric. Each disk in the
system is modeled as an 18 GB Fujitsu MAJ3182MC disk; the characteristics of this disk
are shown in Table 2.1. The disk head movement is modeled as in[21]. We also incorporate
a write-back LRU cache to capture the effect of the storage controller cache. The cache size
is varied linearly with the number of arrays in the storage system, with 64 MB of cache per
array. The cache also employs an early destage policy to evict dirty buffers.
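For concreteness, the left-symmetric placement rule can be sketched as follows. This is the standard formulation of left-symmetric RAID-5 layout, not code from the simulator itself:

```python
def left_symmetric_raid5(block, n_disks=5):
    """Map a logical block number to (disk, stripe, parity_disk) in a
    left-symmetric RAID-5 layout over n_disks disks (a 4+p array when
    n_disks is 5). Parity rotates from the last disk towards the first
    across stripes, and each stripe's data blocks start just after its
    parity disk, so sequential logical blocks visit every disk in turn."""
    data_per_stripe = n_disks - 1
    stripe = block // data_per_stripe
    parity_disk = (n_disks - 1 - stripe) % n_disks
    disk = (parity_disk + 1 + block % data_per_stripe) % n_disks
    return disk, stripe, parity_disk

# Logical blocks 0..4 land on disks 0,1,2,3,4: a sequential scan keeps
# all five spindles busy, which is the point of the left-symmetric layout.
```
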
Our storage system testbed consists of an IBM TotalStorage FAStT-700 storage subsystem equipped with 40 18 GB disks. The storage subsystem is connected over Fibre Channel to a 1.6 GHz Pentium 4 server with 512 MB RAM running Linux 2.4.18. The specific RAID configurations used in our experiments are described in the corresponding experimental sections.
Depending on whether narrow or wide striping is used, each object (and the corresponding store) is either placed on a single array or striped across all arrays in the system. We assume each store is allocated a contiguous amount of space on each disk. Each data object in the system is accessed by a request stream; a request stream is essentially an aggregation of requests sent by different applications to the same store. For example, a request stream for an OLTP application is the aggregation of I/O requests triggered by various transactions. We use a combination of synthetic and trace-driven workloads to generate request streams in our simulations; the characteristics of these workloads are described in the next section.
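The two placement policies differ only in which arrays a store's stripe units may occupy. A minimal sketch, assuming a round-robin unit-to-array mapping and hypothetical array identifiers (neither is specified in the dissertation):

```python
def unit_to_array(store_arrays, unit_index):
    """Map the unit_index-th stripe unit of a store to one of the arrays
    assigned to that store. Under narrow striping the store is assigned a
    single array; under wide striping it is assigned every array in the
    system."""
    return store_arrays[unit_index % len(store_arrays)]

all_arrays = list(range(4))                                # hypothetical 4-array system
narrow = [unit_to_array([2], u) for u in range(8)]         # store pinned to array 2
wide = [unit_to_array(all_arrays, u) for u in range(8)]    # units spread over all arrays
```
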
2.2.1 Experimental Methodology
Recall that narrow striping algorithms optimize storage system throughput by (i) collocating objects that are not accessed together (i.e., collocating objects with low or zero access correlation so as to reduce interference), and (ii) balancing the load on various arrays. The actual system performance depends on the degree to which the system can exploit each dimension. Consequently, we compare narrow and wide striping by systematically studying each dimension: we first vary the interference between request streams and then the load imbalance.
Our baseline experiment compares a perfect narrow striped system with the corresponding wide striped system. In case of narrow striping, we assume that all arrays are load balanced (have the same average load) and that there is no interference between streams accessing an array. However, these streams will interfere when wide striped, and our experiment quantifies the resulting performance degradation. Observe that the difference between narrow and wide striping in this case represents the upper bound on the performance gains that can be accrued due to intelligent narrow striping. Our experiment also quantifies how this bound varies with system parameters such as request rates, request size, system size, stripe unit size, and the fraction of read and write requests.
Next, we compare a narrow striped system with varying degrees of interference to a wide striped system with the same workload. To introduce interference in narrow striping, we assume that each array stores two independent objects. We keep the arrays load balanced and vary the degree of correlation between streams accessing the two objects (thereby introducing varying amounts of interference). We compare this system to a wide striped system that sees an identical workload. The objective of our experiment is to quantify the performance gains due to narrow striping, if any, in the presence of inter-stream interference. Note that narrow striped systems will encounter such interference in practice, since (i) it is difficult to find perfectly anti-correlated streams when collocating stores, and (ii) imprecise workload information at storage allocation time may result in inter-stream interference at run-time.
We then study the impact of load imbalances on the relative performance of wide and narrow striping. Specifically, we consider a narrow striped system where the load on arrays is balanced using the average load of each stream. We then study how dynamic variations in the workload can cause load skews even when the arrays are load balanced based on the mean load. We also study the effectiveness of wide striping in countering such load skews due to its ability to disperse load across all arrays in the system.
Our final set of experiments compares the performance of narrow and wide striping using two well-known database benchmarks, TPC-C and TPC-H. We also study the effects of interference and load variations on the two systems.
Together, these scenarios enable us to quantify the tradeoffs of the two approaches along
various dimensions. We now discuss the characteristics of the workloads used in our study.
Workload characteristics: We use a combination of synthetic workloads, real-world traces and database benchmarks to generate the request streams in our study. Whereas trace workloads are useful for understanding the behavior of wide and narrow striping in real-world scenarios, synthetic workloads allow us to systematically explore the parameter space and quantify the behavior of the two techniques over a wide range of system parameters. Database benchmarks, on the other hand, allow for comparisons based on “standardized” workloads. Consequently, we use a combination of these workloads for our study.
Our synthetic workloads are generated using two types of processes. (1) Poisson ON-OFF process: The ON and OFF periods of such a process are exponentially distributed. Request arrivals during the ON period are assumed to be Poisson. Successive requests are assumed to access random locations on the store. The use of an ON-OFF process allows us to carefully control the amount of interference between streams. Two streams are anti-correlated when they have mutually exclusive ON periods; they are perfectly correlated when their ON periods are synchronized. The degree of correlation can be varied by varying the amount of overlap in the ON periods of streams. (2) Closed-loop process: A closed-loop process with concurrency N consists of N concurrent clients that issue requests continuously, i.e., each client issues a new request as soon as the previous request completes. The request sizes are assumed to be exponentially distributed and successive requests access random locations on the store.

Name          Read req.    Write req.   Mean req.     Request
              rate (IOPS)  rate (IOPS)  (bytes/req)   Streams
OLTP 1         28.27        93.79        3466         24
OLTP 2         74.31        15.93        2449         19
Web Search 1  334.91         0.07       15509          6
Web Search 2  297.48         0.06       15432          6
Web Search 3  188.01         0.06       15772          6

Table 2.2. Summary of the Traces. IOPS denotes the number of I/O operations per second.
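As a concrete sketch, the Poisson ON-OFF process can be generated as below. The parameter names and the phase-shift mechanism for controlling ON-period overlap are our own illustrative choices; the dissertation does not give code:

```python
import random

def on_off_arrivals(mean_on, mean_off, rate, horizon, phase=0.0, seed=0):
    """Generate request arrival times for one Poisson ON-OFF stream.
    ON and OFF durations are exponentially distributed; arrivals within
    an ON period form a Poisson process with the given rate. Offsetting
    `phase` shifts the ON periods, which is one way to control the
    overlap (and hence the correlation) between two streams."""
    rng = random.Random(seed)
    t, arrivals = phase, []
    while t < horizon:
        on_end = t + rng.expovariate(1.0 / mean_on)    # ON period length
        a = t + rng.expovariate(rate)
        while a < min(on_end, horizon):                # Poisson arrivals in ON
            arrivals.append(a)
            a += rng.expovariate(rate)
        t = on_end + rng.expovariate(1.0 / mean_off)   # OFF period follows
    return arrivals
```

Two anti-correlated streams would use schedules whose ON windows never overlap; synchronized schedules give perfectly correlated streams, and intermediate phase shifts give intermediate degrees of correlation.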
Both the Poisson ON-OFF and closed-loop processes can generate two types of request streams: those that issue small requests and those that issue large requests. Streams with large requests are representative of decision support systems (DSS), while those with small requests represent OLTP applications. Since DSS workloads access large amounts of data, we assume a mean request size of 1 MB for large requests. On the other hand, since OLTP applications generate small requests, we use 4 KB for small requests; the request sizes are assumed to be exponentially distributed. Prior studies have used similar parameters [7]. The stripe unit sizes of the stores being accessed by large and small requests were set to 512 KB and 4 KB, respectively.
We also use a collection of block-level I/O trace workloads for our study; the characteristics of these traces are listed in Table 2.2. Traces labeled OLTP 1 and OLTP 2 are I/O workloads from OLTP applications of two large financial institutions and have different mixes of read and write requests. Traces labeled Web Search 1 through 3 are I/O traces from a popular web search engine and consist mostly of read requests. Thus, the traces represent different storage environments and, as shown in Table 2.2, have different characteristics.
2.2.2 Ideal Narrow Striping versus Wide Striping
We first compare a load-balanced, interference-free narrow striped system with a wide striped system using homogeneous and heterogeneous workloads. In the case of a homogeneous workload, all streams generate requests of similar sizes. In the case of a heterogeneous workload, streams generate requests of different sizes.
2.2.2.1 Comparison using Homogeneous Workloads
We compare narrow and wide striping, first for small request sizes and then for large requests. Our simulations assume that each narrow striped array consists of a single store (and a single request stream), while all stores are striped across all arrays in wide striping. We use closed-loop workloads to generate request streams; the concurrency factor for each large and small closed-loop workload was assumed to be 2 and 4, respectively. We vary the number of arrays in the system, i.e., the system size, and measure the average response time in the two systems. Figures 2.2(a) and 2.2(b) depict the response times for large and small request sizes, respectively, in the two systems. When the system size is 1 (i.e., a single array accessed by a single stream), narrow and wide striping are identical. Further, since each request stream accesses a different array in narrow striping, the system size has no impact on the response time. In other words, the performance of narrow striping is represented by the system size of 1 (and remains unchanged). In contrast, the response time of wide striping degrades with increasing system sizes. This is primarily due to increased interference between request streams. However, as shown in Figure 2.2, the impact of such interference increases slowly with system size. Overall, we find that wide striping sees response times that are 10-20% worse than narrow striping.

[Figure 2.2 (plots omitted). Effect of system size for homogeneous closed-loop workloads; panels: (a) large requests, (b) small requests; axes: mean response time (ms) vs. system size (# of arrays). System size of 1 depicts narrow striping.]
We validate the results of the above simulation experiment using our FAStT-700 storage testbed. We configure the FAStT with two RAID-5 arrays (4+p configuration). We create two stores, each 2 GB in size, on the storage system. For large requests, the stripe unit size of the store is 256 KB, and for small requests it is configured to be 8 KB. We used the Linux Logical Volume Manager (LVM) for wide striping the stores. The mean request size for large and small requests is chosen to be 512 KB and 8 KB, respectively. We compare narrow and wide striping using closed-loop workloads with different concurrency factors (see Figure 2.3). As Figure 2.3 demonstrates, the response time in the wide-striped system is about 10-15% higher than in the narrow-striped system, which is consistent with the results of our simulations.

[Figure 2.3 (plots omitted). Homogeneous Workload: Closed-loop Testbed Experiments; panels: (a) large requests, (b) small requests; axes: mean response time (ms) vs. client concurrency factor; bars compare Narrow and Wide.]

In addition to the above experiments, we compared narrow and wide striping by varying a variety of system parameters such as the stripe unit size, the request size, the utilization level, and the percentage of write requests. Our experiments were carried out for both closed-loop and open-loop Poisson ON-OFF workloads. In each case, we found that, if the stripe unit size is chosen carefully, the performance of narrow and wide striped systems is comparable and within 10-15% of one another. To avoid repetition, we present the results from these experiments in the Appendix.
2.2.2.2 Comparison using Heterogeneous Workloads
To introduce heterogeneity into the system, we assume that each narrow striped array consists of two stores, one accessed by large requests and the other by small requests (we denote these request streams as Li and Si, respectively, where i denotes the ith array in the system). In case of narrow striping, we ensure that only one of these streams is active at any given time. This is achieved by assuming that Li and Si are anti-correlated (have mutually exclusive ON periods). We do not assume any correlations between streams accessing independent arrays (i.e., between streams Li and Lj, or Si and Sj). Consequently, like before, the narrow striped system is load-balanced and free of inter-stream interference. The wide striped system, however, sees a heterogeneous workload due to the simultaneous presence of small and large requests.
We use Poisson ON-OFF processes to understand the effect of various parameters such as system size, stripe unit size, utilization level, etc. As before, we assume a mean request size of 1 MB for large requests and 4 KB for small requests. The default stripe unit size is chosen to be 512 KB and 4 KB for the corresponding stores. Unless specified otherwise, we chose request rates that yield a utilization of around 60-65%; this corresponds to a mean inter-arrival (IA) time of 17 ms for large requests and 4 ms for small requests.
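A rough sanity check on these utilization figures, under our own simplifying assumptions (MB = 10^6 bytes, positioning cost of one average seek plus half a rotation, and each 1 MB request touching two of the four data disks at the 512 KB stripe unit); this is a back-of-envelope estimate, not the simulator's exact service model:

```python
# Back-of-envelope per-disk utilization for the large-request stream on a
# narrow striped 4+p array, using the disk parameters from Table 2.1.
avg_seek_ms = 4.7                          # average seek (Table 2.1)
half_rotation_ms = 0.5 * 60_000 / 10_000   # 3 ms at 10,000 RPM
transfer_ms = 512e3 / 39.16e6 * 1e3        # ~13.1 ms per 512 KB stripe unit

service_ms = avg_seek_ms + half_rotation_ms + transfer_ms   # ~20.8 ms
# A 1 MB request touches 2 of the 4 data disks, so each disk serves half
# the requests: effective per-disk inter-arrival time is 2 * 17 ms.
per_disk_ia_ms = 2 * 17.0
utilization = service_ms / per_disk_ia_ms
print(round(utilization, 2))   # ~0.61, within the stated 60-65% range
```
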
Effect of System Size: We vary the number of arrays in the system from 1 to 10 and measure the average response time of the requests for wide and narrow striping. Since each array is independent in narrow striping, the system size has no impact on the performance of an individual array. Hence, like before, the performance of narrow striping is represented by a system size of 1 (and remains unchanged, regardless of the system size). Figures 2.4(a) and 2.4(b) show the response times of large and small requests, respectively, for varying system sizes. The figure shows that while large requests see comparable response times in wide striping, small requests see worse performance. To understand this behavior, we note that two counteracting effects come into play in a wide-striped system. First, since stores span all arrays, there is better load balancing across arrays, yielding smaller response times. Second, requests see additional queues that they would not have seen in a narrow striped system, which increases the response time. This is because wide-striped streams access all arrays and interfere with one another. Hence, a small request might see a large request ahead of it, or a large request might see another large request from an independent stream, neither of which can happen in a narrow striped system. Our experiment shows that, for large requests, as one increases the system size, the benefits of better load balancing offset the slight degradation due to the interference; this is primarily due to the large size of the requests. For small requests, the interference effect dominates (since a large request can substantially slow down a small request), leading to a higher response time in wide striping. Observe that the response time is higher by approximately the transfer time of one stripe unit of a large request.

[Figure 2.4 (plots omitted). Effect of system size for heterogeneous Poisson workloads; panels: (a) large requests, (b) small requests; axes: mean response time (ms) vs. system size (# of arrays). System size of 1 depicts narrow striping.]
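The magnitude of this gap is easy to check against Table 2.1, assuming MB = 10^6 bytes and KB = 10^3 bytes; transferring one large-request stripe unit at the disk's peak rate takes on the order of 13 ms:

```python
# Transfer time of one 512 KB stripe unit at the Fujitsu disk's maximum
# transfer rate of 39.16 MB/s (decimal units assumed).
stripe_unit_bytes = 512e3
rate_bytes_per_s = 39.16e6
t_ms = stripe_unit_bytes / rate_bytes_per_s * 1e3
print(round(t_ms, 1))   # ~13.1 ms
```
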
Effect of Stripe Unit Size: In this experiment, we evaluate the impact of the stripe unit size in wide and narrow striping. We vary the stripe unit size from 64 KB to 2 MB for large requests, and fix the stripe unit size for small requests at 4 KB. Since the stripe unit size of small requests did not have much impact on performance, we omit these results.

First, consider the impact of varying the large stripe unit size on large requests (see Figure 2.5(a)). When the large stripe unit is 64 KB, a request of 1 MB size causes an average of 16 blocks to be accessed per request. In case of narrow striping, since each stream is striped across a 4+p array, multiple blocks are accessed from each disk by a request. Since these blocks are stored contiguously, the disks benefit from sequential accesses to large chunks of data, which reduces disk overheads. In wide striping, each 1 MB request accesses a larger number of disks, which reduces the number of sequential accesses on a disk and also increases the queue interference for both large and small requests. Consequently, narrow striping outperforms wide striping for a 64 KB stripe unit size. As we increase the stripe unit size to 512 KB and beyond, the impact of the loss in sequential access goes down. This, coupled with the larger number of disk heads that are available for each request in wide striping, leads to better performance for wide striping. Since the stripe unit is not varied for small requests, their performance is impacted mainly by the utilization levels resulting from the different stripe unit choices for large requests (see Figure 2.5(b)). For small requests, due to the interference from large requests, wide striping leads to higher response times. Since disk overhead, and consequently utilization, is higher in wide striping at smaller stripe unit sizes, small requests see worse response times. As the stripe unit size is increased, the disk overhead decreases and hence the relative response time performance of wide striping improves. But beyond 512 KB, the transfer times of the large stripe units become significant, and the response times of the small requests increase in wide striping.

[Figure 2.5 (plots omitted). Effect of varying the stripe unit size of large requests; panels: (a) large requests, (b) small requests; axes: mean response time (ms) vs. stripe-unit size (KB); curves for system sizes 1, 2, 3, 5, 10, 15. System size of 1 depicts narrow striping.]
Effect of the Utilization Level: In this experiment, we study the impact of the utilization level on the response times of wide and narrow striping. We vary the utilization level by varying the inter-arrival (IA) times of requests. We first vary the IA times of large requests from 11 ms to 20 ms with the IA time of small requests fixed at 4 ms (see Figure 2.6). We then vary the IA times of small requests from 2 ms to 7 ms with the IA time for large requests fixed at 17 ms (see Figure 2.7). The various combinations of inter-arrival times and background loads result in utilizations ranging from 40% to 80%.

[Figure 2.6 (plots omitted). Effect of the inter-arrival times of large requests; panels: (a) large requests, (b) small requests; axes: mean response time (ms) vs. mean inter-arrival time (ms); curves for system sizes 1, 2, 3, 5, 10, 15. System size of 1 depicts narrow striping.]

Figure 2.6(a) shows that, for large requests, wide striping outperforms narrow striping at high utilization levels and has slightly worse performance at low utilization levels. This is because, at higher utilization levels, the effects of striping across a larger number of arrays dominate the effects of interference, yielding better response times in wide striping (i.e., the larger number of arrays yields better statistical multiplexing gains and better load balancing in wide striping). Small requests, on the other hand, see uniformly worse performance due to the interference from large requests (see Figure 2.6(b)). The interference decreases at lower request rates, which reduces the performance gap between the two systems.

The behavior is reversed when we vary the IA time of small requests (see Figure 2.7). At low inter-arrival times, large requests see maximum interference from small requests, and wide striping yields worse response times as a result. As the IA time is increased, the interference decreases, and the load balancing effect dominates, leading to better response times in wide striping. For small requests, the response time difference between narrow and wide striping is always in the range of the transfer time for one stripe unit of a large request.
[Figure 2.7 (plots omitted). Effect of inter-arrival times of small requests; panels: (a) large requests, (b) small requests; axes: mean response time (ms) vs. mean inter-arrival time (ms); curves for system sizes 1, 2, 3, 5, 10, 15. System size of 1 depicts narrow striping.]
Effect of Request Size: In this experiment, we study the impact of the request size of large requests on the performance of wide and narrow striping. Varying the request size of small requests (in the range 2 KB-16 KB) did not have much impact, so we omit the results. We vary the average request size for large requests from 64 KB to 2 MB (see Figure 2.8). The stripe unit size was chosen to be half the average request size for large requests; the average request size as well as the stripe unit size was fixed at 4 KB for small requests.

Figure 2.8(a) demonstrates that for large streams, initially (i.e., at small request sizes), queue interference results in slightly higher (by approximately an average seek time) response times in wide striping. However, as the request size increases, the utilization increases and wide striping leads to lower response times due to better load balancing. On the other hand, for small requests, wide striping leads to larger response times, and the performance difference increases as we increase the large request size due to the increased transfer times of the large requests.
[Figure 2.8 (plots omitted). Effect of request size of large requests; panels: (a) large requests, (b) small requests; axes: mean response time (ms) vs. mean request size (KB); curves for system sizes 1, 2, 3, 5, 10, 15. System size of 1 depicts narrow striping.]

Effect of Writes: The above experiments have focused solely on read requests. In this experiment, we study the impact of write requests by varying the fraction of write requests in the workload. We vary the fraction of write requests from 10% to 90% and measure their impact on the response times in the wide and narrow striped systems. Recall that we simulate a write-back LRU cache.
We first vary the percentage of writes of the large requests with the small requests set to be read-only (see Figure 2.9). Due to the write-back nature of the cache, all write requests return immediately after updating the cache. Consequently, the response times of write requests are identical in both narrow and wide striping. Hence, the overall response times (for both reads and writes) are governed mostly by read response times and the relative fraction of reads. In general, increasing the percentage of write requests increases the background load due to dirty cache flushes as well as the effective utilization (since the parity block also needs to be updated on a write²). Both of these factors interfere with read requests. For large requests, the increased interference is offset by the better load dispersion capability of wide striping, causing wide striping to outperform narrow striping; this performance advantage improves for larger system sizes (see Figure 2.9(a)). For small requests, on the other hand, the interference effect dominates at low utilization, causing wide striping to yield worse response times (see Figure 2.9(b)). As the percentage of writes is increased beyond 70%, wide striping outperforms narrow striping. This is because the interference from background cache flushes and parity updates becomes dominant in write-intensive workloads, and wide striping yields better load balancing properties in the presence of such interference.

²Instead of reading the rest of the parity group, an intelligent array controller can read just the data block(s) being overwritten and the parity block to reconstruct the parity for the remaining data blocks. We assume that the array dynamically chooses between a read-modify-write and this reconstruction-write strategy depending on which of the two requires fewer reads.

[Figure 2.9 (plots omitted). Effect of percentage of large write requests; panels: (a) large requests, (b) small requests; axes: mean response time (ms) vs. percentage of write requests; curves for system sizes 1, 2, 3, 5, 10, 15. System size of 1 depicts narrow striping.]
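The strategy choice described in the footnote follows standard RAID-5 pre-read accounting and can be sketched as follows (a hypothetical helper, not the simulator's code):

```python
def parity_update_strategy(n_data_disks, n_written):
    """Choose how to update RAID-5 parity when n_written of the
    n_data_disks data blocks in a stripe are overwritten.
    Read-modify-write pre-reads the old data blocks being overwritten
    plus the old parity (n_written + 1 reads); reconstruct-write
    pre-reads the untouched data blocks (n_data_disks - n_written
    reads). The array picks whichever needs fewer reads."""
    rmw_reads = n_written + 1
    reconstruct_reads = n_data_disks - n_written
    if rmw_reads <= reconstruct_reads:
        return "read-modify-write", rmw_reads
    return "reconstruct-write", reconstruct_reads

# On a 4+p array: a 1-block write favors read-modify-write (2 pre-reads),
# a 3-block write favors reconstruct-write (1 pre-read), and a full-stripe
# write needs no pre-reads at all.
```
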
Next, we vary the percentage of writes for the small requests (see Figure 2.10). The large request streams issue only read requests. As we increase the percentage of small write requests, large requests see queue interference in a wide-striped system; consequently, narrow striping gives better performance. For small requests, as the write percentage is increased and the utilization goes up, the role of load balancing becomes significant and the performance of wide striping improves, giving comparable performance at high write percentages.
[Figure: two panels plotting Mean Response Time (ms) against Percentage of Writes for system sizes 1, 2, 3, 5, 10, and 15.]
(a) Large Requests (b) Small Requests
Figure 2.10. Effect of the percentage of small write requests. A system size of 1 depicts narrow striping.
2.2.2.3 Summary
The above experiments compared a load-balanced, interference-free narrow-striped sys-
tem with a wide-striped system using homogeneous and heterogeneous workloads. Our ex-
periments demonstrate that, in the case of homogeneous workloads, narrow striping yields
10-15% better response times for some scenarios, while the two systems yield compara-
ble performance for most other scenarios. In the case of heterogeneous workloads, our exper-
iments demonstrated that if the stripe unit size is chosen appropriately, then wide striping
yields better response times for large requests in most scenarios. In some cases, wide strip-
ing yields higher response times (in the range of an average seek time). For small requests,
on the other hand, wide striping yields worse performance in most scenarios (the perfor-
mance difference is in the range of the transfer time of a large stripe unit). In general, we find that
as utilization increases (for instance, by increasing the write percentage), wide striping leads
to better performance.
2.2.3 Impact of Inter-Stream Interference
While our experiments thus far have assumed an ideal (interference-free, load-balanced)
narrow-striped system, in practice, storage systems are neither perfectly load balanced nor
interference-free. In this section, we examine the impact of one of these dimensions—
inter-stream interference—on the performance of narrow and wide striping.
To introduce interference systematically into the system, we assume a narrow-striped
system with two request streams, Li and Si, on each array. Each stream is an ON-OFF
Poisson process, and we vary the amount of overlap in the ON periods of each (Li, Si) pair.
Doing so introduces different amounts of correlation (and interference) into the workload
accessing each array. Initially, streams accessing different arrays are assumed to be uncor-
related (thus, Si and Sj, as well as Li and Lj, are uncorrelated for all i, j). As before, all
streams access all arrays in wide striping. We vary the correlation between each (Li, Si)
pair from 0 to 1 and measure its impact on the response times of large and small requests
(a correlation of 0 implies that Li and Si are never ON simultaneously, while 1 implies that
they are always ON simultaneously). We control the correlation by varying the overlap frac-
tion, i.e., the mean fraction of time for which the two streams are ON simultaneously. For simplicity, we
assume that the correlated streams have the same ON periods; we also assume the OFF pe-
riod to have the same duration as the ON period. This gives us a high degree of control over
stream correlations. For a correlation of x, 0 ≤ x ≤ 0.5, the overlap fraction is uniformly
distributed between 0 and 2x. For correlations between 0.5 and 1, the overlap fraction is
uniformly distributed between 2x−1 and 1.0.
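The sampling rule above can be written down directly; note that both branches have mean x, so the expected overlap fraction equals the target correlation. A sketch under our own naming:

```python
import random

def sample_overlap_fraction(x, rng):
    """Draw an ON-period overlap fraction for an (Li, Si) stream pair with
    target correlation x, following the uniform ranges in the text."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("correlation must lie in [0, 1]")
    if x <= 0.5:
        return rng.uniform(0.0, 2 * x)     # mean = (0 + 2x)/2 = x
    return rng.uniform(2 * x - 1, 1.0)     # mean = (2x-1 + 1)/2 = x

rng = random.Random(42)
samples = [sample_overlap_fraction(0.7, rng) for _ in range(100_000)]
print(round(sum(samples) / len(samples), 2))  # close to the target 0.7
```

Averaging many draws recovers the target correlation, which is why the x-axis of Figure 2.11 is labeled "Mean Overlap Fraction (Correlation)".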
Figure 2.11 plots the impact of correlation on response time in narrow- and wide-striped
systems. As the figure demonstrates, the performance of wide striping improves with in-
creasing correlation, with wide striping performing better for both small and large request
sizes for correlation values higher than 0.25. Observe that as correlation increases, the
probability of temporary load imbalance in the narrow-striped system increases. Since
[Figure: two panels plotting Mean Response Time (ms) against Mean Overlap Fraction (Correlation) for system sizes 1, 2, 3, 5, 10, and 15.]
(a) Large Requests (b) Small Requests
Figure 2.11. Impact of inter-stream interference. A system size of 1 depicts narrow striping.
wide-striping yields better load-balancing, it leads to better performance as correlation in-
creases.
2.2.4 Impact of Load Skews: Trace-driven Simulations
We use the trace workloads listed in Table 2.2 to evaluate the impact of load imbalance
on the performance of narrow and wide striping. The traces have a mix of read and write
I/Os and of small and large I/Os. To illustrate, the OLTP-1 trace has a large fraction of small
writes (mean request size 2.5 KB), while the Web-Search-1 trace consists of large reads
(mean request size 15.5 KB). Our simulation setup is the same as in the previous sections, ex-
cept that each request stream is driven by traces instead of a synthetic ON-OFF process.
Due to the high percentage of writes in the OLTP streams, a cache of sufficient size re-
sulted in similar performance for both narrow and wide striping in write-back
mode; so in the following, we operate the cache in write-through mode.
To compare the performance of narrow and wide striping using these traces, we sepa-
rate each independent stream from the trace (each stream consists of all requests accessing
a volume). This pre-processing step yields 61 streams. We then eliminate 9 streams from
the search engine traces, since these collectively contained less than 1000 requests (and are
effectively inactive). We further eliminate 4 streams from the OLTP traces, as these were
found to be capacity bound. We then partition the remaining 48 streams into four sets such
that each set is load-balanced. We use the write-weighted average IOPS3 of each stream as
our load-balancing metric; in this metric, each write request is counted as four I/Os (since
each write could, in the worst case, trigger a read-modify-write operation involving four
I/O operations). Since the size of each I/O operation is relatively small, we did not consider
stream bandwidth as a criterion for load balancing.
We employ a greedy algorithm for partitioning the streams. The algorithm creates a
random permutation of the streams and assigns them to partitions one at a time, so that
each stream is mapped to the partition that results in the least imbalance (the imbalance is
defined as the difference between the loads of the most heavily-loaded and most lightly-
loaded partitions). We repeat the process (by starting with another random permutation)
until we find a partitioning that yields an imbalance of less than 1%.
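The partitioning procedure can be sketched as follows. The function names, the retry bound, and the use of "place on the currently lightest partition" as the least-imbalance rule are our own illustrative choices, not the dissertation's implementation.

```python
import random

def write_weighted_iops(read_iops, write_iops):
    # Each write counts as four I/Os (worst-case read-modify-write).
    return read_iops + 4 * write_iops

def greedy_partition(loads, nparts, tol=0.01, max_tries=1000, seed=0):
    """Repeatedly shuffle the streams and place each on the partition that
    leaves the system least imbalanced; stop once the imbalance (heaviest
    minus lightest partition load, relative to the mean partition load)
    drops below `tol`, or after `max_tries` permutations."""
    rng = random.Random(seed)
    mean_load = sum(loads) / nparts
    best = None
    for _ in range(max_tries):
        order = list(range(len(loads)))
        rng.shuffle(order)                 # a fresh random permutation
        bins = [0.0] * nparts
        assign = [0] * len(loads)
        for s in order:
            p = bins.index(min(bins))      # lightest partition so far
            bins[p] += loads[s]
            assign[s] = p
        imbalance = (max(bins) - min(bins)) / mean_load
        if best is None or imbalance < best[0]:
            best = (imbalance, assign)
        if imbalance < tol:
            break
    return best

# Example: six streams, two partitions; a perfect 8/8 split exists.
imbalance, assignment = greedy_partition([4, 3, 3, 2, 2, 2], 2)
print(imbalance)
```

The random restarts matter: a single greedy pass can get stuck, but retrying with new permutations quickly finds a near-balanced split when one exists.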
Assuming a system with four RAID-5 arrays, each configured with 4+p disks, we map
each partition to an array in narrow striping. All partitions are striped across all four arrays
in wide striping. We computed the average response time as well as the 95th percentile of
the response times for each stream in the two systems. Figure 2.12 plots the average response
time and the 95th percentile of the response time for the various streams (the X axis is
the stream id). As shown in the figure, wide striping yields average response times that
are comparable to those of a narrow-striped system. Figure 2.12(c) shows the mean disk
utilizations for the disks in the system (the X axis is the disk id). Observe that the variance in
the mean disk utilizations across the disks in the system is lower in a wide-striped system
due to better load balancing. Also, even for the case of narrow striping, the variance in
disk utilizations is low since the partitions are load-balanced (a partition comprises five
consecutive disks).
3 I/O Operations Per Second
[Figure: panel (a) plots Mean Response Time (ms) by Stream Id; panel (b) plots the 95th Percentile of the Response Time (ms) by Stream Id; panel (c) plots Mean Disk Utilization by Disk No.; each panel compares Narrow and Wide striping.]
(a) Mean Response Time (b) 95th Percentile of the Response Time (c) Mean Disk Utilizations
Figure 2.12. Trace-Driven Simulations
[Figure: panel (a) plots Mean Response Time (ms) by Stream Id; panel (b) plots the 95th Percentile of the Response Time (ms) by Stream Id; panel (c) plots Mean Disk Utilization by Disk No.; each panel compares Narrow and Wide striping.]
(a) Mean Response Time (b) 95th Percentile of the Response Time (c) Mean Disk Utilizations
Figure 2.13. Trace-Driven Simulations with Load Imbalance
Next, we introduce load imbalance across the partitions and compare the performance
of narrow and wide striping. To introduce imbalance, we simply scale the inter-arrival times
for all streams on the first two partitions (streams 0-25) by a factor of 0.75. In the narrow-
striped system this increases the load on the first two partitions (disks 0-9) while the load
on the other two partitions remains unchanged; for a wide-striped system, however, the load
across all the partitions goes up. Figure 2.13 plots the results. As can be seen in Figure 2.13(c),
the mean utilization across the first two partitions (first ten disks) has gone up in narrow
striping, while the utilization across all the partitions has gone up for wide striping (compare
with Figure 2.12(c)). A look at the plots for the average response time as well as the plot
for the 95th percentile of the response time shows that wide striping outperforms narrow
striping for streams on the first two partitions, while their performance on the remaining
two partitions is similar. Thus we see that, due to better load balancing, wide striping
outperforms narrow striping in the presence of load imbalances.
2.2.5 Experiments on a Storage System Testbed
In this section, we compare the performance of narrow and wide striping on our storage
testbed using a synthetic workload and two database benchmark workloads, TPC-C and
TPC-H. For the synthetic and TPC-H workloads, we use a FAStT-700 storage
subsystem, and for the TPC-C workload, we use an SSA-based RAID subsystem.
2.2.5.1 Synthetic Workload
The workload consists of two closed-loop streams, one large and one small, accessing
two independent stores on a RAID-5 array simultaneously. Each store was of size 2 GB
and was created on a 4+p RAID-5 array on the FAStT. For large requests, the stripe unit
size of the store was 256 KB, and for small requests it was configured to be 8 KB. We
used the Linux Logical Volume Manager (LVM) for wide-striping the stores. The mean
request size for large and small requests was chosen to be 512 KB and 8 KB, respectively.
Figure 2.14 shows the response time performance of narrow and wide striping for various
combinations of concurrency factors of the clients accessing the large and small stores,
respectively. As the experiments demonstrate, the performance of wide striping is within
10-15% of the narrow-striped system.
2.2.5.2 TPC-H Workload
TPC-H is a decision support benchmark. It was used in [5] to illustrate the benefit of
narrow striping. We use a setup similar to the one in [5], with IBM DB2 UDB instead of MS
SQL Server. We set up the TPC-H database on a 1.6 GHz Pentium 4 with 512 MB RAM
running Linux 2.4.18. This was connected to the FAStT-700 storage system using Fibre
[Figure: two bar charts plotting Mean Response Time (ms) against Client Concurrency Factor (large, small) for the combinations (1,2), (1,3), (1,4), and (2,2), comparing Narrow and Wide striping.]
(a) Large Requests (b) Small Requests
Figure 2.14. Heterogeneous Workload: Closed-loop Testbed Experiments
Channel. The page size and the extent size for the database were chosen to be 4 KB and 32
KB, respectively. The scale factor for the database was set to 1 (a 1 GB database).
For narrow striping, we used the placement described in [5]. The table lineitem was
spread uniformly over 5 disk drives, orders was spread uniformly over three other disk
drives, and all the other tables and indexes (including the indexes for the tables lineitem
and orders) were placed in a third logical volume (called rest) which was striped across all
8 disk drives. In the wide-striped case, the tables lineitem and orders were striped across
all 8 disk drives, as was the logical volume rest. In both cases, the system temporary
tables were placed on a ninth disk drive, also on the FAStT-700. The stripe unit size was
chosen to be 32 KB in all cases.
Figure 2.15(a) shows the query execution times for narrow and wide striping for a single-
stream run (power run) of TPC-H. Since this is an unaudited run, the query execution times
are normalized. As the figure demonstrates, most of the queries have similar execution
times. Only for queries 20 and 21 do we see a 5-10% performance difference; narrow
striping outperforms wide striping for query 20, and vice versa for query 21.
Figures 2.15(b) and 2.15(c) plot the I/O profiles for the lineitem, orders, and rest volumes
for narrow and wide striping, respectively. The figure demonstrates that the I/O profiles are
indeed very similar in both cases. It also demonstrates that lineitem and orders are indeed
the two important tables, and the narrow placement algorithm suggested in [5] appears to
be valid for DB2 as well. Overall, we find that when the tables are carefully mapped to
arrays in narrow striping, the two systems perform comparably (note that no placement
optimizations are necessary for wide striping).
2.2.5.3 TPC-C Workload
Our final experiment involves a comparison of narrow and wide striping using the TPC-
C workload. Our testbed consists of a four-processor IBM RS6000 machine with 512
MB RAM and AIX 4.3.3. The machine contains an SSA RAID adapter card with two
[Figure: panel (a) plots Normalized Query Times by Query ID, comparing Narrow and Wide; panels (b) and (c) plot the Normalized I/O Rate over Time (secs) for the Lineitem, Orders, and Rest volumes under narrow and wide striping, respectively.]
(a) Query Times (b) Narrow Striping: I/O Profile (c) Wide Striping: I/O Profile
Figure 2.15. Comparison using the TPC-H Benchmark
channels (also called SSA loops) and 16 9 GB disks on each channel (a total of 32 disks).
We configured four RAID-5 arrays, two arrays per channel, each in a 7+p configuration.
Whereas two of these arrays are used for our narrow-striping experiment, the other two are
used for wide striping (thus, each experiment uses 16 disks in the system). The SSA RAID
card uses a stripe unit size of 64 KB on all arrays; the value is chosen by the array controller
and cannot be changed by the system. However, as explained below, we use large requests
to emulate the behavior of larger stripe unit sizes in the system. We use two workloads in
our experiments:
• TPC-C benchmark: The TPC-C benchmark is an On-Line Transaction Processing
(OLTP) benchmark and results in mostly small random I/Os. The benchmark
consists of a mix of reads and writes (approximately two-thirds reads and one-third
writes [7]). We use a TPC-C setup with 300 warehouses and 30 clients.

• Large sequential: This is an application that reads a raw volume sequentially using
large requests. The process has an I/O loop that issues requests using the read()
system call. Since we cannot control the array stripe unit size, we emulate the effect
of large stripe units by issuing large read requests. We use two request sizes in our
experiments. To emulate a 128 KB stripe unit size, we issue 896 KB requests (since
64 KB × 7 disks = 448 KB, an 896 KB request will access two 64 KB chunks on
each disk). We also find experimentally that the throughput of the array is maximized
when requests of 448 KB × 16 = 7 MB are issued in a single read call. Hence, we
use 7 MB as the second request size (which effectively requests sixteen 64 KB chunks
from each disk).
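The request-size arithmetic in the second bullet can be checked with a few lines (an illustrative sketch; the names are ours):

```python
STRIPE_UNIT_KB = 64   # fixed by the SSA RAID controller
DATA_DISKS = 7        # data disks in the 7+p RAID-5 configuration

def chunks_per_disk(request_kb):
    """Number of 64 KB chunks a sequential read touches on each data disk,
    i.e. the stripe unit (in 64 KB units) the request effectively emulates,
    assuming the request spans whole stripes."""
    full_stripe_kb = STRIPE_UNIT_KB * DATA_DISKS  # 448 KB per full stripe
    assert request_kb % full_stripe_kb == 0, "request should span whole stripes"
    return request_kb // full_stripe_kb

print(chunks_per_disk(896))       # 2 chunks/disk -> emulates a 128 KB stripe unit
print(chunks_per_disk(7 * 1024))  # 16 chunks/disk
```

Issuing a request that is an exact multiple of the full-stripe size keeps the per-disk load uniform, which is what makes the emulation faithful.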
We first experiment with a narrow-striped system by running the TPC-C benchmark on
one array and the sequential application on the other array. We find the TPC-C through-
put to be N TpmC (the exact number is withheld since this is an unaudited run), while the
throughput of the sequential application is 25.43 MB/s for 896 KB requests and 29.45 MB/s
for 7 MB requests (see Table 2.3).
We then experiment with a wide-striped system. To do so, we create three logical
volumes on the two arrays using the AIX volume manager. Two of these volumes are
used for the TPC-C data, index, and temp space, while the third volume is used for the
sequential workload. As shown in Table 2.3, the TPC-C throughput is 1.33 N TpmC when
the sequential workload uses 896 KB requests and 0.82 N TpmC for 7 MB requests. The
corresponding sequential workload throughput is 20.09 MB/s and 36.86 MB/s, respectively.
Thus, we find that for the sequential workload, small requests favor narrow striping,
while large requests favor wide striping. For the TPC-C workload, the reverse is true, i.e., small
Striping   Sequential I/O Size   Sequential Throughput   Normalized TPC-C Throughput
Narrow     896 KB                25.43 MB/s              N TpmC
Wide       896 KB                20.09 MB/s              1.33 N TpmC
Narrow     7 MB                  29.45 MB/s              N TpmC
Wide       7 MB                  36.86 MB/s              0.82 N TpmC

Table 2.3. TPC-C and Sequential Workload Throughput in Narrow and Wide Striping
requests favor wide striping and large requests favor narrow striping. This is because the
performance of TPC-C is governed by the interference from the sequential workload. The
interference is greater when the sequential application issues large 7 MB requests, resulting
in lower throughput for TPC-C. There is less interference when the sequential application
issues 896 KB (small) requests; further, TPC-C benefits from the larger number of arrays in
the wide-striped system, resulting in a higher throughput. This behavior is consistent with
the experiments presented in previous sections. Furthermore, the performance difference
(i.e., improvement/degradation) between the two systems is around 20%, which is again
consistent with the results presented earlier.
2.3 Summary and Implications of our Experimental Results
Our experiments show that narrow striping yields better performance for small requests
when the streams can be ideally partitioned such that the partitions are load-balanced and
there is very little interference between streams within a partition. However, in the pres-
ence of the workload skews that occur in real I/O workloads, wide striping outperforms narrow
striping. In our trace-driven experiments, we found that when the average load was
balanced, wide striping performed comparably to narrow striping. However, when we in-
troduced load imbalance by increasing the load on some partitions, wide striping outper-
formed narrow striping for the streams on the heavily-loaded partitions while performing
comparably for the remaining streams. With a TPC-C workload, we found that if the stripe
unit is chosen appropriately, then narrow and wide striping have comparable performance
even though there are no workload skews, due to the "constant-on" nature of the bench-
mark. In our closed-loop testbed experiments and the TPC-H experiments, we found the
performance of narrow and wide striping to be comparable.
In situations where it is beneficial to do narrow striping, significant effort is required
to extract those benefits. First, the workload has to be determined, either from an initial
specification or by system measurement. Since narrow placement derives its benefits from ex-
ploiting the correlation structure between streams, the characteristics of the streams as well
as the correlations between the streams need to be determined. It is not known whether
stream characteristics or the inter-stream correlations are stable over time. Hence, if the
assumptions made by the narrow placement technique change, then load imbalances and
hot-spots may occur. These hot-spots have to be detected and the system re-optimized us-
ing techniques such as [9]. This entails moving stores between arrays to achieve a new
layout [39]. The process of data movement itself has overheads that can affect per-
formance. Furthermore, data migration techniques are only useful for long-term or per-
sistent workload changes; short-time-scale hot-spots that occur in modern systems cannot
be effectively resolved by such techniques. Thus, it is not apparent that it is possible to ex-
tract the benefits of narrow striping for dynamically changing (non-stationary) workloads.
Storage systems that employ narrow striping [7, 9] have only compared performance with
manually-tuned narrow-striped systems. While these studies have shown that such sys-
tems can perform comparably to or outperform human-managed narrow-striped systems, no
comprehensive comparison with wide striping was undertaken in these efforts.
In contrast to narrow striping, which requires detailed workload knowledge, the only
critical parameter in wide striping seems to be the stripe unit size. Our experiments high-
light the importance of choosing an appropriate stripe unit for each store in a wide-striping
system (for example, large stripe units for streams with large requests). While an optimal
stripe unit size may itself depend on several workload parameters, our preliminary experi-
ments indicate that choosing the stripe unit size based on the average request size is a good
rule of thumb. For example, in our experiments, we chose the stripe unit to be half the
average request size. Detailed analytical and empirical models for determining the optimal
stripe unit size also exist in the literature [20, 21, 50].
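The rule of thumb above amounts to a one-line policy. The sketch below applies it to the large-request store from our synthetic experiment; the function name and the power-of-two rounding (a common array constraint, not something the text prescribes) are our own additions.

```python
def stripe_unit_kb(avg_request_kb):
    """Rule of thumb: stripe unit = half the average request size,
    rounded down to a power of two (our assumed array constraint)."""
    target = max(1, avg_request_kb // 2)
    unit = 1
    while unit * 2 <= target:
        unit *= 2
    return unit

print(stripe_unit_kb(512))  # 256 KB, as used for the large-request store
```

A policy like this needs only the average request size per store, in contrast to the detailed stream characterization that narrow placement requires.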
For storage management, one must also consider issues unrelated to performance when
choosing an appropriate object placement technique. For example, system growth has dif-
ferent consequences for narrow and wide striping. In the case of narrow striping, when ad-
ditional storage is added, data does not necessarily have to be moved; data needs to move
only to ensure optimal placement. In the case of wide striping, data on all stores needs to be re-
organized to accommodate the new storage. Although this functionality can be automated
and implemented in the file system, volume manager, or RAID controllers without requiring
application down-time, the impact of this issue depends on the frequency of system growth.
In enterprise environments, system growth is usually governed by purchasing cycles, which
are long. Hence, we expect this to be an infrequent event and not a significant issue
for wide striping. In environments where system growth is frequent, however, such data
reorganizations can impose a large overhead.
A storage system may also be required to provide different response time or throughput
guarantees to different applications. The choice between narrow and wide striping in such
a case would depend on the Quality of Service (QoS) control mechanisms that are available
in the storage system. For example, if appropriate QoS-aware disk scheduling mechanisms
exist in the storage system [54], then it may be desirable to do wide striping. If no QoS
control mechanisms exist, a system can either isolate stores using narrow striping, or group
stores with similar QoS requirements, partition the system based on the storage requirements
of each group, and wide-stripe each group within its partition.
A final issue is system reliability. In narrow striping, when multiple disks fail on a
RAID array, only the stores mapped onto that array are rendered unavailable. In contrast, all
stores are impacted by the failure of any one RAID array in wide striping. The overall
choice between wide and narrow striping will be dictated by a combination of the above
factors.
2.4 Related Work
The design of self-managing storage systems was pioneered by [7, 10, 9, 39], where
techniques for automatically determining the storage system configuration were studied. This
work determines: (1) the number and types of storage systems that are necessary to support
a given workload, (2) the RAID levels for the various objects, and (3) the placement of the
objects on the various arrays. The placement technique is based on narrow striping. It
exploits access correlations between streams and collocates bandwidth-bound and space-
bound objects to determine an efficient placement. The focus of our work is different; we
assume that the number of storage arrays as well as the RAID levels are predetermined, and
study the suitability of wide and narrow striping for storage systems.
Analytical and empirical techniques for determining file-specific stripe units, placing
files on disk arrays, and cooling hot-spots have been studied in [20, 21, 37, 50]. Our
work addresses a related but largely orthogonal question: the benefits of wide and narrow
striping for storage systems.
While much of the research literature has implicitly assumed narrow striping, at least
one database vendor has recently advocated wide striping due to its inherent simplicity [38].
A cursory evaluation of wide striping combined with mirroring, referred to as Stripe and
Mirror Everything Everywhere (SAME), has been presented in [1]; that work uses a simple
storage system configuration to demonstrate that wide striping can perform comparably to
narrow striping. To the best of our knowledge, ours is the first work that systematically
evaluates the tradeoffs of wide and narrow striping.
2.5 Concluding Remarks
Storage management cost is a significant fraction of the total cost of ownership of enter-
prise systems. Consequently, software automation of common storage management tasks
so as to reduce the total cost of ownership is an active area of research. In this chapter,
we focused on the problem of storage space allocation. We studied two fundamentally
different storage allocation techniques: narrow and wide striping. Whereas wide striping
techniques need very little workload information for making placement decisions, narrow
striping techniques employ detailed information about the workload to optimize the place-
ment and achieve better performance. We systematically evaluated this trade-off between
simplicity and performance. Using synthetic and real I/O workloads, we found that an ide-
alized narrow striped system can outperform a comparable wide-striped system for small
requests. However, wide striping outperforms narrow striped systems in the presence of
workload skews that occur in real systems; the two systems perform comparably for a va-
riety of other real-world scenarios. Our experiments demonstrate that the additional work-
load information needed by narrow placement techniques may not necessarily translate to
better performance. Based on our results, we advocate narrow striping only when (i) the
workload can be characterized precisely a priori, and (ii) it is feasible to use data migration
to handle workload skews and workload interference. In general, we argue for simplicity
and recommend that (i) storage systems use wide striping for object placement, and (ii) sufficient
information be specified at storage allocation time to enable appropriate selection of
the stripe unit size.
CHAPTER 3
SELF-MANAGING BANDWIDTH ALLOCATION IN AMULTIMEDIA FILE SERVER
In Chapter 2 we evaluated different placement techniques to determine their suitability
for a self-managing storage system. Some management tasks, however, require attention
on a continual basis. In this chapter, we focus on automating one such short-term reconfig-
uration task, namely bandwidth allocation.
Placement of data objects is the first task faced by a storage system administrator. Large
scale storage systems host data objects of multiple types which are accessed by applica-
tions with diverse service requirements. By partitioning disk bandwidth between appli-
cation classes one can (i) align the service provided with the application requirements,
and (ii) protect application classes from one another. A number of rate-based schedulers
that support class-based bandwidth reservations have been proposed [13, 42, 43, 54, 60].
However, since the workload changes dynamically, a static reservation may not be appropriate.
For better application performance, it is desirable that bandwidth be dynamically
reallocated to the various classes as the workload changes. In this chapter, we focus on the specific
problem of self-managing bandwidth allocation in a multimedia file-server. By a multime-
dia file-server, we mean one that services a heterogeneous mix of conventional best-effort
and soft real-time streaming media workloads (as opposed to continuous media servers
that solely service streaming media workloads). By self-managing bandwidth allocation,
we mean techniques to monitor the file server workload and dynamically reallocate band-
width to various classes for improved application performance. We develop and evaluate
a measurement-based inference technique to address the problem. Note that since such a
[Figure: three panels showing text and streaming media (SM) files on disks under (a) best-effort service, (b) mutually-exclusive storage, and (c) reservations.]
Figure 3.1. Three techniques for supporting multiple application classes at a file server.
technique requires continual workload monitoring and bandwidth reallocation, it classifies
as a short-term reconfiguration task.
3.1 Self-Managing Bandwidth Allocation: Problem Definition
Consider a file server that services both streaming media and traditional best-effort requests.
Most modern file servers belong to this category—they service requests for a mix
of streaming media, image and textual data (as anecdotal evidence, consider users who
store MP3 audio files and digital images along with traditional textual/numeric documents
in their home directories). The workload serviced by such a file server can be broadly
classified into two categories: best-effort and soft real-time. The best-effort class comprises
requests for traditional text/numeric and image data. Applications in this class need low
average response times or high aggregate throughput, but do not require any performance
guarantees. In contrast, the soft real-time class comprises requests for streaming media
data; applications in this class impose deadlines that must be met but can tolerate an occasional
violation of these deadlines. Since the two classes have different characteristics and
performance requirements, modern file servers must address the challenge of reconciling
this heterogeneity.
A file server can employ one of three different techniques for managing these two
classes (see Figure 3.1).

• Best-effort service: In the simplest case, the file server does not employ any specialized
techniques for managing the two classes and provides a simple best-effort
service to both textual and streaming media requests. In such a scenario, the perfor-
mance requirements of soft real-time requests can be met only by over-engineering
the capacity of the server and running the server at low utilization levels. Since file
server workloads are often bursty [27], performance guarantees of real-time requests
are violated if a transient increase in the workload causes saturation. Another limita-
tion is that requests from the two classes can interfere with one another—a burst of
real-time requests can starve best-effort requests and vice versa. Due to these limi-
tations, the overall utility of this approach to streaming media applications is often
unsatisfactory.

• Mutually exclusive storage: An alternate approach is to store files from the two
application classes on a mutually exclusive set of disks. Such a static partitioning of
storage resources precludes the possibility of interference between the two classes.
Moreover, guarantees of soft real-time requests can be met by employing simple ad-
mission control algorithms. Although conceptually simple, this approach has certain
limitations. In particular, this approach is feasible only so long as the placement
of files on disks can be carefully controlled (to ensure mutually exclusive storage
of files). Unless the mapping of files to disks can be transparently handled by the
file system, placing restrictions on end-users that dictate where to store each type of
file is cumbersome, since users are used to the simplicity of creating and grouping
arbitrary files in their directories. A more serious problem is that of performance—
studies have shown that the static partitioning of storage space and disk bandwidth
required by this approach results in up to a factor of six loss in performance (due to
the lack of statistical multiplexing) [53].
• Reservation-based approach: A third approach is to share storage space among
the two classes but reserve a certain fraction of the bandwidth on each disk for each
class (i.e., store files from both classes on all the disks but reserve disk bandwidth for
each class). By sharing storage resources, the file server can extract statistical mul-
tiplexing gains; by reserving bandwidth, it can prevent interference among classes
and meet the performance guarantees of the soft real-time class. Thus, a reservation-based
approach overcomes the limitations of the previous two approaches. Let $R_{rt}$
denote the fraction of the bandwidth reserved for the soft real-time class; the remaining
fraction $R_{be} = 1 - R_{rt}$ is used (reserved) for the best-effort class. The challenge
in designing a reservation-based approach lies in determining an appropriate partitioning
$R_{rt}$ and $R_{be}$ such that both classes see acceptable performance (i.e., meet
the deadlines of real-time requests while providing low average response times for
best-effort requests). Modern file systems such as SGI's XFS [31] and IBM's Tiger
Shark [29] support the notion of reservations. XFS, for instance, does so using its
guaranteed-rate I/O feature [31].
Due to the inherent advantages and flexibility of the reservation-based approach we
assume a file server that supports bandwidth reservations for each class.
There are several approaches for determining the aggregate bandwidth reservation for
each class. In the simplest case, the partitioning of bandwidth among the two classes can
be done manually. This can be done using past observations or future estimates of the
load to determine the long-term usage in each class. Whereas this approach is feasible on
the time-scale of days, short-term variations on the time-scale of tens of minutes or hours
cannot be handled by the approach (since this would involve frequent manual intervention).
Further, since the partitioning must be recomputed every so often to account for long-term
variations in the load within each class, the possibility of human error cannot be completely
eliminated.
An alternate approach is to automate the monitoring of the workload within each class
and dynamically partition the bandwidth among the two classes. We refer to such an approach
as self-managing bandwidth allocation. By actively monitoring the load, the approach
can react to workload changes on the time-scale of minutes or hours. Furthermore,
the approach can also handle transient overloads in the system and ensure stable overload
behavior. A limitation of the approach, however, is that it increases the complexity of
the file server. The design of such a self-managing bandwidth allocator involves two key
challenges: (i) the design of efficient workload monitoring techniques that have a minimal
impact on overall system performance, and (ii) the design of adaptive techniques that use
past workload statistics to dynamically determine bandwidth allocation for the two classes.
We first address the simpler problem of self-managing bandwidth allocation in a single
disk file server and then use these insights to design a self-managing bandwidth allocator
for multi-disk servers.
3.2 Self-Managing Bandwidth Allocation in a Single Disk Server
In this section, we first present the system model assumed in our research. We then out-
line the requirements that must be met by a self-managing bandwidth allocator and finally
present an outline of the bandwidth allocation technique that meets these requirements.
3.2.1 System Model
Consider a single disk file server that services two classes of applications—best-effort
and soft real-time. Let us assume that the server reserves a certain fraction of the disk
bandwidth for each application class. Let $R_{be}$ and $R_{rt}$ denote the reserved fractions,
respectively, $0 \le R_{be}, R_{rt} \le 1$ and $R_{be} = 1 - R_{rt}$. Given the reservations $R_{be}$ and $R_{rt}$, we
assume that the file server employs a disk scheduling algorithm that can enforce these allocations.
A number of rate-based schedulers that support class-based bandwidth reservations
have been proposed [13, 42, 43, 54, 60]. Any such scheduler is suitable for our purpose
(since our bandwidth allocator does not make any specific assumptions about the scheduling
algorithm). It is possible that the scheduler may itself further partition the bandwidth
allocated to a class among individual applications. We are only concerned about the ag-
gregate bandwidth needs of each class; the partitioning of this aggregate among individual
applications is an orthogonal issue.
3.2.2 Requirements
Assuming the above system model, consider a bandwidth allocation technique that dynamically
determines the fractions $R_{be}$ and $R_{rt}$ based on the load in each class. Such a
self-managing bandwidth allocator should meet four key requirements.

• Time-scale of allocation and monitoring: Depending on the environment, bandwidth
allocation can be performed on time-scales ranging from a few minutes
to tens of hours. Allocating bandwidth on (small) time-scales of minutes allows the
server to respond to short-term variations in the load but can result in frequent fluctuations
in the allocations. In contrast, allocating bandwidth on large time-scales (e.g.,
hours or days) allows the server to focus on long-term trends in the workload while
effectively ignoring short-term variations. Depending on the environment, small
time-scale or large time-scale allocation or both may be necessary. A bandwidth
allocator should allow a server administrator to specify the time-scale(s) of interest
and recompute allocations based on this specification.

• Control over allocations: In addition to control over the time-scale of allocations,
the bandwidth allocator should allow control over the allocation itself. Allocating
bandwidth solely based on past usage can be problematic. For instance, if the applications
in a certain class are idle, the class's allocation can shrink to zero, resulting in starvation
for future applications. To avoid such situations, the bandwidth allocator should permit
the server administrator to specify constraints on the allocations. This could be
done, for instance, by specifying a set of rules that govern the actual allocations.
[Figure: measurements recorded every $I$ time units over a sliding window of size $W$.]
Figure 3.2. A Moving Histogram

• Stable overload behavior: A bandwidth allocator should exhibit stable behavior
even in the presence of transient overloads. Since the capacity of the server is exceeded
during an overload, bandwidth allocation by itself cannot remedy the situation.
However, the allocator can (and should) make intelligent allocation decisions
that prevent unstable system behavior during overloads.

• Exploit the semantics of each class: Requests within the best-effort class desire
low average response times, while those within the real-time class have associated
deadlines that must be met. Since the two classes have different performance re-
quirements, the allocator should exploit the semantics of each class and use different
criteria to allocate bandwidth to these classes. This can be achieved, for instance, by
using the average load to determine the allocation of the best-effort class and the tail
of the load distribution to determine the allocation of the real-time class.
Next we present our workload monitoring module and our adaptive bandwidth manager
that meet these requirements.
3.2.3 Monitoring the Workload in the Two Classes
The workload monitoring module tracks several parameters (listed below) that are rep-
resentative of the load within each class; the bandwidth manager then uses these parameters
to compute the allocation of each class. For each such parameter, the monitoring module
computes a probability distribution using the concept of a moving histogram. A moving
histogram is simply a histogram computed over a moving time window and is characterized
by two parameters: the window size $W$ and the measurement interval $I$ (see Figure 3.2).
The window size $W$ determines the interval of time over which the histogram is computed.
Data values are recorded into the histogram every $I$ time units. Thus, the parameter of
interest is monitored over the measurement interval $I$ and the mean value of that parameter
over that interval is recorded into the histogram. The least recent value is then dropped
from the histogram, effectively sliding the window by $I$ time units. Thus, each histogram
has $\lfloor W/I \rfloor$ data samples. By carefully choosing $W$ and $I$, it is possible to
exercise control over the time-scale over which the load is monitored.
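As an illustration, the moving histogram described above can be sketched in a few lines of Python (the class and method names are ours and hypothetical, not taken from the dissertation's implementation):

```python
from collections import deque

class MovingHistogram:
    """Histogram over a sliding window of size W, sampled every I time
    units; holds floor(W / I) samples. A sketch with illustrative names."""

    def __init__(self, window_w, interval_i):
        self.capacity = window_w // interval_i
        # deque(maxlen=...) drops the least recent value automatically,
        # sliding the window by I time units per recorded sample.
        self.samples = deque(maxlen=self.capacity)

    def record(self, mean_value):
        # Called once per interval I with the mean of the monitored
        # parameter over that interval.
        self.samples.append(mean_value)

    def percentile(self, p):
        # Nearest-rank percentile (p in [0, 100]) over the current window.
        ordered = sorted(self.samples)
        rank = min(len(ordered) - 1, int(p / 100.0 * len(ordered)))
        return ordered[rank]

    def median(self):
        return self.percentile(50)
```

With $W = 60$ and $I = 10$, for example, the histogram keeps the six most recent interval means; the bandwidth manager reads its median or a high percentile when recomputing allocations.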
The monitoring module tracks various aspects of resource usage from the time a request
arrives to the time it is serviced by the disk. Monitored parameters include request arrival
rates, request waiting times and disk utilizations within each class (see Figure 3.3).

• Request arrival rates: Over each interval $I$, the module monitors the number of
request arrivals in each class (denoted by $N_{be}$ and $N_{rt}$) and the request sizes
($S_{be}$ and $S_{rt}$). The number of arrivals and the mean request size in that interval
are then recorded into moving histograms.

• Request waiting times: Rather than monitoring the actual request waiting times, our
monitoring module uses queue lengths as an indicator of the time each request waits
in the system before it is serviced—the larger the queue of outstanding requests, the
greater the waiting time. This is achieved by recording the instantaneous queue lengths
of the two classes (denoted by $q_{be}$ and $q_{rt}$) at the end of each interval $I$.

• Disk utilizations: The module uses the disk utilizations as a measure of the actual
bandwidth consumed by each class. The utilization of a class is defined to be the
fraction of the time spent by the disk in servicing requests from that class. It is
computed as $U_{be} = \frac{\sum_j \tau_j^{be}}{I}$ and $U_{rt} = \frac{\sum_j \tau_j^{rt}}{I}$, where $\tau_j^{be}$ and $\tau_j^{rt}$ denote the time spent by
the disk in servicing an individual best-effort and soft real-time request, respectively.
[Figure: request arrivals, request wait times, and request service times, with the parameters monitored at each stage: number of requests (N) and request sizes (S); instantaneous queue lengths (q); disk utilizations (U).]
Figure 3.3. Parameters tracked by the monitoring module
The utilizations within each class are then recorded into moving histograms at the
end of each interval $I$.
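The per-interval bookkeeping described in these bullets can be sketched as follows (a simplified sketch; the class and field names are our assumptions, not the dissertation's code):

```python
class ClassMonitor:
    """Per-interval accounting for one application class (best-effort or
    soft real-time). Names are illustrative."""

    def __init__(self, interval_i):
        self.interval_i = interval_i
        self.reset()

    def reset(self):
        self.num_requests = 0    # N: arrivals in the current interval
        self.total_size = 0      # accumulates request sizes for S
        self.busy_time = 0.0     # sum of per-request service times tau_j

    def on_arrival(self, size):
        self.num_requests += 1
        self.total_size += size

    def on_service(self, service_time):
        self.busy_time += service_time

    def end_of_interval(self):
        # U = (sum_j tau_j) / I; mean request size over the interval.
        utilization = self.busy_time / self.interval_i
        mean_size = self.total_size / self.num_requests if self.num_requests else 0
        stats = (self.num_requests, mean_size, utilization)
        self.reset()             # start the next measurement interval
        return stats
```

At the end of each interval, the returned triple would be recorded into the corresponding moving histograms for $N$, $S$, and $U$.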
3.2.4 Adapting the Allocation of Each Class
The bandwidth manager uses the histograms computed by the monitoring module to
periodically recompute the bandwidth allocation (reservation) of each class. The manager
provides control over the time-scale of allocation using a parameter $P$ that defines the
period of these recomputations. Recall that the monitoring module uses a window size $W$
for each moving histogram. In general, the recomputation period $P$ can be smaller or larger
than $W$. If allocations are recomputed more frequently than $W$ (i.e., $P < W$), then some
measurements used in the previous computations are reused to compute the new allocations
(since those measurements would still be contained in the window $W$ of the histogram).
In contrast, if $P > W$, then some load measurements are never taken into account for
computing the allocations. Consequently, using $P = W$ is a good rule of thumb to ensure
a responsive file server. In the rest of this chapter, we assume $P = W$.
The bandwidth manager uses a rule-based system to provide control over the allocation
to each class. Such a rule-based system supports a set of user-defined rules that govern
these allocations. Our bandwidth manager currently supports rules that specify upper and
lower bounds for each class. That is, a server administrator can specify bounds (denoted
by $[R_{be}^{min}, R_{be}^{max}]$ and $[R_{rt}^{min}, R_{rt}^{max}]$) on the bandwidth allocated to each class. Bounds on
allocations are useful to prevent scenarios where a class receives either too little or too
much bandwidth (without such bounds, the allocation of a class could shrink to zero if
the class is idle, causing starvation for newly arriving requests).
Given the recomputation period $P$ and bounds on the allocation of each class, the bandwidth
manager estimates the bandwidth needs of each class using two metrics: (i) disk
utilizations and (ii) request arrival rates.
3.2.4.1 Estimating Bandwidth Requirement based on Disk Utilizations
The bandwidth manager uses the moving histograms of the disk utilizations to estimate
the bandwidth needs of each class. Since the two classes have different performance
characteristics, a different metric is used to compute these estimates. In case of the best-effort
class, the bandwidth manager uses the median of the utilization distribution, denoted by $Median(U_{be})$, as an estimate of the bandwidth requirement (this is because requests in this
class desire low average response times). In contrast, a high percentile of the utilization,
denoted by $Per(U_{rt})$, is used to estimate the requirements of the real-time class (since the
tail of the distribution better reflects the needs of real-time requests). The exact percentile
used to estimate the bandwidth requirements can be chosen statically or dynamically. In
the latter case, the percentile could be a function of the variance in the load—the greater
the variance, the higher the percentile used to estimate the bandwidth requirements. To
illustrate, the percentile can be chosen as $base\_percentile + \log(C_v)$, where $C_v$ is the coefficient
of variation and is computed as $C_v = \sigma(U_{rt})/E(U_{rt})$; $E$ and $\sigma$ are the mean and the
standard deviation of the distribution.
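The dynamic percentile selection just described can be sketched as follows. Two details are not fixed by the text and are therefore assumptions here: we use the natural logarithm, and we fall back to the base percentile when $C_v \le 1$ (where $\log(C_v) \le 0$):

```python
import math

def choose_percentile(base_percentile, utilization_samples):
    """Percentile used for the real-time class: base_percentile + log(Cv),
    with Cv = sigma(U_rt) / E(U_rt). Sketch; log base and the behavior
    for Cv <= 1 are our assumptions."""
    n = len(utilization_samples)
    mean = sum(utilization_samples) / n
    var = sum((u - mean) ** 2 for u in utilization_samples) / n
    cv = math.sqrt(var) / mean if mean > 0 else 0.0
    if cv <= 1.0:
        return base_percentile          # log(cv) <= 0: keep the base
    return min(100.0, base_percentile + math.log(cv))
```

A steady workload thus keeps the base percentile, while a highly variable one pushes the estimate further into the tail of the utilization distribution.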
After computing these utilizations, the bandwidth manager uses an exponential smoothing
function to weigh the current estimate with past estimates. That is,

$$Median^*(U_{be}) = \alpha \cdot Median(U_{be}) + (1 - \alpha) \cdot Median^*(U_{be}) \quad (3.1)$$

and

$$Per^*(U_{rt}) = \alpha \cdot Per(U_{rt}) + (1 - \alpha) \cdot Per^*(U_{rt}) \quad (3.2)$$

where $\alpha$ is the exponential smoothing parameter, $0 \le \alpha \le 1$. A large value of $\alpha$ biases
the estimates towards the immediate past measurements, whereas a small $\alpha$ reduces the
contribution of recent measurements.
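Equations (3.1) and (3.2) amount to standard exponential smoothing, which can be illustrated with a short sketch (the initial estimate and the value of $\alpha$ below are arbitrary):

```python
def smooth(current, previous, alpha):
    """Exponential smoothing per Eqs. (3.1)-(3.2): alpha weighs the
    newest measurement against the running estimate."""
    return alpha * current + (1.0 - alpha) * previous

# Running the smoother over three interval measurements (values arbitrary):
estimate = 0.5                          # assumed initial estimate
for measured in [0.6, 0.6, 0.6]:
    estimate = smooth(measured, estimate, alpha=0.5)
# the estimate converges geometrically from 0.5 toward 0.6
```

With $\alpha = 0.5$ the estimate moves halfway toward each new measurement; a smaller $\alpha$ would track the long-term trend more slowly.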
3.2.4.2 Estimating Bandwidth Requirement based on the Arrival Rate
Whereas the actual disk utilization is a good indicator of the needs of each class when
the disk is not saturated (no overload), a different metric is needed during periods of
transient overloads. This is because the total disk utilization is always 100% during an overload
and no longer reflects the relative needs of each class. Consequently, the bandwidth man-
ager uses request arrival rates to estimate the bandwidth needs of each class during transient
overloads. In general, a class with larger arrival rates should be allocated a larger propor-
tion of the disk bandwidth. Observe that since the capacity of the disk is exceeded during
an overload, no allocation can actually satisfy the total bandwidth needs of the two classes.
In such a scenario, the goal of the bandwidth manager should be to ensure stable overload
behavior and ensure that the allocations reflect the relative needs of the two classes.
To estimate the bandwidth needs based on arrival rates, the bandwidth manager first
computes the number of requests arriving in each class and the request size and uses a
simple disk model to estimate the bandwidth needs. As in the case of disk utilization,
exponentially smoothed values of the median and a high percentile of these distributions
are used for the best-effort and real-time class, respectively. Thus, the bandwidth needs of
the best-effort class are computed as

$$B_{be} = Median^*(N_{be}) \cdot \left(t_{seek} + t_{rot} + \frac{Median^*(S_{be})}{t_{xfr}}\right) \quad (3.3)$$

and those of the soft real-time class are computed as

$$B_{rt} = Per^*(N_{rt}) \cdot \left(t_{seek} + t_{rot} + \frac{Per^*(S_{rt})}{t_{xfr}}\right) \quad (3.4)$$

where $t_{seek}$, $t_{rot}$ and $t_{xfr}$ denote the average seek overhead, average rotational latency and
the data transfer rate of the disk, respectively. Note that the first term in the above expres-
sion represents the number of disk requests, while the second term represents the time to
service each disk request.
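Equations (3.3) and (3.4) can be illustrated with a small sketch; the disk parameters below are illustrative values, not measurements from the dissertation's testbed:

```python
def bandwidth_estimate(num_requests, mean_size, t_seek, t_rot, t_xfr):
    """Eqs. (3.3)-(3.4): estimated busy time = number of requests times
    the per-request service time (seek + rotation + transfer).
    Times in seconds, sizes in bytes, t_xfr in bytes per second."""
    per_request_time = t_seek + t_rot + mean_size / t_xfr
    return num_requests * per_request_time

# Illustrative values only: 5 ms seek, 3 ms rotation, 40 MB/s transfer.
# The best-effort class uses smoothed medians of (N, S); the real-time
# class uses smoothed high percentiles.
b_be = bandwidth_estimate(100, 8192, t_seek=0.005, t_rot=0.003, t_xfr=40e6)
b_rt = bandwidth_estimate(40, 65536, t_seek=0.005, t_rot=0.003, t_xfr=40e6)
```

The resulting values are busy-time estimates per measurement interval; during overload only their ratio matters, as the next section shows.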
3.2.4.3 Computing the Reservations of Each Class
The bandwidth manager begins by initializing the allocation of each class to a user-specified
value ($R_{be}^{init}$ and $R_{rt}^{init}$). After each interval of $P$ time units, the bandwidth
manager estimates the bandwidth needs of each class, using the techniques described above,
and then computes the new allocations using the following algorithm.

• Case 1: Neither class utilizes its entire allocation. This scenario occurs when
$Median^*(U_{be}) < R_{be}$ and $Per^*(U_{rt}) < R_{rt}$. Since neither class is utilizing its
entire allocation, no action is necessary. Hence, the allocations of the two classes
remain unchanged.

• Case 2: The best-effort class utilizes its entire allocation. This scenario occurs when
$Median^*(U_{be}) \ge R_{be}$ and $Per^*(U_{rt}) < R_{rt}$. Since the best-effort class utilizes or
exceeds its allocated share¹ and the real-time class is under-utilized, the bandwidth
manager should increase the allocation of the best-effort class (and correspondingly
decrease the allocation of the real-time class). This is achieved by setting

$$R_{be}^{new} = Median^*(U_{be}) \quad (3.5)$$

The allocation of the real-time class is then set to $R_{rt}^{new} = 1 - R_{be}^{new}$.

• Case 3: The real-time class utilizes its entire allocation. In this scenario,
$Median^*(U_{be}) < R_{be}$ and $Per^*(U_{rt}) \ge R_{rt}$. Since the load in the real-time class
equals or exceeds its allocation, the allocation of this class should be increased
appropriately. Consequently, the bandwidth manager sets the new allocation of the
class to

$$R_{rt}^{new} = Per^*(U_{rt}) \quad (3.6)$$

The allocation of the best-effort class is set to $R_{be}^{new} = 1 - R_{rt}^{new}$.

• Case 4: Overload. An overload is said to occur when both classes use up their
entire allocations (resulting in saturation) or the queue of pending requests exceeds a
threshold. That is, (i) $Median^*(U_{be}) \ge R_{be}$ and $Per^*(U_{rt}) \ge R_{rt}$; or (ii) $q_{be} \ge Q$ or $q_{rt} \ge Q$, where $Q$ is a large threshold. Since disk utilizations are not representative
of the relative requirements of the two classes during an overload, the bandwidth
manager uses the request arrival rate to compute the allocation of each class. Given
the bandwidth estimates, $B_{be}$ and $B_{rt}$, based on arrival rates, the new allocations are
computed as

$$R_{be}^{new} = \frac{B_{be}}{B_{be} + B_{rt}} \quad (3.7)$$
¹Depending on the scheduling algorithm, an application class might use more bandwidth than its reserved share. This happens when the other class is under-utilized and the scheduler reallocates unused bandwidth to needy applications in the first class.
and

$$R_{rt}^{new} = \frac{B_{rt}}{B_{be} + B_{rt}} \quad (3.8)$$
As explained earlier, the use of the relative bandwidth needs of the two classes to
compute allocations results in more stable overload behavior.
The above allocations are then constrained (if necessary) using the user-specified bounds
$[R_{be}^{min}, R_{be}^{max}]$ and $[R_{rt}^{min}, R_{rt}^{max}]$.

Our adaptive algorithm has the following salient features: (1) it provides control over
the time-scale of monitoring and allocation via two tunable parameters: $P$ ($= W$) and $\alpha$
(in general, larger recomputation periods and smaller values of $\alpha$ bias the allocator to long-term
variations in the load), (2) it allows control over the allocation via a set of rules to constrain
the allocation, (3) it employs techniques to provide stable overload behavior, and (4) it
exploits the semantics of each class by using different metrics (median and percentiles of
the distribution) to estimate bandwidth needs. Thus, the bandwidth allocator meets all of
the requirements outlined in Section 3.2.2.
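The four cases can be combined into a single recomputation step, sketched below (the function signature, the default bounds, and applying the bounds only to the best-effort share are our assumptions; a full implementation would also reconcile the real-time bounds):

```python
def recompute_allocations(med_u_be, per_u_rt, r_be, r_rt,
                          q_be, q_rt, big_q, b_be, b_rt,
                          be_bounds=(0.0, 1.0)):
    """One recomputation step (Cases 1-4 of Section 3.2.4.3).

    med_u_be / per_u_rt: smoothed Median*(U_be) and Per*(U_rt);
    r_be / r_rt: current reservations; q_be / q_rt: queue lengths;
    big_q: overload threshold Q; b_be / b_rt: arrival-rate estimates."""
    overload = (med_u_be >= r_be and per_u_rt >= r_rt) \
        or q_be >= big_q or q_rt >= big_q
    if overload:                      # Case 4: split by relative need
        new_be = b_be / (b_be + b_rt)
    elif med_u_be >= r_be:            # Case 2: grow the best-effort share
        new_be = med_u_be
    elif per_u_rt >= r_rt:            # Case 3: grow the real-time share
        new_be = 1.0 - per_u_rt
    else:                             # Case 1: leave allocations unchanged
        new_be = r_be
    # Constrain with administrator-specified bounds on the best-effort share.
    new_be = min(max(new_be, be_bounds[0]), be_bounds[1])
    return new_be, 1.0 - new_be
```

Checking the overload condition first ensures that, once the disk saturates, the allocations follow the relative arrival-rate estimates rather than the saturated utilizations.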
In what follows, we show how to enhance this technique to allocate bandwidth in multi-
disk servers.
3.3 Self-Managing Bandwidth Allocation in a Multi-disk Server
Due to the sheer volume of data stored on servers, modern file servers employ multiple
disks or disk arrays as their underlying storage medium. A multi-disk server can employ
one of two placement techniques to store files—each file can be mapped to a single disk or
the server can employ striping to interleave the storage of a file across multiple disks. In the
former case, the load on each disk is independent of the load on remaining disks, whereas
in the latter case the loads on the disks are related to one another. It is trivial to extend our
self-managing bandwidth allocation technique to multi-disk servers where each file maps
onto a single disk—since the disk loads are independent, theallocator can monitor a disk
and allocate bandwidth independently of other disks. A different technique is needed when
files are striped across multiple disks or when it is desirable to treat multiple independent
disks as a single logical storage device for purposes of bandwidth allocation.
One possible approach is to monitor the load on each disk and first compute the allocation on individual disks using the algorithm described in Section 3.2.4.3. The actual allocation of each class is then set to the mean allocation over all disks in the array. Whereas such an approach results in satisfactory performance for the best-effort class, it can adversely affect the performance of the real-time class. This is because the load on the various disks can differ, and using the average load to determine the allocation of the real-time class can penalize requests accessing heavily loaded disks. An alternate approach is to set the allocation of each class to that on the most heavily loaded disk in the system. However, a problem with this approach is that the load on the most heavily loaded disk can differ significantly from that on the average loaded disk, and using the load on the former to govern the allocation on the latter can cause a mismatch between the allocation and the actual load (thereby defeating the purpose of bandwidth allocation). Thus, neither approach is satisfactory for allocating bandwidth on a disk array.
In what follows, we present a hybrid approach that takes into account the load on the heavily loaded disks as well as the average load to compute the allocations of the two classes. We use the same notation as in the single-disk case, with an additional superscript to denote a particular disk (thus $R^i_{be}$ denotes the allocation of the best-effort class on disk $i$). Based on the load parameters tracked by the monitoring module, we first compute the allocations on individual disks as follows:

$$R^i_{be} = \begin{cases} \mathrm{Median}(U^i_{be}) & \text{if no overload} \\ \frac{B^i_{be}}{B^i_{rt} + B^i_{be}} & \text{if disk } i \text{ is overloaded} \end{cases} \quad (3.9)$$

and

$$R^i_{rt} = \begin{cases} \mathrm{Per}(U^i_{rt}) & \text{if no overload} \\ \frac{B^i_{rt}}{B^i_{rt} + B^i_{be}} & \text{if disk } i \text{ is overloaded} \end{cases} \quad (3.10)$$
The average allocation of the best-effort class across all disks is then $R^{avg}_{be} = \mathrm{avg}(R^1_{be}, R^2_{be}, \ldots, R^D_{be})$ and the maximum allocation of the class on any disk is $R^{max}_{be} = \max(R^1_{be}, R^2_{be}, \ldots, R^D_{be})$, where $D$ denotes the number of disks in the array. The average and the maximum allocations of the real-time class across all disks can be computed similarly. The bandwidth manager then computes the allocation of each class as a linear combination of the average and the maximum load. That is,

$$R_{be} = \beta \cdot R^{max}_{be} + (1 - \beta) \cdot R^{avg}_{be} \quad (3.11)$$

where the parameter $\beta$, $0 \leq \beta \leq 1$, determines the contribution of the average and the maximum load to the final allocation. Similarly, the allocation of the real-time class is

$$R_{rt} = \beta \cdot R^{max}_{rt} + (1 - \beta) \cdot R^{avg}_{rt} \quad (3.12)$$

Finally, since the fractions $R_{be}$ and $R_{rt}$ may not sum to 1 (due to the skew between the average and maximum loads and the parameter $\beta$), the final allocation is normalized as follows:

$$R^{new}_{be} = \frac{R_{be}}{R_{be} + R_{rt}}, \qquad R^{new}_{rt} = \frac{R_{rt}}{R_{be} + R_{rt}} \quad (3.13)$$

As in the single-disk case, the new allocations are constrained (if necessary) using the user-specified upper and lower bounds. These allocations are then used on each individual disk for the next $P$ time units.
Observe that Equations 3.11 and 3.12 are key to multi-disk bandwidth allocation—the choice of an appropriate $\beta$ helps balance the contribution of the heavily loaded disks and the average loaded disks to the final allocation for each class.
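The combination step of Equations 3.11–3.13 reduces to a few lines of code. The sketch below assumes the per-disk allocations of Equations 3.9 and 3.10 have already been computed; the function name is illustrative.

```python
def allocate_multi_disk(per_disk_be, per_disk_rt, beta=0.75):
    """Combine per-disk allocations into array-wide shares using
    the hybrid rule (sketch of Eqs. 3.11-3.13)."""
    avg_be = sum(per_disk_be) / len(per_disk_be)
    max_be = max(per_disk_be)
    avg_rt = sum(per_disk_rt) / len(per_disk_rt)
    max_rt = max(per_disk_rt)
    # Linear combination of the heaviest disk and the average disk
    r_be = beta * max_be + (1 - beta) * avg_be
    r_rt = beta * max_rt + (1 - beta) * avg_rt
    # Normalize so the two shares sum to 1
    total = r_be + r_rt
    return r_be / total, r_rt / total
```

Setting `beta=0` recovers the pure averaging approach, while `beta=1` recovers the most-heavily-loaded-disk approach; intermediate values trade off the two.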
3.4 Experimental Methodology
We evaluate the efficacy of our self-managing bandwidth allocator using a simulation
study. In what follows, we describe the simulation environment and the workload charac-
teristics used in our experiments and then describe our experimental results.
3.4.1 Simulation Environment
We used an event-based disk simulator to evaluate our bandwidth allocation technique. Our simulator can simulate both single-disk and multi-disk servers. In either case, we assume that the server supports two application classes—best-effort and soft real-time. Requests from these classes are assumed to be serviced using the Cello disk scheduling algorithm [54]. The Cello disk scheduler supports reservations for each class and uses class-specific policies to service requests in the two classes; the SCAN policy is used to service best-effort requests, while SCAN-EDF is used to service real-time requests with deadlines. Note that any other disk scheduler that supports class-specific reservations can be used in conjunction with our bandwidth allocator without significantly affecting our results. The file server is assumed to use one or more Seagate Elite-3 disks to store files from the two application classes.² The block size used for storing text files is assumed to be 4 KB, while that for video files is 64 KB. In the case of disk arrays (i.e., a multi-disk server), all files are assumed to be striped across the disks in the array.
The workload monitoring module employed by the simulator tracks various load parameters as described in Section 3.2.3. The moving histograms computed by the module are then used by the bandwidth manager to compute the allocation for each class (as described in Sections 3.2.4 and 3.3). The allocation of each class is assumed to be initialized to $R^{init}_{be} = R^{init}_{rt} = 0.5$ at the beginning of each simulation experiment.
²The Seagate Elite disk has an average seek overhead of 11 ms, an average rotational latency of 5.55 ms, and a data transfer rate of 4.6 MB/s.
Number of read/write operations           218724
Average bit rate (original)               218.64 KB/s
Average bit rate (with 64 MB cache)       83.91 KB/s
Average inter-arrival (original)          9.14 ms
Average inter-arrival (with 64 MB cache)  22.53 ms
Average request size                      2048.22 bytes
Peak to average bit rate (1 s intervals)  12.51

Table 3.1. Characteristics of the Auspex NFS trace
3.4.2 Workload Characteristics
We use two types of workloads in our experiments: trace-driven and synthetic. Our
trace workloads have been gathered from a real file-server and enable us to determine the
efficacy of our methods for real-world scenarios. However, since a trace workload only
represents a small subset of the operating region of a file server, we use synthetic work-
loads to systematically explore the state space. Next we describe the characteristics of the
workloads used in our experiments.
3.4.2.1 Best-effort Text Clients
We used portions of an NFS trace gathered from an Auspex file server at Berkeley to generate the trace-driven text workload [23]. The characteristics of this workload are shown in Table 3.1. We assumed a 64 MB LRU buffer cache at the server and filtered out requests resulting in cache hits from the original trace; the remaining requests are assumed
requests resulting in cache hits from the original trace; the remaining requests are assumed
to result in disk accesses. Figure 3.4 illustrates the characteristics of the resulting workload.
As shown in the figure, the text workload is very bursty; the peak to average bit rate of the
trace was measured to be 12.5.
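The peak-to-average ratio quoted above is straightforward to compute from a trace once it has been binned into fixed-length intervals; a minimal sketch, assuming the trace is given as bytes transferred per 1 s interval:

```python
def peak_to_average(bytes_per_interval):
    """Peak-to-average bit rate of a trace binned into fixed
    (e.g., 1 s) intervals: max interval rate over mean rate."""
    avg = sum(bytes_per_interval) / len(bytes_per_interval)
    return max(bytes_per_interval) / avg
```

A ratio of 12.5, as measured for the filtered Auspex trace, means the busiest second carries 12.5 times the average load—exactly the burstiness that makes static allocation wasteful.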
To systematically explore the state space, we also use a synthetically generated text
workload. Each text client in the synthetic workload is assumed to be sequential or random.
The simulator allows control over the fractions $f$ and $1 - f$ of sequential and random text
clients in the workload. Clients are assumed to arrive and depart at random time instants.
Figure 3.4. Bursty nature of the NFS trace workload (bytes/s versus time).
Inter-arrival times of clients are assumed to be exponentially distributed. Upon arrival, each client is assumed to access a random file, and file sizes (and hence client lifetimes) are assumed to be heavy-tailed with a Pareto distribution. These assumptions, namely
exponential interarrivals and Pareto file sizes, are consistent with studies of real-world text
clients [16, 27].
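A generator for such a synthetic text workload can be sketched as follows. The function name and the default parameter values are illustrative assumptions; the thesis does not prescribe them.

```python
import random

def generate_text_clients(n, mean_interarrival=10.0, pareto_shape=1.5,
                          min_size=4096, seq_fraction=0.5, seed=42):
    """Generate n synthetic text clients: exponential inter-arrivals,
    Pareto (heavy-tailed) file sizes, a fraction f sequential (sketch)."""
    rng = random.Random(seed)
    t, clients = 0.0, []
    for _ in range(n):
        t += rng.expovariate(1.0 / mean_interarrival)
        # paretovariate() returns values >= 1, so sizes are heavy-tailed
        # with minimum min_size
        size = min_size * rng.paretovariate(pareto_shape)
        kind = 'seq' if rng.random() < seq_fraction else 'rand'
        clients.append((t, size, kind))
    return clients
```

Each tuple gives an arrival time, an accessed-file size (which determines the client lifetime), and whether the client reads sequentially or randomly.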
3.4.2.2 Soft Real-time Video Clients
Each video client in our simulator emulates a video player and reads a randomly se-
lected video file at a constant frame rate (e.g., 30 frames/s). Depending on the compression
algorithm, the selected video file may have a constant or a variable bit rate. Table 3.2 lists
the characteristics of video files used in our simulations. As shown in the table, we use
a mix of high bit-rate MPEG-1 files and low bit-rate MPEG-4 files. Since much of the existing online streaming media content is low bit-rate (e.g., Windows Media, RealMedia), this allows us to experiment with existing workloads as well as future higher bit-rate workloads. All video clients are assumed to be serviced in the server-push (streaming) mode. The server services these clients in periodic rounds by retrieving a fixed number of frames in each round. Disk requests for all active video clients are issued at the beginning of each
File                  Type    Length (frames)  Bit rate
Frasier               MPEG-1  5960             1.49 Mb/s
Newscast              MPEG-1  9000             2.33 Mb/s
Silence of the Lambs  MPEG-4  89998            107 Kb/s

Table 3.2. Characteristics of the video traces
round and have the end of that round as their deadlines. The round duration was set to
1000ms in our simulations.
We used observations from a recent study of an actual streaming media workload [22]
to simulate the arrival process for video clients (since the traces used in that study are not publicly available, we could not use the traces themselves). Video clients are assumed to arrive and
depart at random instants. Inter-arrival times are exponential, the object popularity is Zipf,
and the client life-times are heavy-tailed. We assumed no correlation between object sizes
and object popularity, consistent with observations made in recent studies [16].
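A Zipf popularity sampler for the video files can be sketched as follows. The exact normalization of the Zipf law used in [22] is an assumption here; this sketch takes $p(i) \propto 1/i^{\theta}$ over the file ranks.

```python
import random

def zipf_sampler(num_objects, theta=0.47, seed=7):
    """Return a sampler of object indices with Zipf popularity
    p(i) ~ 1 / i**theta (sketch; theta = 0.47 per the text)."""
    weights = [1.0 / (i ** theta) for i in range(1, num_objects + 1)]
    total = sum(weights)
    # Precompute the cumulative distribution for inverse-CDF sampling
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    rng = random.Random(seed)
    def sample():
        u = rng.random()
        for idx, c in enumerate(cdf):
            if u <= c:
                return idx
        return num_objects - 1
    return sample
```

Index 0 is the most popular file; because there is no assumed correlation between popularity and size, the file ranks can be assigned to files in any order.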
3.5 Experimental Evaluation
In what follows, we present the results of our experimental evaluation using the trace
and synthetic workloads described in the previous section.
3.5.1 Ability to Adapt to Changing Workloads
In this experiment, we show how our bandwidth allocation technique can adapt to
changing workloads. We assume a single disk server and construct a workload scenario
that exercises all four cases of the allocation algorithm listed in Section 3.2.4.3. To do so,
we assume synthetic text and video clients that arrive and depart at random instants. Text
clients are assumed to be sequential and access 10 KB of the file every 250 ms. Each video client is assumed to access an MPEG-1 file. The window size $W$ and the recomputation period $P$ were set to 100 seconds, the measurement interval $I$ was 1 s, and the smoothing
parameter $\alpha$ was 0.75. The percentile used for estimating the needs of the real-time class was set to 90.
Figure 3.5(a) depicts the variation in the number of text and video clients over the duration of the experiment (note that the figure only denotes the number of clients in each class, not their aggregate bandwidth requirements). We start with a small number of text and video clients at $t = 0$. At $t = 500$, there is a sudden burst of new video client arrivals (triggering case 3 in Section 3.2.4.3). The video burst subsides at $t = 1500$, and a burst of text clients occurs at $t = 2000$ (case 2). At $t = 4000$, there is a simultaneous burst of text and video requests, resulting in transient overload at the server (case 4).
Figure 3.5(b) shows the allocations of the two classes for this workload, while Figures 3.5(c) and (d) plot the utilization of each class along with the corresponding allocations. As shown in Figure 3.5(b), the allocation of the real-time class increases at $t = 500$ due to the video burst, while that of the best-effort class increases at $t = 2000$ due to the text burst. At $t = 4000$, the server experiences an overload, and the bandwidth manager uses the request arrival rates to determine the allocations. Moreover, Figure 3.5(c) shows that the allocation of the real-time class is always a high percentile of the load (evident from the relative values of the allocation and the utilization), whereas Figure 3.5(d) shows that the allocation of the best-effort class is the median value of the utilization. Finally, observe that in the periods $1500 \leq t \leq 2000$ and $3000 \leq t \leq 3500$, neither class utilizes its allocated share, causing the allocations to remain unchanged (case 1).
3.5.2 Bandwidth Allocation in a Single-disk Server
In this experiment, we demonstrate the efficacy of our approach for a single disk server.
Whereas we performed experiments with both trace and synthetic workloads, due to space
constraints we present our results only for trace workloads.
Figure 3.5. Adaptive allocation of disk bandwidth: (a) workload, (b) allocations, (c) utilization of the real-time class, (d) utilization of the best-effort class.
Figure 3.6. Bandwidth allocation in a single-disk server: (a) bandwidth allocations, (b) utilization of the best-effort class.
Our experiment uses the NFS traces (with a scale factor of 3) to generate a bursty text workload,³ while keeping the video load fixed over the duration of each simulation run.
We repeated the experiment for background video loads ranging from 1 to 10 simultaneous
MPEG clients. This enabled us to study the impact of a bursty text load with varying
background video loads. Each run simulates 2.8 hours of the workload on the file server.
Note also that while the number of video clients is fixed for each simulation run, each client
may impose a varying load due to the variable bit rate nature of video files.
Figures 3.6(a) and (b) plot the allocation of the two classes and the utilization of the best-effort class for one such combination (namely, the NFS workload with 7 background video clients). As shown in Figure 3.6(b), the allocation of the best-effort class closely matches the disk utilization of that class, thereby demonstrating the effectiveness of the bandwidth allocator.
3.5.3 Bandwidth Allocation in a Multi-disk Server
In this experiment, we demonstrate the efficacy of our approach for a multi-disk server.
As in the single-disk case, we conducted experiments with both trace and synthetic workloads. Due to space constraints, we only present our results for synthetic workloads.
We assumed a multi-disk server with eight disks. Both text and video files are assumed
to be striped across all disks in the array. The parameter $\beta$, which determines the contribution of the maximum load and the average load across disks, was chosen to be 0.75. As in the single-disk case, we chose $W = P = 100$ s and $I = 1$ s.
The inter-arrival times of text clients were exponentially distributed with a mean of 10 s, and the lifetimes of these clients were heavy-tailed with a mean of 4 minutes. Half of the text clients were sequential and the other half random. Inter-arrival times of video clients were also exponential with a mean of 1 minute, with heavy-tailed lifetimes averaging 4 minutes.
³The scale factor scales the inter-arrival times of requests and allows control over the burstiness of the workload.
Figure 3.7. Bandwidth allocation in a multi-disk server: (a) allocations, (b) maximum utilization of the real-time class on any disk, (c) mean utilization of the real-time class across disks.
The popularity of video files was Zipf with a parameter of 0.47 [22]. These parameters were chosen such that the text load was mostly stable, while the video load steadily increased over the duration of the experiment, eventually resulting in an overload.
Figure 3.7(a) shows the allocation of the two classes as computed by our multi-disk bandwidth allocator. Figures 3.7(b) and (c) plot the maximum utilization of the soft real-time class on any disk and its mean utilization across all disks, respectively (along with the corresponding allocations). As expected, we see that the allocation of the soft real-time class increases steadily with the load. Eventually, some of the disks in the array experience an overload, and our allocator uses request arrival rates to compute the allocations. Note
also that since we chose $\beta = 0.75$, the allocation on the average disk is slightly larger than the utilization on that disk.
3.5.4 Impact of Tunable Parameters
In this section, we show how tunable parameters such as the recomputation period $P$ $(= W)$ and the smoothing parameter $\alpha$ can be used to control the time-scale of bandwidth allocations.
The video load for this experiment was kept fixed over the duration of the simulation. The text load is initially steady for the first 2200 seconds, and a burst occurs between $2200 \leq t \leq 2800$ (the burst is characterized by a sharp increase in the number of text clients followed by a sharp decrease). Figure 3.8(a) plots this variation in the text load.
We varied $\alpha$ from 0.25 to 1 and computed the allocations of the best-effort class. In general, a large value of $\alpha$ causes the bandwidth manager to maintain less history and biases the allocations toward more recent measurements. This allows the server to react to small variations in the load. In contrast, small values of $\alpha$ smooth out recent variations in the load, making the server less sensitive to recent load changes. Figure 3.8(b) demonstrates this behavior for different values of $\alpha$. As shown in the figure, when $\alpha = 1$ the bandwidth manager quickly increases the allocation of the best-effort class to match the increase in utilization due to the burst. The increase in allocation is slower for smaller values of $\alpha$. For instance, when $\alpha = 0.25$ the allocation increases slowly to 60% and does not increase further since the burst subsides quickly.
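The role of $\alpha$ can be seen with a one-line exponential smoother. This is a simplified sketch: the thesis applies the same smoothing per histogram bucket rather than to a single utilization series.

```python
def ewma_series(samples, alpha):
    """Exponentially smoothed load estimate:
    est = alpha * new_sample + (1 - alpha) * old_est.
    Large alpha tracks recent load; small alpha damps bursts."""
    est = samples[0]
    out = [est]
    for x in samples[1:]:
        est = alpha * x + (1 - alpha) * est
        out.append(est)
    return out
```

For a short burst from 0.3 to 0.8 utilization, $\alpha = 1$ tracks the burst immediately, while $\alpha = 0.25$ moves the estimate only part of the way before the burst subsides, mirroring the 60% plateau observed in the experiment.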
Next, we varied $P$ and studied its effect on the allocation. Figure 3.8(c) depicts the allocation of the best-effort class for different values of $P$. A larger recomputation period allows the bandwidth manager to focus on long-term trends and ignore short-term variations, while a smaller recomputation period enables the server to respond to short-term variations. Figure 3.8(c) demonstrates this behavior. When $P = 100$, the allocation of the best-effort class quickly increases to match the increase in the load. In contrast, when $P = 500$ the
Figure 3.8. Effect of various tunable parameters on the granularity of bandwidth allocations: (a) utilization (workload), (b) effect of $\alpha$, (c) effect of $P$.
time-scale of interest becomes larger than the duration of the burst, and consequently the bandwidth manager ignores the burst altogether and keeps the allocation unchanged. Together, these experiments demonstrate how these tunable parameters can be used to control the granularity of bandwidth allocation and the sensitivity to load fluctuations.
3.5.5 Comparison with Static Allocation
Finally, we demonstrate the advantages of our dynamic allocation technique over static
bandwidth allocation. We initialize the allocation of the two classes to 50% of the total
disk bandwidth. Whereas the allocation remains fixed for static partitioning, it varies with
the load for dynamic allocation. We examine a scenario where the server experiences a transient overload due to a burst in the real-time class and measure the queue length of the real-time requests. Since the allocation remains fixed in the former scenario, the server is unable to respond to the overload, causing the queue of real-time requests to grow quickly. In contrast, our bandwidth allocation technique uses request arrival rates to determine the allocation of each class and allocates a larger bandwidth to the real-time class. This enables the server to exhibit more stable behavior during an overload, resulting in a more graceful increase in the queue length (the average queue length is also 59% smaller). We repeat the experiment with a steady video load and a burst in the best-effort class. Again, the server is unable to respond to the burst in the case of static allocation, whereas our dynamic allocator allocates a larger bandwidth to the best-effort class, resulting in significantly better response times. Figures 3.9(a) and (b) demonstrate this behavior.
Dynamic bandwidth allocation can also be advantageous when the server employs admission control for the real-time class. If the server were to employ static bandwidth allocation, then the admission controller would only admit as many clients as the allocation of the real-time class permits; additional real-time clients would be rejected from the system even when the best-effort class is not using its entire allocation (i.e., the system has spare capacity). In contrast, dynamic bandwidth allocation allows the server to gradually increase the allocation of the real-time class based on its usage, thereby allowing the admission controller to admit additional clients. This results in a more judicious use of system resources. We compared static allocation to our dynamic allocation technique in the presence of admission control for the real-time class. Our experiment consisted of a fixed text load and a video arrival every 500 s. The initial allocation of each of the two classes was 50%. As shown in Figure 3.9(c), dynamic bandwidth allocation permits additional clients to be admitted into the system so long as there is unused bandwidth in the best-effort class. Together, these experiments demonstrate the benefits of dynamic bandwidth allocation over static allocation.
Figure 3.9. Comparison with static partitioning: (a) queue lengths of the real-time class, (b) text response times, (c) impact of admission control.
3.6 Related Work
A number of recent and ongoing research efforts have focused on the design of self-managing systems [34, 49]. The IStore project, for instance, investigated the design of workload monitoring and adaptive resource management techniques for data-intensive network services [17]. Unlike their focus on data-intensive network applications, the focus of our work is on mixed (best-effort and streaming media) workloads. The VINO project has investigated the design of self-managing techniques for various OS tasks such as paging, interrupt latency, and disk waits [52]. Research on storage systems at HP Labs has also investigated various issues in self-managing systems, such as self-configuration (Minerva [8]), capacity planning [14], and goal-based storage management [8]. Finally, a number of predictable disk scheduling algorithms have been proposed [13, 42, 43, 54, 60]. As indicated earlier, these efforts are complementary to ours, since our bandwidth allocator can coexist with any such scheduler.
3.7 Concluding Remarks
In this chapter, we focused on the problem of self-managing bandwidth allocation to improve the manageability of modern file servers. We presented two techniques for dynamic bandwidth allocation—one for single-disk servers and the other for servers employing multiple disks or disk arrays. Both techniques consist of two components: a workload monitoring module that efficiently monitors the load in each application class, and a bandwidth manager that uses these workload statistics to dynamically determine the allocation of each class. We have evaluated the efficacy of our techniques via a simulation study using synthetic and trace workloads [56]. Our results show that these techniques (i) provide control over the time-scale of allocation via tunable parameters, (ii) have stable behavior during overload, and (iii) provide significant advantages over static bandwidth allocation.
CHAPTER 4
LEARNING-BASED APPROACH FOR DYNAMIC BANDWIDTH ALLOCATION
In the previous chapter, we looked at the problem of dynamic bandwidth allocation in the context of multimedia servers. We assumed that the workload is classified into two application classes, namely soft real-time and best-effort, based on the data type. Multiple data types are just one aspect of the problem; one could also have a large storage system hosting multiple application classes with different performance requirements.

In this chapter, we assume that the storage system is accessed by applications that can be categorized into different classes; each class is assumed to impose a certain QoS requirement in the form of a response time requirement. The workload seen by an application class varies over time, and we address the problem of how to allocate storage bandwidth to classes in the presence of varying workloads so that their QoS needs are met. We use a learning-based approach to address the problem. In the next section, we present the system model, outline the key requirements of the bandwidth allocation technique, and then describe the problem in further detail.
4.1 Problem Definition
4.1.1 Background and System Model
An enterprise storage system consists of a large number of disks that are organized into
disk arrays. A disk array is a collection of physical disks that presents an abstraction of
a single large logical storage device to the rest of the system; we refer to this abstraction
as a logical unit (LU). An application, such as a database or a file system, is allocated
storage space by concatenating space from one or more logical units; the concatenated
storage space is referred to as a logical volume (LV). Figure 4.1 illustrates the mapping
from logical volumes to logical units.
We assume that the workload accessing each logical volume can be partitioned intoap-
plication classes. This grouping can be determined based on the files accessed by requests
in each class or the QoS requirements of these requests. Eachapplication class is assumed
to have a certain response time requirement. Application classes compete for storage band-
width and the bandwidth allocated to a class governs the response time of its requests.
To enable such allocations, each disk in the system is assumed to employ a QoS-aware
disk scheduler (such as [13, 54, 60]). Such a scheduler allows disk bandwidth to be reserved
for each class and enforces these allocations at a fine time scale. Thus, if a certain disk
receives requests from $n$ application classes, then we assume that the system dynamically
determines the reservations $R_1, R_2, \ldots, R_n$ for these classes such that the response time
needs of each class are met and $\sum_{i=1}^{n} R_i = 1$ (the reservation $R_i$ essentially denotes the
fraction of the total disk bandwidth allocated to class $i$; $0 \le R_i \le 1$).
4.1.2 Key Requirements
Assuming the above system model, consider a bandwidth allocation technique that
dynamically determines the reservations $R_1, R_2, \ldots, R_n$ based on the requirements of each
class. Such a bandwidth allocation scheme should satisfy the following key requirements.

• Meet class response time requirements: Assuming that each class specifies a target
response time $d_i$, the bandwidth allocation technique should allocate sufficient
bandwidth to each class to meet its target response-time requirement. Whether this
goal can be met depends on the load imposed by each application class and the
aggregate load. In scenarios where the response time needs of a class cannot be met
(possibly due to overload), the bandwidth allocation technique should attempt to
minimize the difference between the observed and the target response times.
[Figure: disks are grouped into a left and a right logical unit (LU); logical volumes are carved out of the LUs and are accessed by application classes 1 through 5.]

Logical volumes on the extreme right and left are accessed by two application classes each, and the one in the center by a single application class. The storage system sees a total of five application classes. Disks comprising the left LU see requests from classes 1, 2 and 3; disks on the right LU see workload from all 5 application classes.
Figure 4.1. Relationship between application classes, logical volumes and logical units.

• Performance isolation: While the dynamic allocation technique should react to
changing workloads, for example, by allocating additional bandwidth to classes that
see an increased load, such increases in allocations should not affect the performance
of less loaded classes. Thus, only spare bandwidth from underloaded classes
should be reallocated to classes that are heavily loaded, thereby isolating underloaded
classes from the effects of overload.

• Stable overload behavior: Overload is observed when the aggregate workload exceeds
disk capacity, causing the target response times of all classes to be exceeded.
The bandwidth allocation technique should exhibit stable behavior under overload.
This is especially important for a learning-based approach, since such techniques
systematically search through various allocations to determine the correct allocation;
doing so under overload can result in oscillations and erratic behavior. A well-designed
dynamic allocation scheme should prevent such unstable system behavior.
4.1.3 Problem Formulation
To precisely formulate the problem addressed, consider an individual disk from a large
storage system that services requests from $n$ application classes. Let $d_1, d_2, \ldots, d_n$ denote
the target response times of these classes. Let $Rt_1, Rt_2, \ldots, Rt_n$ denote the response times
of these classes observed over a period $P$. Then the dynamic allocation technique should
compute reservations $R_1, R_2, \ldots, R_n$ such that $Rt_i \le d_i$ for any class $i$, subject to the
constraint $\sum_i R_i = 1$ and $0 \le R_i \le 1$. Since it may not always be possible to meet
the response time needs of each class, especially under overload, we modify the above
condition as follows: instead of requiring $Rt_i \le d_i\ \forall i$, we require that the response time
should be less than or as close to the target as possible. That is, $(Rt_i - d_i)^+$ should be equal
to or as close to zero as possible (the notation $x^+$ equals $x$ for positive values of $x$ and
equals 0 for negative values). Instead of attempting to meet this condition for each class,
we define a new metric

$$\sigma^+_{rt} = \sum_{i=1}^{n} (Rt_i - d_i)^+ \qquad (4.1)$$

and require that $\sigma^+_{rt}$ be minimized. Observe that $\sigma^+_{rt}$ represents the aggregate
amount by which the response time targets of classes are exceeded. Minimizing a single
metric $\sigma^+_{rt}$ enables the system to collectively minimize the QoS violations across
application classes.
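Equation 4.1 is straightforward to compute; a minimal Python sketch (the function name is ours):

```python
def sigma_rt_plus(observed, targets):
    """Aggregate QoS violation (Equation 4.1): sum of (Rt_i - d_i)^+
    over all classes; zero when every class meets its target."""
    return sum(max(rt - d, 0.0) for rt, d in zip(observed, targets))
```

For two classes with targets of 100ms whose observed response times are 80ms and 150ms, only the second class contributes, giving an aggregate violation of 50ms.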
We now present a learning-based approach that tries to minimize the $\sigma^+_{rt}$ observed
at each disk, subject to the key requirements specified in Section 4.1.2.
4.2 A Learning-based Approach
In this section, we first present some background on reinforcement learning and then
present a simple learning-based approach for dynamic storage bandwidth allocation. We
discuss limitations of this approach and present an enhanced learning-based approach that
overcomes these limitations.
4.2.1 Reinforcement Learning Background
Any learning-based approach essentially involves learning from past history. Reinforcement
learning involves learning how to map situations to actions so as to maximize
a numerical reward (the equivalent of a cost or utility function) [58]. It is assumed that the
system does not know which actions to take in order to maximize the reward; instead the
system must discover ("learn") the correct action by systematically trying various actions.
An action is defined to be one of the possible ways to react to the current system state. The
system state is defined to be a subset of what can be perceived from the environment at any
given time.
In the dynamic storage allocation problem, an action is equivalent to setting the allocations
(i.e., the reservations) of each class. The system state is the vector of the observed
response times of the application classes. The objective of reinforcement learning is to
maximize the reward despite uncertainty about the environment (in our case, the uncertainty
arises due to variations in the workload). An important aspect of reinforcement
learning is that, unlike some learning approaches, no prior training of the system is
necessary; all the learning occurs online, allowing the system to deal with unanticipated
uncertainties (e.g., events, such as flash crowds, that cannot be anticipated in advance).
It is this feature of reinforcement learning that makes it particularly attractive for
our problem.
A reward function defines the goal in reinforcement learning; by mapping an action
to a reward, it determines the intrinsic desirability of that action. For the storage allocation
problem, we define the reward function to be $-\sigma^+_{rt}$: maximizing the reward implies
minimizing $\sigma^+_{rt}$ and hence the QoS violations of classes. In reinforcement learning, we use
reward values learned from past actions to estimate the expected reward of a (future) action.

With the above background, we present a reinforcement learning approach based on
action values to dynamically allocate storage bandwidth to classes.
4.2.2 System State
A simple definition of system state is a vector of the response times of the $n$ classes:
$(Rt_1, Rt_2, \ldots, Rt_n)$, where $Rt_i$ denotes the mean response time of class $i$ observed over
a period $P$. Since the response time of a class can take any arbitrary value, the system
state space is theoretically infinite. Further, the system state by itself does not reveal if
a particular class has met its target response time. Both limitations can be addressed by
discretizing the state space as follows: partition the range of the response time (which is
$[0, \infty)$) into four parts

$$\{[0, d_i - \tau_i],\ (d_i - \tau_i, d_i],\ (d_i, d_i + \tau_i],\ (d_i + \tau_i, \infty)\}$$

and map the observed response time $Rt_i$ into one of these sub-ranges ($\tau_i$ is a constant). The
first range indicates that the class response time is substantially below its target response
time (by a threshold $\tau_i$). The second (third) range indicates that the response time is slightly
below (above) the target and by no more than the threshold $\tau_i$. The fourth range indicates
a scenario where the target response time is substantially exceeded. We label these four
states as $lo^-$, $lo$, $hi$ and $hi^+$, respectively, with the labels indicating different degrees of
over- and under-provisioning of bandwidth (see Figure 4.2). The state of a class is defined
as $S_i \in \{lo^-, lo, hi, hi^+\}$ and the modified state space is a vector of these states for each
class: $S = (S_1, S_2, \ldots, S_n)$. Observe that, since the state of a class can take only four values,
the potentially infinite state space is reduced to a size of $4^n$.
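The discretization can be sketched as a simple mapping (a Python sketch; the function name and string labels are ours):

```python
def class_state(rt, d, tau):
    """Map an observed mean response time rt to one of the four discrete
    states, given the target d and the threshold tau."""
    if rt <= d - tau:
        return "lo-"   # substantially below target (heavy underload)
    elif rt <= d:
        return "lo"    # slightly below target
    elif rt <= d + tau:
        return "hi"    # slightly above target
    else:
        return "hi+"   # substantially above target (heavy overload)
```

The boundary conditions follow the four sub-ranges above: each interval is closed on the right, so a response time exactly at the target maps to $lo$.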
4.2.3 Allocation Space
The reservation of a class $R_i$ is a real number between 0 and 1. Hence, the allocation
space $(R_1, R_2, \ldots, R_n)$ is infinite due to the infinitely many allocations for each class.
Since a learning approach must search through all possible allocations to determine an
appropriate allocation for a particular state, this makes the problem intractable. To discretize
the allocation space, we impose a restriction that requires the reservation of a class to be mod-
[Figure: the response-time axis from 0 to ∞ is divided at d − τ, d, and d + τ into four regions labeled lo− (heavy underload), lo (underload), hi (overload) and hi+ (heavy overload), where d is the response time requirement.]

Figure 4.2. Discretizing the State Space
ified in steps of $T$, where $T$ is an integer. For instance, if the step size is chosen to be
1% or 5%, the reservation of a class can only be increased or decreased by a multiple of
the step size. Imposing this simple restriction results in a finite allocation space, since the
reservation of a class can only take one of $m$ possible values, where $m = 100/T$. With $n$
classes, the number of possible combinations of allocations is $\binom{m+n-1}{m}$, resulting in a finite
allocation space. Choosing an appropriate step size allows allocations to be modified at a
sufficiently fine grain, while keeping the allocation space finite. In what follows, we use
the terms action and allocation interchangeably.
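The size of the discretized allocation space can be checked directly (a Python sketch; the function name is ours):

```python
from math import comb

def num_allocations(n, step_pct):
    """Number of ways to split 100% of the bandwidth among n classes in
    multiples of the step size: C(m+n-1, m) with m = 100/step."""
    m = 100 // step_pct
    return comb(m + n - 1, m)
```

For example, with $n = 2$ classes and a 5% step size there are only 21 possible allocations, but with $n = 5$ classes the count grows to 10626.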
4.2.4 Cost and State Action Values
For the above definition of state space, we observe that the response time needs of
a class are met so long as it is in the $lo^-$ or $lo$ state. In the event an application class is
in the $hi$ or $hi^+$ state, the system needs to increase the reservation of the class, assuming
spare bandwidth is available, to induce a transition back to $lo^-$ or $lo$. This is achieved
by computing a new set of reservations $(R_1, R_2, \ldots, R_n)$ so as to maximize the reward
$-\sigma^+_{rt}$. Note that the maximum value of the reward is zero, which occurs when the
response time needs of all classes are met (see Equation 4.1).

A simple method for determining the new allocation is to pick one based on the observed
rewards of previous actions from this state. An action (allocation) that resulted in
the largest reward $(-\sigma^+_{rt})$ is likely to do so again and is chosen over other lower-reward
actions. Making this decision requires that the system first try out all possible actions, possibly
multiple times, and then choose the one that yields the largest reward. Over a period of
time, each action may be chosen multiple times, and we store an exponential average of the
observed reward from this action (to guide future decisions):

$$Q^{new}_{(S_1, S_2, \ldots, S_n)}(a) = \gamma \cdot Q^{old}_{(S_1, S_2, \ldots, S_n)}(a) + (1 - \gamma) \cdot \left(-\sigma^+_{rt}(a)\right) \qquad (4.2)$$

where $Q$ denotes the exponentially averaged value of the reward for action $a$ taken from
state $(S_1, S_2, \ldots, S_n)$ and $\gamma$ is the exponential smoothing parameter (also known as the
forgetting factor). Learning methods of this form, where the actions selected are based
on estimates of action-reward values (also referred to as action values), are referred to as
action-value methods.
We choose an exponential average over a sample average because the latter is appropriate
only for stationary environments. In our case, the environment is non-stationary due
to the changing workloads, and the same action from a state may yield different rewards
depending on the current workload. For such scenarios, recency-weighted exponential
averages are more appropriate. With $4^n$ states and $\binom{m+n-1}{m}$ possible actions in each state, the
system will need to store $\binom{m+n-1}{m} \cdot 4^n$ such averages, one for each state-action pair.
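A dictionary-backed sketch of this action-value update (the reward passed in would be the $-\sigma^+_{rt}$ measured over the last period; the names and the seeding of the first observation are ours):

```python
def update_q(q_table, state, action, reward, gamma=0.5):
    """Recency-weighted exponential average of the reward observed for
    taking `action` in `state` (Equation 4.2); gamma is the forgetting
    factor. The first observation simply seeds the average."""
    key = (state, action)
    if key in q_table:
        q_table[key] = gamma * q_table[key] + (1 - gamma) * reward
    else:
        q_table[key] = reward
```

With $\gamma = 0.5$, observing rewards of $-30$ and then $-10$ for the same state-action pair leaves a stored value of $-20$, weighting recent observations heavily, as desired in a non-stationary environment.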
4.2.5 A Simple Learning-based Approach
A simple learning approach is one that systematically tries out all possible allocations
from each system state, computes the reward for each action, and stores these values to
guide future allocations. Note that it is the discretization of the state space and the allocation
space, as described in Sections 4.2.2 and 4.2.3, that makes this approach possible.
Once the reward values are determined for the various actions, upon a subsequent transition
to this state, the system can use these values to pick an allocation with the maximum
reward. The set of learned reward values for a state is also referred to as the history of the
state. As an example, consider two application classes that are allocated 50% each of the
disk bandwidth and are in $(lo^-, lo^-)$. Assume that a workload change causes a transition to
$(lo^-, hi^+)$. Then the system needs to choose one of several possible allocations: $(0, 100)$,
$(5, 95)$, $(10, 90)$, \ldots, $(100, 0)$. Choosing one of these allocations allows the system to learn
the reward $-\sigma^+_{rt}$ that accrues as a result of that action. After trying all possible
allocations, the system can use these learned values to directly determine an allocation that
maximizes the reward (by minimizing the aggregate QoS violations). This quick and suitable
reassignment of class allocations is facilitated by learning. Figure 4.3 shows the steps
involved in a learning-based approach.
Although such a reinforcement learning scheme is simple to design and implement, it
has numerous drawbacks.

• Actions are oblivious of system state: A key drawback of this simple learning
approach is that the actions are oblivious of the system state: the approach tries
all possible actions, even ones that are clearly unsuitable for a particular state. In the
above example, for instance, any allocation that decreases the share of the overloaded
$hi^+$ class and increases that of the underloaded $lo^-$ class is incorrect. Such an action
can worsen the overall system performance. Nevertheless, such actions are explored
to determine their reward. The drawback arises primarily because the semantics of
the problem are not incorporated into the learning technique.

• No performance isolation: Since the system state is not taken into account while
making allocation decisions, the approach cannot provide performance isolation to
classes. In the above example, an arbitrary allocation of $(0, 100)$ can severely affect
the $lo^-$ class while favoring the overloaded class.

• Large search space and memory requirements: Since there are $\binom{m+n-1}{m}$ possible
allocations in each of the $4^n$ states, a systematic search of all possible allocations is
impractical. This overhead is manageable when $n = 2$ classes and $m = 20$ (which
corresponds to a step size of 5%; $m = 100/5$), since there are only $\binom{21}{20} = 21$
[Figure: requests enter class-specific queues at the storage device (disk), which is managed by a QoS-aware disk scheduler. In a loop, the system averages the class response times, determines the system state, computes the reward, updates the action values, computes a new allocation, and sleeps for the re-computation period P.]

Figure 4.3. Steps involved in learning
allocations for each of the $4^2 = 16$ states. However, for $n = 5$ classes, the number
of possible actions increases to 10626 for each of the $4^5 = 1024$ states. Since the number
of possible actions increases rapidly with an increase in the number of classes, so
does the memory requirement (since the reward for each allocation needs to be stored
in memory to guide future allocations). For $n = 5$ classes and $m = 20$, 83MB of
memory is needed per disk to store these reward values. This overhead is impractical
for storage systems with a large number of disks.
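The 83MB figure follows from storing one value per (state, action) pair; a quick check of the arithmetic, assuming an 8-byte average per pair (the storage width is our assumption):

```python
from math import comb

n, m = 5, 20                          # 5 classes, 5% step size (m = 100/T)
actions = comb(m + n - 1, m)          # 10626 possible actions per state
states = 4 ** n                       # 1024 system states
total_bytes = actions * states * 8    # one 8-byte reward average per pair
print(round(total_bytes / 2**20))     # -> 83 (MB)
```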
4.2.6 An Enhanced Learning-based Approach
We design an enhanced learning approach that uses the semantics of the problem to
overcome the drawbacks of the naive learning approach outlined in the previous section.
The key insight in the enhanced approach is to use the state of a class to determine
whether to increase or decrease its allocation (instead of naively exploring all possible
allocations). In the example listed in the previous section, for instance, only those allocations
that increase the reservation of the overloaded class and decrease the allocation of
the underloaded class are considered. The technique also includes provisions to provide
[Figure: flowchart. The class response times determine the system state. If all classes are in the same state, the allocation is left unchanged. If all classes are in hi or hi+, any class in hi+ whose allocation is below its default is reset to the default; otherwise allocations are left unchanged. If some classes are overloaded and some underloaded (lo− or lo), the system takes the action with the best reward if history exists, and otherwise reassigns T from an underloaded to an overloaded class.]

Figure 4.4. Algorithm flowchart
performance isolation, achieve stable overload behavior, and reduce memory and search-space
overheads.
Initially, we assume that the allocations of all classes are set to a default value (a simple
default allocation is to assign equal shares to the classes; any other default may be specified).
We assume that the allocations of classes are recomputed every $P$ time units. To do
so, the technique first determines the system state and then computes the new allocation for
this state as follows:

• Case I: All classes are underloaded (are in $lo^-$ or $lo$). Since all classes are in $lo$ or
$lo^-$, by definition, their response time needs are satisfied and no action is necessary.
Hence, the allocation is left unchanged. An optimization is possible when some
classes are in $lo^-$ and some are in $lo$. Since the goal is to drive all classes to as low a
state as possible, one can reallocate bandwidth from the classes in $lo^-$ to the classes
in $lo$. How bandwidth is reallocated and history maintained to achieve this is similar
to the approach described in Case III below.

• Case II: All classes are overloaded (are in $hi$ or $hi^+$). Since all classes are in $hi$
or $hi^+$, the target response times of all classes are exceeded, indicating an overload
situation. While every class could use extra bandwidth, none exists in the system. Since
no spare bandwidth is available, we leave the allocations unchanged.

An additional optimization is possible in this state. If some class is heavily overloaded
(i.e., is in $hi^+$) and is currently allocated less than its initial default allocation,
then the allocation of all classes is set to their default values (the allocation is left
unchanged otherwise). The insight behind this action is that no class should be in $hi^+$
due to starvation resulting from an allocation less than its default. Resetting the
allocations to their default values during such heavy overloads ensures that the system
performance is no worse than a static approach that allocates the default allocation to
each class.

• Case III: Some classes are overloaded, others are underloaded (some in $hi^+$ or $hi$
and some in $lo$ or $lo^-$). This is the scenario where learning is employed. Since some
classes are underloaded while others are overloaded, the system should reallocate
spare bandwidth from underloaded classes to overloaded classes. Initially, there is
no history in the system and the system must learn how much bandwidth to reassign
from underloaded to overloaded classes. Once some history is available, the reward
values from past actions can be used to guide the reallocation.

The learning occurs as follows. The application classes are partitioned into two
sets: lenders and borrowers. A class is assigned to the lenders set if it is in $lo$ or
$lo^-$; classes in $hi$ and $hi^+$ are deemed borrowers. The basic idea is to reduce the
allocation of a lender by $T$ and reassign this bandwidth to a borrower. Note that
the bandwidth of only one lender and one borrower is modified at any given time,
and only by the step size $T$; doing so systematically reassigns spare bandwidth from
lenders to borrowers, while learning the rewards from these actions.

Different strategies can be used to pick a lender and a borrower. One approach is
to pick the most needy borrower and the most over-provisioned lender (these classes
can be identified by how far the class is from its target response time; the greater
this difference, the greater the need or the available spare bandwidth). Another approach
is to cycle through the lists of lenders and borrowers and reallocate bandwidth
to classes in a round-robin fashion. The latter strategy ensures that the needs of all
borrowers are met in a cyclic fashion, while the former strategy focuses on the most
needy borrower before addressing the needs of the remaining borrowers. Regardless
of the strategy, the system state is recomputed $P$ time units after each reallocation. If
some classes continue to be overloaded while others are underloaded, we repeat the
above process. If the system transitions to a state covered by Case I or II, we handle
it as discussed above.
The reward obtained after each allocation is stored as an exponentially smoothed
average (as shown in Equation 4.2). However, instead of storing the rewards of all
possible actions, we only store the rewards of the actions that yield the $k$ highest
rewards. The insight here is that the remaining actions do not yield a good reward
and, since the system will not consider them subsequently, we do not need to store
the corresponding reward values. These actions and their corresponding reward estimates
are stored as a linked list, with neighboring elements in the list differing
in the allocations of two classes (one lender and one borrower) by the step size $T$.
This facilitates a systematic search for the suitable allocation for a state, and
also pruning of the list to maintain a size of no more than $k$. By storing a fixed
number of actions and rewards for any given state, the memory requirements can
be reduced substantially. Further, while the allocation of a borrower and a lender
is changed only by $T$ in each step during the initial learning process, it can be
changed by a larger amount once some history is available (this is done
by directly picking the allocation that yields the maximum reward).
Figure 4.4 summarizes our technique. As a final optimization, we use a small non-zero
probability $\epsilon$ to bias the system to occasionally choose a neighboring allocation instead of
the allocation with the highest reward (a neighboring allocation is one that differs from
the best allocation by the step size $T$ for the borrowing and lending classes, e.g., $(30, 70)$
instead of $(35, 65)$ when $T = 5\%$). We do this because the value of an allocation may be
underestimated as a result of a sudden workload reversal, in which case the allocation that
appears best according to the current history may not actually be the best one. An occasional
choice of a neighboring allocation ensures that the system explores the state space
sufficiently well to discover a suitable allocation.
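The per-period decision logic can be sketched as follows (a simplified Python illustration with names of our choosing; the Case I/II optimizations, history lookup, and $\epsilon$-exploration described above are omitted for brevity):

```python
def recompute(alloc, states, step=5):
    """One recomputation step of the enhanced approach (a sketch).
    `alloc` maps class -> current reservation in percent;
    `states` maps class -> discrete state ('lo-', 'lo', 'hi', 'hi+')."""
    lenders = [c for c in sorted(alloc) if states[c] in ("lo-", "lo")]
    borrowers = [c for c in sorted(alloc) if states[c] in ("hi", "hi+")]
    new = dict(alloc)
    if not borrowers:
        return new   # Case I: all targets met; leave allocation unchanged
    if not lenders:
        return new   # Case II: overload; leave unchanged (default reset omitted)
    # Case III: shift one step of bandwidth from a lender to a borrower
    # (here simply the first of each; a most-needy or round-robin policy
    # would refine this choice).
    new[lenders[0]] -= step
    new[borrowers[0]] += step
    return new
```

For example, with two classes at 50% each where one class is in $lo^-$ and the other in $hi^+$, a single step with $T = 5$ moves the allocation to $(45, 55)$.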
Observe that our enhanced learning approach reclaims bandwidth only from those
classes that have bandwidth to spare ($lo$ and $lo^-$ classes) and reassigns this bandwidth to
classes that need it. Since a borrower takes up bandwidth in increments of $T$ from a lender,
the lender could in the worst case end up in state $hi$.¹ At this stage there would be a state
change, and the action would be dictated by this new state. Thus, this strategy ensures that
any new allocation chosen by the approach can only improve (and not worsen) the system
performance; doing so also provides a degree of performance isolation to classes.
The technique also takes the current system state into account while making allocation
decisions and thereby avoids allocations that are clearly inappropriate for a particular state;
in other words, the optimized learning technique intelligently guides and restricts the
allocation space explored. Further, since only the $k$ highest-reward actions are stored, the
worst-case search overhead is reduced to $O(k)$. This is a substantial reduction from
the search overhead of the simple learning approach. Finally, the memory needs of the
technique are reduced from $\binom{m+n-1}{m} \cdot 4^n$ to $4^n \cdot k$ stored values, where $k$ is the number of high-reward actions for
which history is maintained. This design decision also results in a substantial reduction in
the memory requirements of the approach. In the case of 5 application classes, $T = 5\%$
(recall $m = 100/T$) and $k = 5$, for example, the technique yields more than a 99% reduction
in memory needs over the simple learning approach.
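The claimed reduction is easy to verify: per state, the stored history shrinks from $\binom{24}{20} = 10626$ averages to $k = 5$:

```python
from math import comb

n, m, k = 5, 20, 5                  # 5 classes, 5% step size, k best actions kept
full = comb(m + n - 1, m) * 4**n    # simple approach: all actions, all states
pruned = k * 4**n                   # enhanced approach: k actions per state
print(f"{100 * (1 - pruned / full):.2f}% reduction")   # 99.95% reduction
```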
¹The choice of the step size $T$ is important here. If the step size is too big, the overloaded class could end up in underload and vice versa, and this could result in oscillations.
4.3 Implementation in Linux
We have implemented our techniques in the Linux kernel version 2.4.9. Our prototype
consists of three components: (i) a QoS-aware disk scheduler that supports per-class reservations,
(ii) a module that monitors the response times of each class, and (iii)
a learning-based bandwidth allocator that periodically recomputes the reservations of the
classes on each disk. Our prototype was implemented on a Dell PowerEdge server (model
2650) with two 1 GHz Pentium III processors and 1 GB memory that runs RedHat Linux
7.2. The server was connected to a Dell PowerVault storage pack (model 210) with eight
SCSI disks. Each disk is an 18GB 10,000 RPM Fujitsu MAJ3182MC disk; the characteristics
of the disk are shown in Table 2.1 in Chapter 2.² We use the software RAID driver in
Linux to configure the system as a single RAID-0 array.
We implement the Cello QoS-aware disk scheduler in the Linux kernel [54]. The disk
scheduler supports a configurable number of application classes and allows a fraction of the
disk bandwidth to be reserved for each class (these reservations can be set using the scheduler
system call interface). These reservations are then enforced on a fine time scale, while taking
disk seek overheads into account. We extend the open system call to allow applications to
associate file I/O with an application class; all subsequent read and write operations on the
file are then associated with the specified class. The use of our enhanced open system call
interface requires application source code to be modified. To enable legacy applications to
benefit from our techniques, we also provide a command-line utility that allows a process
(or a thread) to be associated with an application class; all subsequent I/O from the process
is then associated with that class. Any child processes that are forked by this process inherit
these attributes and their I/O requests are treated accordingly.
We also add functionality to the Linux kernel to monitor the response times of requests
in each class (at each disk); the response time is defined as the sum of the queuing
²The Fujitsu MAJ3182MC disk has an average seek overhead of 4.7 ms, an average latency of 2.99 ms, and a data transfer rate of 39.16 MB/s.
delay and the disk service time. We compute the mean response time in each class over a
moving window of duration $P$.
The bandwidth allocator runs as a privileged daemon in user space. It periodically
queries the monitoring module for the response time of each class; this can be done using a
special-purpose system call or via the /proc interface in Linux. The response time values
are then used to compute the system state. The new allocation is then determined and
conveyed to the disk scheduler using the scheduler interface.
4.4 Experimental Evaluation
In this section, we demonstrate the efficacy of our techniques using a combination of
prototype experimentation and simulations. In what follows, we first present our simulation
methodology and simulation results, followed by results from our prototype implementation.
4.4.1 Simulation Methodology and Workload
We use an event-based storage system simulator to evaluate our bandwidth allocation
technique. The simulator simulates a disk array that is accessed by multiple application
classes. Each disk in the array is modeled as an 18 GB, 10,000 RPM Fujitsu MAJ3182MC
disk. The disk array is assumed to be configured as a RAID-0 array with multiple volumes;
unless specified otherwise, we assume an array of 8 disks. Each disk in the system is as-
sumed to employ a QoS-aware disk scheduler that supports class-specific reservations; we
use the Cello disk scheduler [54] for this purpose. Observe that the hardware configuration
assumed in our simulations is identical to that in our prototype implementation. We assume
that the system monitors the response times of each class over a period P and recomputes
the allocations after each such period. We choose P = 5s in our experiments. Unless
specified otherwise, we choose a target response time of di = 100ms for each class and the
threshold used to discretize the class states into the lo−, lo, hi and hi+ categories is set to
20ms.
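The discretization can be sketched as a small function. This is a plausible reading of the rule described above (target response time di with a fixed threshold around it), not code taken from the thesis; the exact boundary conditions are our assumption:

```python
def classify(rt_ms, target_ms=100.0, threshold_ms=20.0):
    """Discretize a class's mean response time into the four states:
    'lo-' (well under target), 'lo' (under target), 'hi' (over target),
    'hi+' (well over target)."""
    if rt_ms <= target_ms - threshold_ms:
        return "lo-"
    elif rt_ms <= target_ms:
        return "lo"
    elif rt_ms <= target_ms + threshold_ms:
        return "hi"
    else:
        return "hi+"
```

With the default target of 100 ms and threshold of 20 ms, a class averaging 70 ms is lo−, 90 ms is lo, 110 ms is hi, and 150 ms is hi+.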
We use two types of workloads in our simulations: trace-driven and synthetic. We
use NFS traces to determine the effectiveness of our methods for real-world scenarios.
However, since a trace workload only represents a small subset of the operating region, we
use a synthetic workload to systematically explore the state space.
We use portions of an NFS trace gathered from the Auspex file server at Berkeley
[23] to generate the trace-driven workload. To account for caching effects, we assume a
large LRU buffer cache at the server and filter out requests resulting in cache hits from the
original trace; the remaining requests are assumed to result in disk accesses. The resulting
NFS trace is very bursty, with a peak-to-average bit rate of 12.5.
Our synthetic workload consists of clients that arrive according to a Poisson process
and read a randomly selected file. File sizes are assumed to be heavy-tailed; each client
issues fixed-size requests that sequentially read the selected file. By carefully controlling
the arrival rates of such clients, we can construct transient overload scenarios (where a
burst of clients arrives in quick succession).
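Such a generator can be sketched as follows. This is an illustrative Python sketch; the Pareto shape parameter, minimum file size, and 4 KB request size are illustrative values, not the thesis's exact settings:

```python
import random

def generate_clients(rate_per_sec, duration_secs, pareto_alpha=1.5,
                     min_file_kb=64, request_kb=4, seed=42):
    """Synthetic workload: clients arrive as a Poisson process
    (exponential inter-arrival times); each sequentially reads a
    heavy-tailed (Pareto) file in fixed-size requests."""
    rng = random.Random(seed)
    t, clients = 0.0, []
    while True:
        t += rng.expovariate(rate_per_sec)   # Poisson arrivals
        if t >= duration_secs:
            break
        file_kb = min_file_kb * rng.paretovariate(pareto_alpha)
        n_requests = max(1, int(file_kb // request_kb))
        clients.append((t, n_requests))      # (arrival time, sequential reads)
    return clients
```

Raising `rate_per_sec` over a chosen interval produces the transient overload scenarios (a burst of arrivals in quick succession) used in the experiments below.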
Next, we present our experimental results.
4.4.2 Effectiveness of Dynamic Bandwidth Allocation
We begin with a simple simulation experiment to demonstrate the behavior of our dy-
namic bandwidth allocation approach in the presence of varying workloads. We configure
the system with two application classes. We choose an exponential smoothing parameter
γ = 0.5, a learning step size T = 5%, and k = 5 stored values per state.
The target response time is set to 75ms for each class and the re-computation period is
5s. Each class is initially assigned 50% of the disk bandwidth.
We use a synthetic workload for this experiment. Initially, both classes are assumed to
have 5 concurrent clients each; each client reads a randomly selected file by issuing 4 KB
requests. At time t = 100s, the workload in class 1 is gradually increased to 8 concurrent
[Figure: (a) workload (number of clients over time for Class 1 and Class 2); (b) average response time of Class 1 and (c) of Class 2, each comparing Static and Learning against the target response time.]
Figure 4.5. Behavior of the learning-based dynamic bandwidth allocation technique.
clients. At t = 600s, the workload in class 2 is gradually increased to 8 clients. The system
experiences a heavy overload from t = 700s to t = 900s. At t = 900s, several clients depart
and the load reverts to the initial load. We measure the response times of the two classes
and then repeat the experiment with a static allocation of (50%, 50%) for the two classes.
Figure 4.5 depicts the class response times. As shown, the dynamic allocation tech-
nique adapts to the changing workload and yields response times that are close to the target.
Further, due to the adaptive nature of the technique, the observed response times are, for
the most part, better than those under the static allocation. Observe that, immediately after a
workload change, the learning technique requires a short period of time to learn and adjust
[Figure: cumulative QoS violations over time. (a) Trace workload: dynamic allocation with learning, dynamic allocation without learning, and static. (b) Comparison with simple learning: enhanced learning, static, and naive learning.]
Figure 4.6. Comparison with Alternative Approaches
the allocations, and this temporarily yields a response time that is higher than that in the
static case (e.g., at t = 600s in Fig 4.5(b)). Also, observe that between t = 700s and t = 900s
the system experiences a heavy overload and, as discussed in Case II of our approach, the
dynamic technique resets the allocations of both hi+ classes to their default values, yielding
performance that is identical to the static case.
4.4.3 Comparison with Alternative Approaches
In this section, we compare our learning-based approach with three alternative approaches:
(i) static, where the allocation of classes is chosen statically; (ii) dynamic allocation with
no learning, where the allocation technique is identical to ours but no learning
is employed (i.e., allocations are left unchanged when all classes are underloaded or over-
loaded, as in Cases I and II in Section 4.2.6, and in Case III bandwidth is reassigned from
the least underloaded class to the most overloaded class in steps of T, but no learning is
employed); and (iii) the simple learning approach outlined in Section 4.2.5.
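The no-learning reallocation step (alternative (ii)) can be sketched as follows. This is our reading of the rule stated above, with an ordering over the discrete states standing in for the "degree" of under/overload; it is not code from the thesis:

```python
def reallocate_no_learning(alloc, states, step_pct=5.0):
    """Dynamic allocation without learning: when some classes are
    underloaded and some overloaded (Case III), move a fixed step T of
    bandwidth from the least underloaded class to the most overloaded
    one; otherwise (Cases I and II) leave allocations unchanged."""
    order = {"lo-": 0, "lo": 1, "hi": 2, "hi+": 3}
    under = [c for c, s in states.items() if s in ("lo-", "lo")]
    over = [c for c, s in states.items() if s in ("hi", "hi+")]
    if not under or not over:
        return alloc  # all underloaded or all overloaded: no change
    lender = max(under, key=lambda c: order[states[c]])    # least underloaded
    borrower = max(over, key=lambda c: order[states[c]])   # most overloaded
    step = min(step_pct, alloc[lender])  # cannot lend more than it holds
    new_alloc = dict(alloc)
    new_alloc[lender] -= step
    new_alloc[borrower] += step
    return new_alloc
```

For example, with a (50%, 50%) split, a class in lo and a class in hi+, one step of T = 5% yields (45%, 55%).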
We use the NFS traces to compare our enhanced learning approach with the static and
the dynamic allocation techniques with no learning. We configure the system with three
classes with different scale factors³ and set the target response time of each class to 100ms.
The re-computation period is chosen to be 5s. We use different portions of our NFS trace
to generate the workload for the three classes. The stripe unit size for the RAID-0 array is
chosen to be 8 KB. We use about 2.8 hours of the trace for this experiment.
We run the experiment for our learning-based allocation technique and repeat it for
static allocation and dynamic allocation without learning. In Figure 4.6(a) we plot the cu-
mulative number of QoS violations observed over the duration of
the experiment for the three approaches; this metric helps us quantify the performance of
an approach in the long run. Not surprisingly, the static allocation technique yields the
worst performance and incurs the largest number of QoS violations. The dynamic alloca-
tion technique without learning yields a substantial improvement over the static approach,
while dynamic allocation with learning yields a further improvement. Observe that the
gap between static and dynamic allocation without learning depicts the benefits of dynamic
allocation over static, while the gap between the technique without learning and our tech-
nique depicts the additional benefits of employing learning. Overall, we see a factor of
3.8 reduction in QoS violations when compared to a pure static scheme and a factor of 2.1
when compared to a dynamic technique with no learning.
Our second experiment compares our enhanced learning approach with the simple
learning approach described in Section 4.2.5. Most parameters are identical to the pre-
vious scenario, except that we assume two application classes instead of three for
this experiment. Figure 4.6(b) plots the cumulative QoS violations observed for the two
approaches (we also plot the performance of static allocation for comparison). As can be
seen, the naive learning approach incurs a larger search/learning overhead since it system-
atically searches through all possible actions. In doing so, incorrect actions that degrade
system performance are explored and actually worsen performance. Consequently, we
³The scale factor scales the inter-arrival times of requests and allows control over the burstiness of the workload.
[Figure: normalized cumulative QoS violations as a function of (a) the smoothing (forgetting) parameter, (b) the step size, and (c) the number of values stored per state.]
Figure 4.7. Impact of Tunable Parameters
see a substantially larger number of QoS violations in the initial period; the slope of the
violation curve drops sharply once some history is available to make more informed de-
cisions. Consequently, during this initial learning process, the naive learning approach under-
performs even the static scheme; the enhanced learning technique does not suffer from
these drawbacks and, as before, yields the best performance.
4.4.4 Effect of Tunable Parameters
We conduct several experiments to study how the choice of three tunable parameters
affects the system behavior: the exponential smoothing parameter γ, the step size T, and
the history size k that defines the number of high-reward actions stored by the system.
First, we study the impact of the smoothing parameter γ. Recall from Equation 4.1 that
γ = 0 implies that only the most recent reward value is considered, while γ = 1 completely
ignores new reward values. We choose T = 5% and k = 5. We vary γ systematically from 0.0
to 0.9, in steps of 0.1, and study its impact on the observed QoS violations. We normalize
the cumulative QoS violations observed for each value of γ by the minimum number of
violations observed in the experiment. Figure 4.7(a) plots our results. As shown in the
figure, the observed QoS violations are comparable for γ values in the range (0, 0.6). The
number of QoS violations increases for larger values of γ, since larger values of γ give
less importance to recent reward values and, consequently, result in more QoS vio-
lations. This demonstrates that, in the presence of dynamically varying workloads, recent
reward values should be given sufficient importance. We suggest choosing a γ between 0.3
and 0.6 to strike a balance between recent reward values and those learned from past
history.
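The smoothing update itself takes a single line. The form below is reconstructed from the description of the two extremes (γ = 0 keeps only the most recent reward, γ = 1 ignores new samples), not copied from Equation 4.1:

```python
def smooth_reward(old, sample, gamma=0.5):
    """Exponentially smoothed reward: gamma = 0 keeps only the most
    recent sample; gamma = 1 ignores new samples entirely."""
    return gamma * old + (1.0 - gamma) * sample
```

With γ = 0.5, a stored reward of 10 and a new sample of 20 yield a smoothed value of 15.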
Next, we study the impact of the step size T. We choose γ = 0.5, k = 4, vary T
from 1% to 10%, and observe its impact on system performance. Note that a small value of
T allows fine-grain reassignment of bandwidth but can increase the time to search for the
correct allocation (since the allocation is varied only in steps of T). In contrast, a larger
value of T permits a faster search but only coarse-grain reallocation. Figure 4.7(b)
plots the normalized QoS violations for different values of T. As shown, very small values
of T result in a substantially higher search overhead and increase the time to converge to the
correct allocation, resulting in more QoS violations. Moderate step sizes ranging from 3%
to as large as 10% provide comparable performance. To strike a balance between
fine-grain allocation and low learning (search) overheads, we suggest step sizes ranging
from 3% to 7%. Essentially, the step size should be sufficiently large to result in a noticeable
improvement in the response times of borrowers, but not so large as to adversely affect a
lender class (by reclaiming too much bandwidth).
Finally, we study the impact of varying the history size k on performance. We
choose γ = 0.5, T = 5%, and vary k from 1 to 10. Figure 4.7(c) plots the cumulative QoS
violations, normalized by the minimum number of violations across history sizes.
Initially, increasing the history size results in a small decrease
in the number of QoS violations, indicating that additional history allows the system to
make better decisions. However, increasing the history size beyond 5 does not yield any
additional improvement. This indicates that storing a small number of high-reward actions
is sufficient, and that it is not necessary to store the reward for every possible action, as in
the naive learning technique, to make informed decisions. Using a small value of k also
yields a substantial reduction in the memory requirements of the learning approach.
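The bounded per-state history can be sketched with a min-heap. This is an illustrative sketch of the idea (keep only the k highest-reward actions seen in a state), not the thesis's data structure:

```python
import heapq

def update_history(history, action, reward, k=5):
    """Record (reward, action) in a state's history, retaining only the
    k highest-reward entries; the lowest-reward entry is evicted when
    the bound is exceeded. `history` is a min-heap of (reward, action)."""
    heapq.heappush(history, (reward, action))
    if len(history) > k:
        heapq.heappop(history)  # drop the lowest-reward entry
    return history
```

The best remembered action for a state is then simply `max(history)`; memory per state is O(k) rather than proportional to the number of possible actions.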
4.4.5 Implementation Experiments
We now demonstrate the effectiveness of our approach by conducting experiments on
our Linux prototype. As discussed in Section 4.3, our prototype consists of an 8-disk sys-
tem, configured as RAID-0 using the software RAID driver in Linux. We construct three
volumes on this array, each corresponding to an application class. We use a mix of three
different applications in our study, each of which belongs to a different class: (1) Post-
greSQL database server: We use the publicly available PostgreSQL database server version
7.2.3 and the pgbench 1.2 benchmark. This benchmark emulates the TPC-B transactional
benchmark and provides control over the number of concurrent clients as well as the num-
ber of transactions performed by each client. The benchmark generates a write-intensive
workload with small writes. (2) MPEG Streaming Media Server: We use a home-grown
MPEG-1 streaming media server to stream 90-minute videos to multiple clients over UDP.
Each video has a constant bit rate of 2.34 Mb/s and represents a sequential workload with
large reads. (3) Apache Web Server: We use the Apache web server and the publicly avail-
able SURGE web workload generator to generate web workloads. We configure SURGE
to generate a workload that emulates 300 time-sharing users accessing a 2.3 GB data set
with 100,000 files. We use the default SURGE settings for the file size distribution,
request size distribution, file popularity, temporal locality, and idle periods of users. The
resulting workload is largely read-only and consists of small to medium size reads. Each of
the above applications is assumed to belong to a separate application class. To ensure that our
results are not skewed by a largely empty disk array, we populated the array with a variety
of other large and small files so that 50% of the 144GB storage space was utilized. We
choose γ = 0.5, T = 5%, k = 5, and a recomputation period P = 5s. The target response
times of the three classes are set to 40ms, 50ms and 30ms, respectively.
We conduct a 10-minute experiment where the workload on the streaming server is fixed
at 2 concurrent clients (a total I/O rate of 4.6 Mb/s). The database server is lightly loaded in
the first half of the experiment, and we gradually increase the load on the Apache web server
(by starting a new instance of the SURGE client every minute; each new client represents
300 additional concurrent users). At t = 5 minutes, the load on the web server reverts to the
initial load (a single SURGE client). For the second half of the experiment, we introduce
a heavy database workload by configuring pgbench to emulate 20 concurrent users, each
performing 500 transactions (thereby introducing a write-intensive workload).
Figure 4.8(a) plots the cumulative QoS violations observed over the duration of the
experiment for our learning technique and the static allocation technique. As shown, for the
first half of the experiment there are no QoS violations, since there is sufficient bandwidth
capacity to meet the needs of all classes. The arrival of a heavy database workload triggers
a reallocation in the learning approach and allows the system to adapt to this change. The
static scheme is unable to adapt and incurs a significantly larger number of violations.
Figure 4.8(b) plots the time series of the response times for the database server. As shown,
the adaptive nature of the learning approach enables it to provide better response times to
[Figure: (a) cumulative QoS violations over time for Static and Learning; (b) average response time of the database server for Learning and Static, against the target response time.]
Figure 4.8. Results from our prototype implementation.
the database server. While the learning technique provides comparable or better response
times than static allocation for the web server, we see that both approaches are able to
meet the target response time requirements (due to the light web workload in the second
half, the observed response times are also very small). We observe similar behavior for
the web server and the streaming server. As mentioned before, learning can perform
worse at some instants, either because it is exploring the allocation space or due to a sudden
workload change, and it requires a short period to readjust the allocations. In Figure 4.8(b)
this happens around t = 400s, when learning performs worse than static, but the approach
quickly takes corrective action and gives better performance.
Overall, the behavior of our prototype implementation is consistent with our simulation
results.
4.4.6 Implementation Overheads
Our final experiment measures the implementation overheads of our learning-based
bandwidth allocator. To do so, we vary the number of disks in the system from 50 to 500,
in steps of 50, and measure the memory and CPU requirements of our bandwidth allocator.
Observe that, since we are constrained by an 8-disk system, we emulate a large storage
system by simply replicating the response times observed at a single disk and reporting
these values for all emulated disks. From the perspective of the bandwidth allocator, this
setup is no different from one where these disks actually exist in the system. Further, since
the allocations on each disk are computed independently, such a strategy accurately measures
the memory and CPU overheads of our technique. We assume that new allocations are
computed once every 5s.
We find the CPU requirement of our approach to be less than 0.1% even for sys-
tems with 500 disks, indicating that the CPU overheads of the learning approach are negli-
gible. The memory overheads of the allocation daemon are also small: the percentage
of memory used on a server with 1 GB RAM varies (almost linearly) from 1 MB (0.1%)
for a 50-disk system to 7 MB (0.7%) for a 500-disk system. We note that this memory
usage is for an untuned version of our allocator, in which we maintain numerous additional
statistics for conducting our experiments; the actual memory requirements will be smaller
than those reported here, indicating that the technique can be used in practical systems.
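A back-of-envelope calculation makes the near-linear scaling concrete; the two endpoints are the reported measurements, and treating the growth as exactly linear is our simplification:

```python
# Linear fit through the reported endpoints:
# ~1 MB at 50 disks and ~7 MB at 500 disks.
low_disks, low_mb = 50, 1.0
high_disks, high_mb = 500, 7.0

per_disk_kb = (high_mb - low_mb) * 1024 / (high_disks - low_disks)
base_mb = low_mb - per_disk_kb * low_disks / 1024  # fixed daemon overhead
```

This works out to roughly 13.7 KB of allocator state per disk on top of a fixed overhead of about a third of a megabyte, consistent with the small per-state histories discussed in Section 4.4.4.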
Finally, note that the system call overheads of querying response times and conveying
the new allocations to the disk scheduler can be substantial in a 500-disk system (this
involves 1000 system calls every 5 seconds, two for each disk). However, observe that the
bandwidth allocator was implemented in user space for ease of debugging; the functionality
can easily be migrated into kernel space, thereby eliminating this system call overhead.
Overall, our results demonstrate the feasibility of using a reinforcement learning approach
for dynamic storage bandwidth allocation in large storage systems.
4.5 Related Work
[Figure: memory usage as a percentage of 1 GB RAM vs. number of disks (50 to 500).]
Figure 4.9. Memory overheads of the bandwidth allocator.

The design of a self-managing storage system involves several sub-tasks and issues
such as self-configuration [7, 10], capacity planning [14], automatic RAID-level selection
[11], initial storage system configuration [9], SAN fabric design [59] and on-line data mi-
gration [39]. These efforts are complementary to our work, which focuses on automatic
storage bandwidth allocation to applications with varying workloads.
Several other approaches, ranging from control theory to online measurements and op-
timizations, can also be employed to address the problem of dynamic bandwidth allocation
in storage systems. Subsequent to our work [57], control-theoretic and measurement-based
techniques [33, 40] have been proposed for managing storage bandwidth. Control-theory-
based techniques [3] as well as online measurements and optimizations [12, 47] have also
been employed for dynamically allocating resources in web servers. Utility-based opti-
mization models for dynamic resource allocation in server clusters have been employed
in [19]. Feedback-based dynamic proportional-share allocation to meet real-rate disk I/O
requirements has been studied in [48]. While many feedback-based methods involve ap-
proximations such as the assumption of a linear relationship between resource share and
response time, no such limitation exists for reinforcement learning: due to their search-
based approach, such techniques can easily handle non-linearity in system behavior. Al-
ternative techniques based on linear programming also make the linearity assumption, and
need a linear objective function to be minimized; such a linear formulation may not be
possible or might turn out to be inaccurate in practice. On the other hand, a hill-climbing
based approach can handle non-linearity, but can get stuck in local maxima.
Finally, reinforcement learning has also been used to address other systems issues such
as dynamic channel allocation in cellular telephone systems [55] and adaptive link alloca-
tion in ATM networks [44].
4.6 Concluding Remarks
In this chapter, we addressed the problem of dynamically allocating storage bandwidth
to application classes so as to meet their response time requirements. We presented an
approach based on reinforcement learning to address this problem. We argued that a sim-
ple learning-based approach is not practical since it incurs significant memory and search-
space overheads. To address this issue, we used application-specific knowledge to design
an efficient, practical learning-based technique for dynamic storage bandwidth allocation.
Our approach can react to dynamically changing workloads, provides isolation to application
classes, and is stable under overload. Further, our technique learns online and does not require any
a priori training. Unlike other feedback-based models, an additional advantage of our tech-
nique is that it can easily handle complex non-linearity in system behavior. We have
implemented our techniques in the Linux kernel and evaluated them using prototype experi-
mentation and trace-driven simulations. Our results show that (i) the use of learning enables
the storage system to reduce the number of QoS violations by a factor of 2.1 and (ii) the im-
plementation overheads of employing such techniques in operating system kernels are small.
Overall, our work demonstrates the feasibility of using reinforcement learning techniques
for dynamic resource allocation in storage systems.
CHAPTER 5
AUTOMATED OBJECT REMAPPING FOR LOAD BALANCING LARGE SCALE STORAGE SYSTEMS
5.1 Introduction
In the last three chapters we looked at problems in the context of initial configuration
and short-term reconfiguration of storage systems. Some reconfiguration tasks need to be
executed infrequently and are necessitated by the aging of the storage system, long-term
workload changes, the need for growth, and so on. In this chapter, we focus on one such long-term
reconfiguration task.
Suitable initial placement obviates the need for frequent reconfiguration, and auto-
mated bandwidth allocation, which uses controlled request throttling, helps extract good
performance from the system in the face of transient workload changes. Persistent work-
load changes, which stress the storage system and result in hotspots, make it neces-
sary to tune the mapping of storage objects to arrays to ensure acceptable performance.
5.1.1 Motivation
As mentioned in Chapter 2, in storage systems, object placement (the mapping of stor-
age objects to storage devices) is crucial as it dictates the performance of the storage sys-
tem. Consequently, extreme care is taken during capacity planning and the initial configuration
of such systems [9, 14].
Although the initial configuration may be load-balanced, over time, growth in storage
space usage and changes in workload can cause load imbalances and workload hotspots.
This in turn may necessitate a reconfiguration.
Hotspots in storage systems can occur for one of two reasons. Incorrect or insufficient
workload information during storage system configuration may result in heavily accessed
objects being mapped to the same set of storage devices, thus creating hotspots. Long-
term workload changes or the addition of a new object to a balanced system may also induce
hotspots¹. Hotspots result in increased response times and a loss in throughput for ap-
plications accessing the storage system. When hotspots do occur, the mapping of objects
to storage devices needs to be revisited to ensure that the bandwidth utilization of all de-
vices is below a certain threshold, so that applications see acceptable performance. Such a
reconfiguration is undesirable because it is concomitant with downtime, or with a potential per-
formance impact on the applications accessing the storage system while the reconfiguration
is in progress.
Sophisticated enterprise storage subsystems come with tools to facilitate the process
of load balancing to address hotspots [2]. These allow load balancing to be either car-
ried out manually or in an automated fashion. For manual reconfiguration, administrators
use information from a workload analyzer component, which collects performance data and
summarizes the load on the component storage devices. The tool also provides the potential
performance impact of moving an object so that the user can make an informed decision.
The automated load-balancing component, on the other hand, is self-driven, runs continu-
ously, and uses the information from the workload analyzer to swap hot and cold objects
when necessary.
A drawback of a manual process is that it requires human oversight. Moreover, the
procedure can be error-prone, and human errors during the reconfiguration process may
worsen performance. While an automated process addresses these drawbacks, a simple
approach that swaps hot and cold objects will work all the time only if objects are of
similar size. If objects are of different sizes, then more sophisticated strategies are required.
¹Note that this can happen irrespective of whether the system is narrow-striped or wide-striped.
This motivates the need for more sophisticated approaches that search for a configuration
with no hotspots.
Moving the system to a new configuration involves executing a migration plan, which is a sequence of object moves. The reconfiguration itself could be carried out either online or offline. In both cases, the scale of the reconfiguration, i.e., the amount of data that needs to be displaced, is of consequence. While for an offline reconfiguration the scale of the reconfiguration determines the duration of the reconfiguration and hence the downtime, for an online reconfiguration it determines the duration of the performance impact on foreground applications.
Existing approaches do not optimize for the scale of the reconfiguration, possibly moving much more data than required to remove the hotspot. This motivates the need for a load-balancing approach that takes the sizes of objects and their current mapping to storage devices into account. This is the subject matter of this chapter.
5.1.2 Research Contributions
In this chapter, we develop algorithms to minimize the amount of data displaced during a reconfiguration to remove hotspots in large-scale storage systems.
Rather than identifying a new configuration from scratch, which may entail significant data movement, our approach uses the current object configuration as a hint, the goal being to retain most of the objects in place and thus limit the scale of the reconfiguration.
The key idea in our approach is to greedily displace excess bandwidth from overloaded to underloaded storage devices. This is achieved in one of two ways: (i) displace, which involves reassigning objects from overloaded devices to underloaded ones, and (ii) swap, which involves swapping objects between overloaded and underloaded devices. The swap step is useful when the spare storage space on the underloaded devices is insufficient to accommodate any additional objects, and an object reconfiguration, short of a reconfiguration from scratch, would have to entail a swapping of objects, or groups of objects, between storage devices.
To minimize the amount of data that needs to be moved, we use the bandwidth to space ratio (BSR) as a guiding metric. For example, by selecting high-BSR objects for reassignment in the displace step, we are able to displace more bandwidth per unit of data moved. Here, bandwidth (space) refers to the bandwidth (storage space) requirement of the storage object. We propose various optimizations, including searching for multiple solutions, to counter the pitfalls of a greedy approach.
We also describe a simple measurement-based technique for identifying hotspots and
for approximating per-object bandwidth requirements.
Finally, we evaluate our techniques using a combination of simulation studies and an evaluation of an implementation in the Linux kernel. Results from the simulation study suggest that, for a variety of system configurations, our approach reduces the amount of data moved to remove the hotspot by a factor of two compared to other approaches. The gains increase with system size and the magnitude of overload. Experimental results from the prototype evaluation suggest that our measurement techniques correctly identify workload hotspots. For some simple overload configurations considered in the prototype, our approach identifies a load-balanced configuration which minimizes the amount of data moved. Moreover, the kernel enhancements do not result in any noticeable degradation in application performance.
The rest of the chapter is structured as follows. In Section 5.2, we describe the problem addressed in this chapter. Section 5.3 presents object remapping techniques for load-balancing large scale storage systems. Section 5.4 presents the methodology used for measuring object bandwidth requirements and for identifying hotspots. Section 5.5 presents the details of our prototype implementation, and Section 5.6 presents the experimental results. Section 5.7 discusses related work, and finally, Section 5.8 presents our conclusions.
5.2 Problem Definition
5.2.1 System Model
Large scale storage systems consist of a large number of disk arrays. We assume, as is typically the case, that each disk array consists of disks of the same type. Different disk arrays, however, could have disks of different types. The disks in a disk array are grouped into some number of logical units (LUs); an LU is a set of disks combined using RAID techniques [45].
An object configuration indicates the mapping of storage objects to storage devices. Here, an object is the equivalent of a logical volume (LV), such as a database or a file system, and is allocated storage space by concatenating space from one or more LUs. From here on, we use the terms LV and object interchangeably.
In our model, we assume that all the LUs an LV is striped over are similar, i.e., they have the same RAID level and comprise disks of the same type. This is generally true in practice, since it ensures the same level of redundancy and similar access latency for all the stripe units of an LV. We further make the simplifying assumption that if any two LVs have an LU in common, they have all their component LUs in common. This assumption is also generally true in well-planned storage system configurations, as in such a configuration each object is subject to uniform inter-object workload interference on all of its component LUs. With this assumption, the set of LUs an LV is striped over can be thought of as a single logical device for load balancing purposes. From here on, we refer to such a logical device as a logical array, or array for short. Figure 5.1 illustrates the system model.
5.2.2 Problem Formulation
Assuming the above system model, let us now formulate the problem addressed in this chapter. Consider a storage system which consists of $n$ arrays, $A_1, A_2, \ldots, A_n$. There are $m$ LVs, $L_1, L_2, \ldots, L_m$, which populate the storage system. Each LV is mapped to a single array. Each array $A_j$ has storage capacity $S_j$ and bandwidth capacity $B_j$. Similarly, each LV
The figure shows two disk arrays, each comprising four LUs. Each LU consists of five disks. The disk array on the top comprises one logical array over which three LVs have been striped. The second disk array comprises two logical arrays, each comprising two LUs, with three LVs striped over each logical array.

Figure 5.1. System model.
$L_i$ has storage requirement $s_i$ and bandwidth requirement $b_i$. For a balanced configuration, which is defined to be a configuration without any hotspots, it is required that the percentage bandwidth utilization of each array $A_j$ not exceed some threshold $\tau_j$ ($0 < \tau_j < 1$). The space and bandwidth constraints on an array $A_j$ are given by the following equations:

$$\sum_i s_i \, x_{ij} \le S_j \qquad (5.1)$$

$$\sum_i b_i \, x_{ij} \le \tau_j \cdot B_j \qquad (5.2)$$

Here, $x_{ij}$ is a mapping parameter that denotes whether object $i$ is mapped to array $j$: $x_{ij}$ equals 1 if array $j$ holds object $i$, and is 0 otherwise. Although the space constraint is a hard constraint and cannot be violated, an array may observe a violation of the bandwidth constraint if the bandwidth requirements of the objects mapped to the array increase. If the bandwidth utilization on an array exceeds the corresponding bandwidth threshold, it is considered overloaded; otherwise, it is underloaded.
Moving the system to a new configuration results in a change of mapping parameters. Let $x^{old}_{ij}$ and $x^{new}_{ij}$ denote the mapping parameter for LV $i$ on array $j$ in the old and new configurations, respectively. If $|x|$ denotes the absolute value of $x$, then $\sum_j |x^{new}_{ij} - x^{old}_{ij}| = 2$ if the mapping of object $i$ has changed, and 0 otherwise. The cost of the reconfiguration, defined as the amount of data moved to realize the new configuration, is then given by:

$$Cost = \sum_i \sum_j |x^{new}_{ij} - x^{old}_{ij}| \cdot s_i / 2 \qquad (5.3)$$

Let $O$ be the set of overloaded arrays in a configuration. The bandwidth violation for an overloaded array $j$ is $(\sum_i b_i \, x_{ij}/B_j - \tau_j)$. The cumulative bandwidth violation, defined as the sum of the bandwidth violation over all overloaded arrays, is then given by:

$$Overload = \sum_{j \in O} \left( \sum_i b_i \, x_{ij}/B_j - \tau_j \right) \qquad (5.4)$$
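These two metrics translate directly into code. The sketch below is purely illustrative and uses dictionary-based structures of our own devising (the names `cost`, `overload`, `size`, `bw`, `B`, `tau` are not from the thesis prototype):

```python
def cost(old, new, size):
    """Cost of a reconfiguration (equation 5.3): an object contributes
    its size s_i exactly once if its array assignment changed."""
    return sum(size[i] for i in old if new[i] != old[i])

def overload(mapping, bw, B, tau):
    """Cumulative bandwidth violation (equation 5.4), summed over the
    set of overloaded arrays only."""
    util = {j: 0.0 for j in B}
    for i, j in mapping.items():
        util[j] += bw[i]  # aggregate b_i on each array
    return sum(util[j] / B[j] - tau[j] for j in B if util[j] / B[j] > tau[j])
```

For example, moving a single 10 GB object contributes a cost of 10, regardless of its bandwidth requirement.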
Given an object configuration with some overloaded arrays and some underloaded arrays,
the goal is to identify a balanced configuration which can be realized at the least cost. Given
two new configurations, both of which satisfy the space and bandwidth constraints on all
arrays, the one that can be realized at a lower cost is preferable.
For cases where a balanced configuration cannot be found, the goal of load balancing is a policy decision. One may require that Overload (equation 5.4) be minimized, but when displacing excess bandwidth from overloaded arrays, the bandwidth constraint on the underloaded arrays should not be violated, and the utilization on an already overloaded array should not increase further. In some cases, absolute load balancing may be desirable, thus requiring that the maximum percentage bandwidth violation across all arrays be minimized. We refer to approaches that adhere to the former policy as fair, and those that conform to the latter as absolute. Another dimension in this context is cost: absolute load balancing may incur a significantly higher cost. A complete evaluation of the tradeoffs of gains (balance achieved) versus cost (amount of data moved) of these policies is beyond the scope of this thesis.
In this chapter, the goal is to design a reconfiguration algorithm for identifying a balanced configuration which has the least cost.
5.3 Object Remapping Techniques
There are two kinds of approaches to load balancing: (i) those that reconfigure from scratch, and (ii) those that start with the current configuration and aim to minimize the cost of reconfiguration. We refer to the former class of approaches as cost oblivious and the latter as cost aware.
In the following, an assignment of an object to an array is said to be valid if the new object can be accommodated on the array without any constraint violations (equations 5.1 and 5.2).
5.3.1 Cost Oblivious Object Remapping
In this section, we present two cost-oblivious object remapping algorithms to remove
hotspots in large scale storage systems. We first present a randomized algorithm, and then
another, which is deterministic in nature.
5.3.1.1 Randomized Packing
Heuristics based on best-fit bin-packing have been used in [9] for initial storage system configuration. There, the goal was to identify a configuration which uses the least number of devices to meet the space and bandwidth requirements of a given set of objects. In our problem, the number of devices is given, and the goal is to identify a valid packing which can be realized at the least cost. We first present a randomized algorithm and then present two variations of the same.
Initially, all the objects are unassigned. A random permutation of the objects is created, and the objects are assigned to arrays picked at random from the set of all arrays. All arrays may need to be tried for an object in the worst case. If all the objects can be validly assigned to some array, we have a balanced configuration. The procedure can be repeated multiple times with different permutations of objects, and of the multiple trials which result in a balanced configuration, the one with the least cost is chosen. Note that this makes the approach semi cost aware.
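The randomized packing procedure above can be sketched as follows. All data structures and names here are our own assumptions, not the thesis prototype's: `objects` maps an object id to (size, bandwidth), `arrays` maps an array id to (space, bandwidth budget), where the budget stands for $\tau_j \cdot B_j$, and `old_cfg` is the current mapping used only to score cost.

```python
import random

def randomized_packing(objects, arrays, old_cfg, trials=100, seed=0):
    """Randomized packing with multiple trials; of the trials that yield
    a balanced configuration, the least-cost one is kept (semi cost aware)."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        order = list(objects)
        rng.shuffle(order)                    # random permutation of objects
        spare = {j: list(arrays[j]) for j in arrays}  # [space, bw] remaining
        cfg = {}
        for i in order:
            s, b = objects[i]
            candidates = list(arrays)
            rng.shuffle(candidates)           # arrays tried in random order
            for j in candidates:              # worst case: all arrays tried
                if spare[j][0] >= s and spare[j][1] >= b:
                    spare[j][0] -= s          # valid assignment found
                    spare[j][1] -= b
                    cfg[i] = j
                    break
            else:
                break                         # object unplaceable: trial fails
        if len(cfg) == len(objects):          # a balanced configuration
            c = sum(objects[i][0] for i in cfg if cfg[i] != old_cfg[i])
            if best is None or c < best[0]:   # keep the cheapest trial
                best = (c, cfg)
    return best[1] if best else None
```

The best-fit and worst-fit variants described next would replace the random choice of destination array with a deterministic pick among the valid candidates.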
As opposed to the completely randomized approach, where both the objects and the arrays are chosen randomly, two partly randomized variants of interest are described next. In a best-fit version, of all possible valid assignments for an object, the object is assigned to the array such that the new bandwidth utilization across all arrays as a result of this assignment is a maximum. A complementary approach is also possible, worst-fit, where of all possible valid assignments, the array picked is such that the new bandwidth utilization across all arrays as a result of this assignment is a minimum. Whereas best-fit may fare better in finding a balanced configuration in bandwidth constrained scenarios, worst-fit may yield a configuration with similar bandwidth utilization on all arrays. Consequently, in less bandwidth constrained scenarios, when the arrays have utilization values well below their corresponding thresholds, worst-fit may be advantageous, since with more headroom, arrays can absorb workload variations better.
5.3.1.2 BSR-based Approach
Bandwidth to Space Ratio (BSR) has been used as a metric for video placement [24]. This derives from the value-per-unit-weight heuristics used for knapsack problems. The knapsack heuristic involves greedily selecting items ordered by their value per unit weight in order to maximize the value of the items in the knapsack. The approach described next uses BSR as a guiding metric, but, as explained later, for slightly different reasons.

The BSR of an object is defined to be the ratio of its bandwidth requirement to its space requirement. We define the spare BSR of an array as the ratio of its spare bandwidth capacity to its spare space capacity. Thus, the spare BSR of an array is a dynamic quantity which depends on the objects currently assigned to it.
Initially, all the objects are unassigned. Objects are picked in order of their BSR from the set of all objects and assigned to arrays picked in order of their spare BSR from the set of all arrays. If a valid assignment is found for all the objects, we have a balanced configuration. Note that the spare BSR of the array an object is assigned to is updated appropriately after each valid assignment.

The intuition behind using BSR as a metric is that assigning high-BSR objects to arrays with a high spare BSR possibly results in a better utilization of bandwidth per unit space in the system, and hence a tighter packing. A tighter packing increases the likelihood of finding a balanced configuration.
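The BSR-ordered assignment can be sketched as below, again under assumed structures of our own (`objects` maps id to (size, bandwidth); `arrays` maps id to (space, bandwidth budget)); the spare BSR is recomputed from the spare capacities after each assignment, matching its dynamic definition above.

```python
def bsr_packing(objects, arrays):
    """Deterministic BSR-based packing: high-BSR objects are assigned
    to the arrays with the highest spare BSR first."""
    spare = {j: list(arrays[j]) for j in arrays}  # [spare space, spare bw]
    cfg = {}
    # Objects in descending order of BSR = bandwidth / space.
    for i in sorted(objects, key=lambda i: objects[i][1] / objects[i][0],
                    reverse=True):
        s, b = objects[i]
        # Arrays in descending order of spare BSR = spare bw / spare space.
        for j in sorted(spare, key=lambda j: (spare[j][1] / spare[j][0]
                                              if spare[j][0] > 0 else -1.0),
                        reverse=True):
            if spare[j][0] >= s and spare[j][1] >= b:  # valid assignment
                spare[j][0] -= s      # spare BSR is implicitly updated
                spare[j][1] -= b
                cfg[i] = j
                break
        else:
            return None               # some object could not be placed
    return cfg
```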
5.3.2 Cost-aware Object Remapping
In this section, we present two cost aware algorithms for searching for a balanced configuration. The first is a randomized algorithm and the second is a deterministic greedy algorithm. Both approaches start with the current configuration and change the mapping of the objects incrementally until a balanced configuration is achieved. Thus, these approaches use the current configuration as a hint and aim to retain most of the objects in place, possibly resulting in a lower cost of reconfiguration.
5.3.2.1 Randomized Object Reassignment
This approach is similar in principle to the randomized approach described in Section 5.3.1.1, except that it starts with the current configuration. Given the current configuration, a random permutation of the objects on all the overloaded arrays is created. These objects are then assigned, in order, to underloaded arrays picked at random from the set of all underloaded arrays. It is possible that all the underloaded arrays need to be tried before a valid assignment is found for an object. This is done until a fraction frac of the objects has been considered, or the system has reached a balanced configuration.

Once an overloaded array becomes underloaded, the objects on the now underloaded array are not considered for reassignment, and the array is thereafter treated as underloaded for load balancing purposes. This procedure can be repeated multiple times with different permutations, and of the multiple trials which result in a balanced configuration, the one with the least cost is chosen.
Again, as opposed to a completely randomized approach, there are best-fit and worst-fit variants of the algorithm. The variants are similar to those described for the approach in Section 5.3.1.1.
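One pass of the randomized reassignment can be sketched as follows; the data structures and the interpretation of `frac` (the fraction of hot objects considered) are our own assumptions, not the thesis prototype's.

```python
import math
import random

def randomized_reassignment(objects, arrays, cfg, frac=0.5, seed=0):
    """Move randomly chosen objects off overloaded arrays onto randomly
    chosen underloaded arrays, starting from the current configuration.
    objects: id -> (size, bandwidth); arrays: id -> (space, bw budget);
    cfg: current object-to-array mapping, used as the hint."""
    rng = random.Random(seed)
    cfg = dict(cfg)

    def load(j):                              # (used space, used bw) on array j
        ids = [i for i in cfg if cfg[i] == j]
        return (sum(objects[i][0] for i in ids),
                sum(objects[i][1] for i in ids))

    def overloaded(j):
        return load(j)[1] > arrays[j][1]

    hot = [i for i in cfg if overloaded(cfg[i])]
    rng.shuffle(hot)                          # random permutation of hot objects
    for i in hot[:math.ceil(frac * len(hot))]:
        src = cfg[i]
        if not overloaded(src):               # array cooled down: now treated as
            continue                          # underloaded, leave its objects
        s, b = objects[i]
        targets = [j for j in arrays if j != src and not overloaded(j)]
        rng.shuffle(targets)
        for j in targets:                     # may need to try all of them
            us, ub = load(j)
            if us + s <= arrays[j][0] and ub + b <= arrays[j][1]:
                cfg[i] = j                    # valid reassignment found
                break
        if not any(overloaded(j) for j in arrays):
            break                             # balanced: stop early
    return cfg
```

As in Section 5.3.1.1, repeating this with different seeds and keeping the least-cost balanced outcome yields the multi-trial variant.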
Drawbacks: In Section 5.3.1 we presented two approaches which do not take the current configuration into account while searching for a balanced configuration and are therefore typically associated with a large cost of reconfiguration. However, they are useful for initial storage system configuration. The randomized object reassignment approach described above starts with the current configuration. This approach, however, also has two drawbacks:

- Possibly high reconfiguration cost: In this approach, the object to be reassigned is picked at random. Since the search is not exhaustive, it can still result in a large amount of data being moved, or it may fail to find a balanced configuration. Even though an exhaustive search is not feasible, carefully choosing both the object and the array to which it is to be assigned, taking into account their respective space and bandwidth attributes, could be beneficial.

- Simple reassignment: If the storage system does not have the right combination of spare space and spare bandwidth on the constituent arrays, a simple reassignment of objects may not yield a balanced configuration. Barring a reconfiguration from scratch, which may entail a high cost, a low cost reconfiguration in such scenarios would have to involve swapping objects, or groups of objects, between arrays. The diverse space and bandwidth requirements of the objects, coupled with the diverse space and bandwidth constraints on the arrays that comprise the storage system, makes this non-trivial.

In the following section, we present an approach that addresses these drawbacks.
5.3.2.2 Displace and Swap
The key idea in this approach is to greedily displace excess bandwidth from overloaded arrays to underloaded arrays. The goal is to identify a set of objects, while taking into account their sizes, that need to be moved from their original locations in order to attain a balanced configuration. BSR is used as a guiding metric in order to minimize the amount of data that needs to be displaced. Here, by object size we refer to the storage space requirement of an object.
There are two basic steps which comprise this approach. The first, referred to as displace, involves reassigning objects from overloaded arrays to underloaded arrays. The second step, referred to as swap, involves swapping objects between overloaded and underloaded arrays. The second step is invoked only if the first step alone does not yield a balanced configuration. The goal is to first offload as much excess bandwidth from an overloaded array as possible using one-way object moves (displace), and if this does not suffice, to search for two-way object moves (swap). The intuition is that one-way object moves, on average, require less data movement than a solution involving two-way object moves. One-way object moves are also preferable to two-way object moves as they do not require any scratch space2 to achieve the reconfiguration.
Displace: In this step, the goal is to use any spare space on the underloaded arrays
to accommodate objects from overloaded arrays and thus offload excess bandwidth. Only
underloaded arrays with spare space are considered as potential destinations during object
reassignment.
Since the goal is to remove excess bandwidth from each overloaded array while moving the least amount of data, we consider objects from each overloaded array one by one. This allows us to optimize for the amount of data displaced from each overloaded array.
The overloaded arrays themselves could be considered in any order. To achieve a balanced configuration, the bandwidth utilization on all the overloaded arrays needs to be reduced below the corresponding threshold. So, we consider the overloaded arrays in descending order of the magnitude of bandwidth violation ($\sum_i b_i \, x_{ij} - \tau_j \cdot B_j$). This has the advantage that if the displace step is unable to identify a balanced configuration, there is less bandwidth that needs to be moved off each overloaded array, on average, in the swap step.
Finally, for a given overloaded array, objects on the array are considered for reassignment in descending order of their BSR. This is in order to minimize the amount of data displaced, as of all objects, the object with the maximum BSR displaces the most bandwidth per unit of data moved. The destination underloaded array for reassigning an object is chosen to be the one with the maximum spare BSR. The reason is similar to that for the approach in Section 5.3.1.2. This completes the essence of the displace step.

2Swapping objects between arrays with little spare storage space may require using scratch storage space.
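The orderings that define the displace step can be sketched as follows; this is a simplified illustration under assumed data structures (`objects` maps id to (size, bandwidth); `arrays` maps id to (space, bandwidth budget); `cfg` maps object id to array id), omitting the soloSoln/grpSoln bookkeeping described next.

```python
def displace(objects, arrays, cfg):
    """One pass of the displace step: most-overloaded arrays first,
    high-BSR objects first, highest spare-BSR destination first."""
    cfg = dict(cfg)

    def used(j):                              # (used space, used bw) on array j
        ids = [i for i in cfg if cfg[i] == j]
        return (sum(objects[i][0] for i in ids),
                sum(objects[i][1] for i in ids))

    def violation(j):                         # used bandwidth minus budget
        return used(j)[1] - arrays[j][1]

    # Overloaded arrays in descending order of bandwidth violation.
    for src in sorted((j for j in arrays if violation(j) > 0),
                      key=violation, reverse=True):
        # Objects on the array in descending order of BSR.
        for i in sorted((i for i in cfg if cfg[i] == src),
                        key=lambda i: objects[i][1] / objects[i][0],
                        reverse=True):
            if violation(src) <= 0:
                break                         # hotspot removed
            s, b = objects[i]
            # Underloaded destinations, maximum spare BSR first.
            cands = [j for j in arrays if j != src and violation(j) <= 0]
            cands.sort(key=lambda j: ((arrays[j][1] - used(j)[1]) /
                                      max(arrays[j][0] - used(j)[0], 1e-9)),
                       reverse=True)
            for j in cands:
                us, ub = used(j)
                if us + s <= arrays[j][0] and ub + b <= arrays[j][1]:
                    cfg[i] = j                # valid one-way move
                    break
    return cfg
```

On a small example, the highest-BSR object on the hot array is moved first, which displaces the most bandwidth per unit of data moved.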
Object reassignments that remove the hotspot on an overloaded array could be single-object or multi-object. Any valid single-object reassignment that can remove the hotspot on the overloaded array is referred to as a soloSoln. Any reassignment comprising multiple objects that removes the hotspot is referred to as a grpSoln. Any reassignment comprising one or more objects that is not able to remove the hotspot is referred to as a semiSoln. We refer to both grpSoln and semiSoln as soln for short.

It is possible that choosing objects for reassignment strictly in order of BSR, as described above, results in a soloSoln appearing as part of a grpSoln. So, we identify all soloSolns before searching for grpSolns.

- Identifying a soloSoln: Any object on the overloaded array that can be validly assigned to some underloaded array, and that also removes the hotspot, qualifies as a soloSoln. Any object that can be validly assigned but does not remove the hotspot is put in a set R. The set R, which is devoid of soloSolns, is used to identify grpSolns in the next step.

  A minor optimization is possible here: if all the objects in the set R are larger than the smallest soloSoln, there is no need to execute the following step.
This is because any grpSoln would only have a higher cost.

- Identifying a soln: In this step we search for a grpSoln using BSR as the guiding metric. Given the set R of objects, objects picked in descending order of BSR are assigned to underloaded arrays chosen in descending order of spare BSR. This is done until either all the objects on the overloaded array have been considered, or the set of reassignments so far is able to remove the hotspot. If the hotspot could be removed, we have a grpSoln; otherwise, we have a semiSoln.
In the above step, for identifying a soln, the objects were selected greedily based on their BSR. However, such a greedy approach could make some wrong choices, which could result in (i) a higher cost solution, or (ii) an inability to remove the hotspot on the overloaded array. While an exhaustive search is infeasible, the following optimizations try to address at least some of the wrong choices.

These optimizations essentially involve questioning the choice of each object that comprises the soln. Any soln can be thought of as comprising two parts: one, the highest-BSR object, referred to as the root; all the remaining objects in the soln, if any, comprise the second part. Whereas the first optimization questions the choice of the root, the second questions the choice of each of the remaining objects that comprise the soln.
Also, while improving a grpSoln requires finding another with a lower cost, improving a semiSoln means finding a grpSoln or another semiSoln which displaces more bandwidth.

- Optimization 1: Identifying multiple solns. In this optimization, we identify solns with different elements of the set R as the root. Note that, for a given root, only objects with a lower BSR than the root are considered for reassignment. This optimization gives us multiple solns; the number of such solns equals the number of objects in the set R.

- Optimization 2: Backtracking. This optimization involves backtracking on the remaining objects that comprise a soln. We employ this optimization to improve each of the solns identified in the optimization above. Each backtracking step involves searching for a new soln while not considering an object that is part of the current soln. This is done for all the objects that comprise the soln except the root (the root has been optimized for in the previous step).

  If backtracking results in a better soln, backtracking on the previous soln is discontinued and restarted for the new soln. It is possible that this procedure continues to yield successively better solns; to limit the computational costs, we explore only a constant number of these.
Note that the above optimizations result in a strategy which lies somewhere between a purely greedy approach and one that exhaustively considers every combination.

If the above results in multiple grpSolns or soloSolns, the one with the least cost is chosen3. If, however, the above only results in object reassignments which reduce the bandwidth violation on the overloaded array (i.e., only semiSolns), the one which displaces the most bandwidth is chosen4. The mapping parameters ($x_{ij}$s) for the objects to be reassigned are adjusted appropriately. Note that this modified configuration serves as the starting configuration for the next overloaded array considered.
Once all the overloaded arrays have been considered, if the system is still not balanced, the swap step, described next, is invoked.
Swap: Displace works only when there is sufficient storage space on the underloaded arrays to accommodate objects from the overloaded arrays. In the absence of sufficient spare space, a low cost reconfiguration technique would require swapping objects between arrays. Such swaps could be two-way, i.e., involving two arrays, or multi-way. In this chapter, we describe a strategy for identifying two-way swaps.
In this step, the goal is to identify valid swaps of objects, or groups of objects, between overloaded and underloaded arrays, such that the bandwidth utilization on the overloaded array is reduced. By successively identifying such swaps, we can remove the hotspot on an overloaded array.

BSR is again used as the guiding metric. By swapping high-BSR objects on an overloaded array with low-BSR objects on an underloaded array, maximum bandwidth is displaced per unit of data moved.
3Ties are broken by choosing the one which displaces the least bandwidth, as this leaves more spare bandwidth on the underloaded array to accommodate future object moves.

4In this case, ties are broken by choosing the semiSoln with the least cost.
Swaps are searched for between pairs of an overloaded array and an underloaded array. Since all arrays need to be underloaded for a balanced configuration, overloaded arrays are considered in descending order of the magnitude of bandwidth violation. For each overloaded array, underloaded arrays are considered in descending order of spare bandwidth. This is done so that, for each pair considered, the maximum possible bandwidth is displaced.

The diverse space and bandwidth attributes of the objects and arrays make identifying valid swaps non-trivial, so we use a simple greedy approach guided by the BSR of the objects. Before we describe how a swap is identified, let us define what qualifies as a valid swap.

- Valid swap: A swap is valid if it does not violate the constraints on the underloaded array and decreases the bandwidth utilization on the overloaded array; however, it is not useful if this decrease is insignificant. So, we define a parameter ufrac which quantifies the utility of a swap. Let $bw_O$ and $bw_U$ be the cumulative bandwidth requirements of the sets of objects to be swapped from the overloaded and underloaded arrays, respectively. Then, for a swap to be valid, we require that:

$$bw_O - bw_U \ge ufrac \cdot bw_U \qquad (5.5)$$

  In other words, the decrease in bandwidth on the overloaded array, as a fraction of the bandwidth moved off the underloaded array, should exceed a certain minimum.
We classify the constraints that need to be satisfied for a swap to be valid as follows:

- Constraint C1: The swap should satisfy the bandwidth and space constraints on the underloaded array.

- Constraint C2: The swap should have a certain minimum utility (equation 5.5).

- Constraint C3: The swap should satisfy the space constraint on the overloaded array.
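The validity check for a candidate two-way swap can be sketched as below; the data layout and the default value of `ufrac` are our own assumptions for illustration.

```python
def valid_swap(out_objs, in_objs, under_spare, over_spare_space, ufrac=0.1):
    """Check constraints C1-C3 for a candidate two-way swap.
    out_objs / in_objs: lists of (size, bandwidth) leaving the overloaded
    / underloaded array; under_spare: (spare space, spare bandwidth) on
    the underloaded array; over_spare_space: spare space on the
    overloaded array; ufrac: minimum-utility parameter (equation 5.5)."""
    size_o = sum(s for s, _ in out_objs)
    bw_o = sum(b for _, b in out_objs)
    size_u = sum(s for s, _ in in_objs)
    bw_u = sum(b for _, b in in_objs)
    # C1: the underloaded array must absorb the net space and bandwidth change.
    c1 = (size_o - size_u <= under_spare[0] and
          bw_o - bw_u <= under_spare[1])
    # C2: minimum utility (equation 5.5).
    c2 = bw_o - bw_u >= ufrac * bw_u
    # C3: the overloaded array must have room for the incoming objects.
    c3 = size_u - size_o <= over_spare_space
    return c1 and c2 and c3
```

A swap that exchanges nearly equal bandwidths fails C2, and one that brings in a larger object than it sends out needs matching spare space on the overloaded array to pass C3.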
We now describe the approach for identifying a valid swap.

- Identifying a valid swap: Simply considering pairs of objects, one each from an overloaded array and an underloaded array at a time, may not result in any valid swap, while considering every combination of objects from the two arrays is infeasible. We therefore present a simple greedy approach to swap the equivalent of a high-BSR object from the overloaded array with the equivalent of a low-BSR object from the underloaded array. Identifying such a swap also displaces more bandwidth per unit of data moved.

  To identify such a swap, objects on the overloaded and underloaded arrays are sorted in descending and ascending order of BSR, respectively, to give the ordered sets $L^O_{lv}$ and $L^U_{lv}$. First, pairs of objects from these two ordered sets are considered: each object in $L^U_{lv}$ is considered, in order, for each object in $L^O_{lv}$, in order. If a pair meets the constraints for a valid swap, the objects are swapped.
If after considering all pairs the array is still overloaded, we seek to identify
contiguous sets of objects from these two ordered sets which constitute a valid swap.
Note that these contiguous sets of objects are the equivalent of a high-BSR and a
low-BSR object, respectively.

Ideally, contiguous sets of objects from these two ordered sets would be identified.
However, it is possible that no such sets satisfy all the constraints for a valid swap.
So, in the procedure described below, we first (step 1) identify contiguous sets that
satisfy two of the constraints; if these contiguous sets do not satisfy the third
constraint, possibly non-contiguous objects are picked in order to meet it (step 2).

Let LOsw and LUsw denote the sets of objects from the overloaded and underloaded arrays,
respectively, that are to be swapped.
– Satisfy C1 and C2: Contiguous elements from the ordered sets LOlv and LUlv are
incrementally added to the sets LOsw and LUsw, respectively, until C1 and C2 have been
satisfied. This gives a valid swap if C3 is also satisfied.

– Satisfy C3: If C3 has not been satisfied, additional objects from the set LOlv, picked
in order, are added to the set LOsw; an object is added only if it does not result in a
violation of C1 or C2. Objects are added until C3 has been satisfied. This may result in
LOsw being comprised of non-contiguous elements from the ordered set LOlv.

– Given a valid swap, the ordered sets LOlv and LUlv are updated to reflect the swap.

– If a valid swap was not found, the above steps are repeated, but now with the second
element of the ordered set LOlv as the first element added to the set LOsw, and so on.

– Swaps are searched for until the hotspot on the overloaded array has been removed.
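The steps above can be sketched as follows. This is an illustrative reconstruction, not the actual implementation: the dictionary-based array representation and all names are our own assumptions, and edge cases are handled in whatever way this simple code happens to, not necessarily as the real system does.

```python
def find_swap(over, under, ufrac):
    """Greedy search for a valid swap between an overloaded and an
    underloaded array. Arrays are dicts {'cap_bw', 'cap_sp', 'objs'},
    with objs a list of (size, bw) tuples. Returns (LOsw, LUsw) as
    index lists into the arrays' object lists, or None on failure."""
    o_objs, u_objs = over['objs'], under['objs']
    bsr = lambda o: o[1] / o[0]
    sp_over = sum(o[0] for o in o_objs)
    bw_under = sum(o[1] for o in u_objs)
    sp_under = sum(o[0] for o in u_objs)

    LO = sorted(range(len(o_objs)), key=lambda i: -bsr(o_objs[i]))  # descending BSR
    LU = sorted(range(len(u_objs)), key=lambda i: bsr(u_objs[i]))   # ascending BSR

    def c1(bw_o, sp_o, bw_u, sp_u):   # constraints on the underloaded array
        return (bw_under - bw_u + bw_o <= under['cap_bw']
                and sp_under - sp_u + sp_o <= under['cap_sp'])

    def c2(bw_o, bw_u):               # utility constraint, equation 5.5
        return bw_o - bw_u >= ufrac * bw_u

    def c3(sp_o, sp_u):               # space constraint on the overloaded array
        return sp_over - sp_o + sp_u <= over['cap_sp']

    for start in range(len(LO)):      # on failure, restart from the next root
        LOsw, LUsw = [LO[start]], []
        bw_o, sp_o = o_objs[LO[start]][1], o_objs[LO[start]][0]
        bw_u = sp_u = 0.0
        lo_rest, lu_rest = list(LO[start + 1:]), list(LU)
        # Step 1: grow contiguous sets until C1 and C2 hold.
        while not (c1(bw_o, sp_o, bw_u, sp_u) and c2(bw_o, bw_u)):
            if not c1(bw_o, sp_o, bw_u, sp_u) and lu_rest:
                j = lu_rest.pop(0)    # move a low-BSR object off the underloaded array
                LUsw.append(j)
                bw_u += u_objs[j][1]
                sp_u += u_objs[j][0]
            elif not c2(bw_o, bw_u) and lo_rest:
                j = lo_rest.pop(0)    # add the next high-BSR object for utility
                LOsw.append(j)
                bw_o += o_objs[j][1]
                sp_o += o_objs[j][0]
            else:
                break                 # no more objects to add
        if not (c1(bw_o, sp_o, bw_u, sp_u) and c2(bw_o, bw_u)):
            continue
        # Step 2: add further, possibly non-contiguous, objects from LOlv
        # until C3 holds; each addition must preserve C1 and C2.
        for j in lo_rest:
            if c3(sp_o, sp_u):
                break
            nb, ns = bw_o + o_objs[j][1], sp_o + o_objs[j][0]
            if c1(nb, ns, bw_u, sp_u) and c2(nb, bw_u):
                LOsw.append(j)
                bw_o, sp_o = nb, ns
        if c3(sp_o, sp_u):
            return LOsw, LUsw
    return None
```
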
If, after executing this step, there are no overloaded arrays, we have a balanced
configuration. Note that this simple greedy approach for swapping contiguous sets of
objects between two arrays may be sub-optimal; however, the parameter ufrac allows some
control over the utility of a swap. Figure 5.2 and the following example together
illustrate displace and swap.
Example: The figure illustrates how displace and swap work. Figure (a) shows two arrays
with bandwidth utilizations of 100% and 40%, respectively. Each box with a number
indicates an object, and an empty box indicates unallocated space. The number in a box
indicates the bandwidth requirement of the object. For simplicity, all objects are
assumed to be of unit size, so the bandwidth requirement of an object is also its BSR.
The bandwidth overload threshold is assumed to be 75% for both arrays. As Array 1 is
overloaded, the displace and swap algorithm proceeds as follows. The displace step is
invoked first as the
[Figure: five panels (a)–(e) showing Arrays 1 and 2 in the initial configuration (100%
and 40% utilization), the displace step, the configuration after displace (80% and 60%),
the swap step, and the configuration after the swap.]

Figure 5.2. Illustration of Displace and Swap.
underloaded array has one unit of spare space. Figures (b) and (c) illustrate an object
being moved from Array 1 to Array 2; the object selected is the one with a BSR of 20.
The object with BSR 70 could not be accommodated on the underloaded array due to
bandwidth constraints. Since Array 1 is still overloaded after the displace step, the
swap step is invoked. Figures (d) and (e) illustrate an object with BSR 10 being swapped
with an object with BSR 1. Note that pairs of objects are considered first; the object
with BSR 70 on Array 1 could not be swapped with any object on Array 2 without a
constraint violation. Since both arrays are now underloaded, the algorithm terminates.
5.4 Measuring Bandwidth Requirements and Detecting Hotspots
In the previous section, we presented techniques for identifying a balanced
configuration. The techniques assume that the bandwidth requirement of the objects is
known and so a hotspot can be identified. In this section, we describe techniques for
measuring the bandwidth requirements of objects and detecting hotspots in a real storage
system.
Measuring Bandwidth Requirements: Whereas the space requirement of an object is fixed at
object creation time⁵, the bandwidth requirement of an object depends on the current
workload. Unless the workload access pattern to the object is well characterized and
known a priori throughout the lifetime of the object, its bandwidth requirement needs to
be inferred from online measurements. We use a simple measurement-based technique to
approximate the bandwidth requirement of each object.
Recall that each object is assumed to be striped across some number of LUs in a logical
array. Given the request size (in sectors) and the first logical sector requested for
each request, one can infer the number of disks accessed. Note that the number of disks
accessed is upper bounded by the number of disks which comprise the logical array. This
technique requires that the number of disks each object is striped over and the RAID
level of the component LUs be known. Given the average latency and transfer rate of the
underlying disk, if a request req results in IOCount_req independent disk accesses and
SectorCount_req is the number of sectors requested, the percentage bandwidth utilization
of a logical array over a time window I due to accesses to object Li is given by:

  Σ_{req ∈ (I, Li)} (IOCount_req · (t_seek + t_rot) + SectorCount_req / r_xfr) / (I · numDisks)    (5.6)
Here, the summation is over req ∈ (I, Li), i.e., over requests that accessed object Li
and completed in the time window I; numDisks is the number of disks in the underlying
logical array; and t_seek, t_rot and r_xfr are the average seek time, average rotational
latency and average transfer rate, respectively, of the underlying storage device. The
above expression computes the disk-head busy time per unit time per disk due to requests
accessing object Li in a time duration I, thus giving the array utilization due to
accesses to the object.
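For concreteness, equation 5.6 can be computed in a few lines. The sketch below is illustrative; the request-record layout and the default disk parameters are assumptions, not values measured from the prototype.

```python
def array_utilization(requests, interval, num_disks,
                      t_seek=0.005, t_rot=0.003, r_xfr=40000.0):
    """Estimate the fractional utilization of a logical array due to one
    object over a time window (equation 5.6).

    requests:  iterable of (io_count, sector_count) pairs, one per request
               to the object that completed within the window
    interval:  window length I in seconds
    num_disks: number of disks in the underlying logical array
    t_seek, t_rot: average seek time and rotational latency in seconds
               (assumed values, not measured ones)
    r_xfr:     average transfer rate in sectors per second (assumed value)
    """
    # Disk-head busy time contributed by each request: seek + rotation per
    # independent access, plus the pure transfer time of the sectors moved.
    busy = sum(io_count * (t_seek + t_rot) + sector_count / r_xfr
               for io_count, sector_count in requests)
    # Busy time per unit time per disk gives the array utilization.
    return busy / (interval * num_disks)
```

For example, 100 requests of 32 sectors each, each hitting one disk, over a 10 s window on a 4-disk array give 100 · (0.008 + 32/40000) / (10 · 4) = 0.022, i.e., 2.2% utilization.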
⁵Note that the space requirement refers to the size of the corresponding logical volume
and not the actual storage space in use. Moreover, extending a logical volume is an
infrequent operation, and a consequent change in space requirement is easily
accommodated.
We use this utilization figure as a measure of the bandwidth requirement of an object.
Note that this is the perceived bandwidth requirement of the object, and assumes that
the workload accessing the object is able to express itself in the presence of
inter-object interference, i.e., accesses to other objects on the same logical array.
Moving the object to a similar array with less load may result in a different bandwidth
utilization.
A limitation of our approach is that it works only for similar logical arrays. An
approach used in practice is the IOPS measure for characterizing object bandwidth
requirements and array bandwidth capacity. A limitation of such a characterization,
however, is that it implicitly assumes a basic transfer size, or amount of data accessed
per IO. For objects with different stripe unit sizes mapped to the same array, such a
technique may not be accurate. Our approach does not have this drawback.
Identifying Hotspots: The above technique gives the bandwidth utilization on an array
due to an object mapped to it. The bandwidth utilization of the logical array can now be
approximated as the summation of the bandwidth utilizations of all the objects mapped to
the array. An array is overloaded if its bandwidth utilization exceeds a certain
threshold (equation 5.2).
An approach which offers flexibility in defining a hotspot is one using percentiles. The
bandwidth utilization is averaged over an interval I, and an overload is signaled if a
chosen percentile (perc) of the utilization samples collected in a time window W exceeds
the threshold. Since the utilization for each logical volume is computed separately
(equation 5.6), one can compute this percentile for each logical volume and use their
summation as the measure of the bandwidth utilization of the array.
5.5 Implementation Considerations
We have implemented our techniques in the Linux kernel version 2.6.11. Our prototype
consists of two components: (i) kernel hooks to monitor IO completions for each logical
volume, and (ii) a user space reconfiguration module which uses statistics collected in the
kernel to estimate bandwidth requirements, computes a new configuration if a hotspot is
detected, and migrates the requisite LVs appropriately.

Our prototype was implemented on a Dell PowerEdge server with two 933 MHz Pentium III
processors and 1 GB of memory, running Fedora Core 2.0. The server contains an Adaptec
3410S U160 SCSI RAID controller card connected to two Dell PowerVault disk packs
comprising 20 disks altogether; each disk is a 10,025 RPM Ultra-160 SCSI Fujitsu
MAN3184MC drive with 18 GB of storage.
The kernel portion of the code involved adding appropriate code and data structures to
enable collecting statistics for each LV. The 2.6 kernel uses bio as the basic
descriptor for IOs to a block device. On IO completion, the routine bio_endio is invoked
by the device interrupt handler; it is here that we do the bookkeeping for each LV
separately. This is facilitated because each LV created using the Linux logical volume
manager (LVM) has a separate device identifier, and the device identifier for which the
IO was performed is available in the bio descriptor.
The user space reconfiguration module makes a system call periodically to query the
statistics from the kernel. The statistics are the sectorCount and IOCount (see Section
5.4), which are used to approximate the bandwidth requirement of an LV. The system call
also automatically resets the kernel statistics. We also provide two additional system
calls which allow selective enabling and disabling of statistics collection for an LV.
Statistics collection is enabled by default for an LV when it is activated (in LVM
terminology), and is thus registered with the kernel. Deactivating an LV automatically
disables statistics collection for it. Finally, note that the implementation required
appropriate kernel synchronization primitives, since the same data structure is accessed
both by the user space reconfiguration module (via system calls) when querying
statistics and by the device interrupt handler on an IO completion. A separate
synchronization primitive was employed for each logical volume to improve concurrency.
If the reconfiguration module detects a hotspot, it invokes appropriate routines to
identify a balanced configuration. If a balanced configuration is found, the logical
volumes are migrated appropriately. We use tools provided by the Linux Logical Volume
Manager (LVM), namely pvmove, to achieve data migration while the LVs are online and
being actively accessed. The user application continues to work uninterrupted throughout
the migration, except for possibly some performance impact while the reconfiguration is
in progress.
Finally, since we collect statistics only for IOs actually issued to the block device, any
hits in the buffer cache are transparently handled. Our current implementation does not
account for hits in other caches (disk cache and controller cache).
It is possible that disk accesses for separate bio requests get merged at the disk
level, which would mean that the value of IOCount is overestimated. To account for this,
our implementation treats separate bio requests which correspond to contiguous logical
sectors and complete within a short time window as one large request. This ensures that
the IOCount estimate is more in tune with the actual value.
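The coalescing rule can be sketched in user-space terms. The actual kernel code operates on bio descriptors; the tuple layout and the explicit time window below are our own illustration of the idea.

```python
def coalesce(completions, window):
    """Merge completions that are logically contiguous and close in time,
    so IOCount is not overestimated when the disk merges the accesses.

    completions: list of (time, start_sector, sector_count), ordered by
                 completion time
    window:      maximum gap in seconds for two completions to be treated
                 as one large request (an assumed parameterization)
    Returns (io_count, total_sectors).
    """
    io_count = 0
    total = 0
    last_end = None  # (completion time, next expected sector) of previous bio
    for t, start, count in completions:
        total += count
        if (last_end is not None and start == last_end[1]
                and t - last_end[0] <= window):
            pass                 # contiguous and close in time: same large IO
        else:
            io_count += 1        # counts as a new independent disk access
        last_end = (t, start + count)
    return io_count, total
```
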
5.6 Experimental Evaluation
In this section, we first compare different object remapping techniques using
algorithmic simulations. We then present experimental results from the evaluation of our
prototype implementation. Since our prototype is limited by its hardware configuration,
algorithmic simulations help exhaustively evaluate the performance of the different
approaches for a variety of system configurations.
5.6.1 Simulation Results
We used an algorithmic simulator to compare the different algorithms for object
remapping described in Section 5.3. The simulator implements all the algorithms and,
when invoked on an imbalanced configuration, reports the cost of reconfiguration for
each.
We seek to study the performance of the different algorithms as various system
parameters are varied. The parameters varied were the system size, the initial system
bandwidth and space utilization, and the magnitude of the bandwidth overload. We also
study the impact of the optimizations developed for the displace algorithm.
The default storage system configuration in our simulations comprised four logical
arrays, each with 20 disks of 18 GB. Each logical array in the system was configured to
have an initial storage space and bandwidth utilization of 60% and 50%, respectively.
To achieve a specified storage space utilization on an array, objects were assigned to
the array until the desired space utilization had been reached. The object sizes were
assumed to be uniformly distributed in the range [1 GB, 16 GB], with each object size a
multiple of 0.25 GB. To achieve a specified bandwidth utilization, bandwidth requirement
values were first generated, one for each object, in proportion to the object size. A
random permutation of these values was then generated and a value assigned to each
object in the array. This procedure resulted in a configuration with the desired values
of storage space and bandwidth utilization for each array and no correlation between
object size and object bandwidth requirement. Note that the default system parameters
resulted in an average of 25 objects assigned to each logical array, and thus an average
of 100 objects in the storage system (comprising four arrays).
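The configuration-generation procedure can be sketched as follows; the function name and the array representation are illustrative assumptions rather than the simulator's actual code.

```python
import random

def make_array(space_cap_gb, space_util, bw_util, rng):
    """Generate the objects of one simulated logical array.

    Objects are (size_gb, bw) pairs. Sizes are uniform in [1 GB, 16 GB]
    in multiples of 0.25 GB, added until space_util * space_cap_gb is
    reached. Bandwidth values are first proportional to size, then
    randomly permuted so size and bandwidth end up uncorrelated.
    """
    target_space = space_util * space_cap_gb
    sizes, used = [], 0.0
    while used < target_space:
        size = rng.randrange(4, 65) * 0.25   # 1.00, 1.25, ..., 16.00 GB
        sizes.append(size)
        used += size
    # Bandwidth in proportion to size, normalized so the array's total
    # bandwidth utilization equals bw_util; then permute to break the
    # size/bandwidth correlation.
    bws = [bw_util * s / used for s in sizes]
    rng.shuffle(bws)
    return list(zip(sizes, bws))
```
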
To generate an imbalanced configuration, we increased the bandwidth utilization on half
the arrays in the system until a desired magnitude of overload had been reached. This
resulted in a storage system with half the arrays overloaded and half with spare storage
bandwidth. Here, magnitude of overload refers to the average bandwidth violation across
all arrays in the system. To create an overload, we picked an object at random from one
of the arrays to be overloaded, and increased its bandwidth requirement by an amount
Δbw; for a given system configuration, Δbw was chosen to be the bandwidth requirement of
the least loaded object in the system. This procedure was repeated until the desired
magnitude of overload had been attained. For our experiments, the default
bandwidth violation threshold was chosen to be 80%, and the default magnitude of overload
was fixed at 5%.
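The overload-injection procedure can be sketched as follows; the representation of arrays and objects is an illustrative assumption, not the simulator's actual code.

```python
import random

def inject_overload(arrays, overload_ids, target_overload, threshold, rng):
    """Create an imbalanced configuration in place.

    arrays:       {array_id: [[size_gb, bw], ...]} with bw a fraction of
                  array bandwidth capacity
    overload_ids: ids of the arrays to be overloaded
    Repeatedly picks a random object on a to-be-overloaded array and
    raises its bandwidth by delta_bw, the bandwidth requirement of the
    least loaded object in the system, until the mean bandwidth
    violation across all arrays reaches target_overload.
    """
    def mean_violation():
        excess = [max(0.0, sum(o[1] for o in objs) - threshold)
                  for objs in arrays.values()]
        return sum(excess) / len(arrays)

    # delta_bw is fixed once per system configuration.
    delta_bw = min(o[1] for objs in arrays.values() for o in objs)
    while mean_violation() < target_overload:
        objs = arrays[rng.choice(overload_ids)]
        rng.choice(objs)[1] += delta_bw
    return arrays
```
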
For each experiment, the performance figures reported correspond to an average over 100
runs, i.e., to the average cost of reconfiguration for 100 imbalanced configurations
with the same choice of system parameters. The normalized data displaced figure reported
in the following experiments is the total amount of data displaced (equation 5.3) as a
percentage of the total data in the system, i.e., of the storage space allocated to all
the objects put together.
5.6.1.1 Impact of System Size
In this experiment, we study the impact of the system size on the cost of
reconfiguration. We vary the system size, i.e., the number of logical arrays, from 2 to
10, resulting in systems with 40 to 200 disks. Figures 5.3(a) and 5.3(b) show the
performance of the cost-aware and cost-oblivious approaches, respectively, with varying
system size. The graphs Random Packing and BSR correspond to the cost-oblivious
approaches presented in Sections 5.3.1.1 and 5.3.1.2, respectively. The graphs Random
Reassign and DSwap correspond to the cost-aware approaches presented in Sections 5.3.2.1
and 5.3.2.2, respectively.
Figure 5.3(a) shows that DSwap outperforms Random Reassign. Moreover, while the
normalized reconfiguration cost remains constant with increasing system size for the
former, it increases for the latter. The higher cost of reconfiguration for the Random
Reassign algorithm is due to its randomized nature: with increasing system size, the
number of possible objects to choose from goes up. The normalized cost remains constant
for the DSwap algorithm because objects are chosen from an overloaded array carefully,
based on their BSR values, so increasing the system size does not increase the cost.
Note, however, that the absolute amount of data displaced does increase with the system
size.
[Figure: data displaced (normalized) vs. number of arrays (2–10); panel (a) cost-aware
approaches (Random Reassign, DSwap), panel (b) cost-oblivious approaches (BSR, Random
Packing).]

Figure 5.3. Impact of system size.
Figure 5.3(b) shows the cost of reconfiguration for the cost-oblivious approaches. Since
both approaches reconfigure the system from scratch, the cost of reconfiguration is
significantly higher than that of the cost-aware approaches. In both cases, the cost of
reconfiguration increases with the system size because the probability that an object
gets remapped to its original array decreases. Random Packing gives a lower cost than
BSR because it is semi cost-aware (see Section 5.3.1.1). For the rest of the experiments
described in this section, the cost-oblivious approaches resulted in similarly high
costs of reconfiguration compared to the cost-aware approaches, and so we do not present
those results.
In our experiments, for the Random Reassign approach we set the fraction of objects frac
(see Section 5.3.2.1) considered for reassignment from the set of objects on all the
overloaded arrays to 1.0, i.e., all the objects were considered. Also, for both
randomized algorithms, Random Reassign and Random Packing, the balanced configuration
chosen was the one with the least cost among 100 runs with different seed values.
[Figure: data displaced (normalized) for Random Reassign and DSwap; panel (a) vs.
initial array bandwidth utilization (50–65%), panel (b) vs. mean bandwidth overload
(2–10%).]

Figure 5.4. Impact of bandwidth utilization.
5.6.1.2 Impact of System Bandwidth Utilization
In this experiment, we studied the impact of the bandwidth utilization on the cost of
reconfiguration. Figures 5.4(a) and 5.4(b) show the impact of the initial bandwidth
utilization and of the magnitude of bandwidth overload, respectively.
Figure 5.4(a) shows that as the initial bandwidth utilization is increased from 50% to
65%, the normalized cost remains unchanged for both approaches. This is because
increasing the initial system bandwidth utilization merely increases the initial
bandwidth requirement of all the objects in the system proportionately. This reduces the
fraction of objects that can be reassigned to the underloaded arrays without any
constraint violations. The normalized cost of reconfiguration, however, does not change
significantly, as each object reassignment, on average, now displaces more bandwidth.
Note that DSwap results in a factor of two lower cost than the Random Reassign approach,
for reasons similar to those described in the previous experiment.
Figure 5.4(b) shows that with an increase in the magnitude of overload from 2% to 10%,
the cost of reconfiguration increases for both approaches. This is because, on average,
more data needs to be displaced for a higher magnitude of overload. The rate of increase
in the normalized cost is greater for Random Reassign than for DSwap, as the objects
are chosen for reassignment at random in the former approach, while in the latter
approach objects are considered for reassignment based on their BSR values.
5.6.1.3 Impact of System Space Utilization
In this experiment, we studied the impact of varying system space utilization on the
cost of reconfiguration. Figure 5.5(a) shows that the normalized cost of reconfiguration
remains almost unchanged for both approaches as the system space utilization is varied
from 60% to 90%. This can be attributed to the fact that increasing the system space
utilization increases the number of objects on each array. Consequently, for a fixed
initial bandwidth utilization, the bandwidth requirement of the objects on an array
decreases with an increase in the system space utilization. While this may require that
more objects be reassigned to remove the same bandwidth overload, the increase in the
system space utilization results in the normalized cost remaining largely unchanged.
Note that DSwap results in a cost which is a factor of two less than that of Random
Reassign.
We see a slight increase followed by a slight decrease in the cost for the Random
Reassign approach. The increase occurs because a higher space utilization increases the
number of objects to choose from for reassignment. The slight decrease that follows
occurs because, at higher space utilizations, the number of objects that can be
reassigned decreases as the space constraints on the underloaded arrays become a
significant factor. The slight decrease in the cost for the DSwap approach occurs
because, with an increase in the system space utilization, the fraction of objects on
the overloaded arrays that can be accommodated on the underloaded arrays decreases.
Figure 5.5(b) shows the percentage of times a balanced configuration was identified for
the different imbalanced configurations generated with the same choice of parameters.
The BSR approach, which reconfigures from scratch, fails to find a balanced
configuration every time when the system space utilization is 90%, because of its
deterministic nature. Random Reassign fails at 95% system space utilization, as there is
little spare storage space on the
[Figure: panel (a) data displaced (normalized) vs. system space utilization (60–90%) for
Random Reassign and DSwap; panel (b) percentage of runs in which a solution was found
vs. system space utilization (60–100%) for DSwap, Best-fit Random Packing, Worst-fit
Random Packing, Random Packing, Random Reassign and BSR.]

Figure 5.5. Impact of space utilization.
underloaded arrays; recall that this approach only reassigns objects from overloaded
arrays to underloaded arrays. The best-fit and worst-fit variants of this algorithm
performed similarly, and so are not shown in the figure.
At 100% system space utilization, Random Packing and its variant Worst-fit Random
Packing fail to find a balanced configuration every time. Only the DSwap and Best-fit
Random Packing approaches were able to find a balanced configuration for all overloaded
configurations. As expected, the best-fit variant of Random Packing is able to identify
a balanced configuration in constrained scenarios (see Section 5.3.1.1). DSwap is able
to identify a balanced configuration because it swaps objects between arrays. Note that
in these experiments ufrac (equation 5.5) was chosen to be 0.50.
5.6.1.4 Impact of Optimizations
In this experiment, we study the impact of the optimizations developed for the displace
step of our approach (see Section 5.3.2.2). Here, we present the results in the context
of varying system size. We conducted experiments to study the impact of the
optimizations when various system parameters were varied; the results were similar to
those for the system-size experiment, and so we omit them to avoid repetition.
[Figure: data displaced (normalized) vs. number of arrays (2–10) for DSwap with no
optimizations, with Opt. 1, and with Opt. 1 + Opt. 2.]

Figure 5.6. Impact of optimizations.
Recall that the first optimization involved choosing from among multiple possible groups
of objects to remove the overload on an overloaded array. The second optimization used
backtracking to improve the soln for each overloaded array.
Figure 5.6 shows the impact of the various optimizations as the system size is varied.
As can be seen in the figure, DSwap without any optimizations (NoOpt.) has the highest
normalized cost. Introducing the first optimization (Opt. 1) results in a marginal
improvement in the cost. The improvement is more pronounced when both optimizations
(Opt. 1 + Opt. 2) are employed. This is because, while the first optimization questions
only the choice of the root of the soln, the second optimization uses backtracking to
question the choice of each of the subsequent objects that comprise the soln, thus
resulting in more significant gains.
Note, however, that even this marginal improvement can be significant, as the actual
amount of data that needs to be displaced can differ considerably in the three cases as
the system size is increased. Finally, the role of the optimizations can be particularly
significant, compared to a purely greedy approach, for some specific imbalanced
configurations.
5.6.2 Prototype Evaluation
In this section, we demonstrate the effectiveness of our approach by conducting
experiments on our Linux prototype. The goal in these experiments was two-fold: (i) to
show that the kernel measurement techniques are able to identify a hotspot, and (ii) to
demonstrate that the reconfiguration module makes the correct choice when selecting
objects and underloaded arrays to remove the hotspot.
In our experiments, we use a simple synthetic workload which gives us a great degree of
control in imposing a desired amount of IO load on the storage system. The workload for
each logical volume was defined using two parameters: (i) the concurrency N, and (ii)
the mean think time IA. A workload with concurrency N consists of N concurrent clients;
each client issues a request and, on request completion, sleeps for a time interval
exponentially distributed with mean IA before issuing a new request. The request sizes
were fixed, and successive requests access random locations on the logical volume. The
request size for client requests was fixed at 16 KB in our experiments. Note that while
the think time provided control over the load imposed by each client, the client
concurrency allowed us to independently control the load imposed on an array due to
accesses to an LV.
The characteristics of the host and the storage system in our prototype were as
described in Section 5.5. We partitioned the 20 disks in the system to give five striped
logical arrays, each comprising four disks, with a stripe unit size of 16 KB. Each array
was partitioned into 14 partitions of four GB each⁶. These partitions served as building
blocks (in the form of LVM physical volumes) for the LVs created on each array; thus,
the LV size was a multiple of 4 GB. Of the five logical arrays, one was configured
without any logical volumes and was used as scratch space when swapping logical volumes
between arrays.
6Note that each striped array is visible as a SCSI drive on the host. Since we wanted to utilize maximumpossible storage space on each logical array, and Linux allows only 14 usable partitions for a SCSI drive, wecreated 14 four GB partitions for a total of 56 GB allocatablestorage space on each array.
We set the bandwidth overload threshold for all the arrays to 50% in our experiments. The interval I over which the bandwidth utilization was approximated was chosen to be 10s, and the window size W over which reconfiguration decisions were made was chosen to be 100s. Note that these values are small and were chosen to speed experimentation. Load balancing involving object remapping is a long-term operation and is typically done only over periods of months or more. Finally, we used the 70th percentile of the samples accumulated in the time window W as a measure of the bandwidth utilization of an LV.
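The windowed utilization estimate can be illustrated as follows. This is a hypothetical helper using a nearest-rank percentile (the prototype gathered its statistics inside the kernel); with I = 10s and W = 100s, a window would hold ten samples:

```python
def window_utilization(samples, percentile=70):
    """Summarize the per-interval utilization samples gathered over a
    window W by a percentile (nearest-rank method). Illustrative helper;
    the function name and the nearest-rank choice are assumptions.
    """
    ordered = sorted(samples)
    # nearest rank: smallest index covering `percentile` percent of samples
    rank = max(1, round(percentile / 100 * len(ordered)))
    return ordered[rank - 1]
```

Using a high percentile rather than the mean makes the estimate track sustained load while remaining insensitive to a few quiet intervals.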
As mentioned before, the data migration could be achieved either online or offline.
While our implementation supports online data migration, to speed up experimentation,
we simply reconfigure the arrays for the new mapping of logical volumes. Techniques to
control the rate of online data migration to mitigate the performance impact on foreground
applications have been presented in [39].
We conducted two sets of experiments, one where the LV size or object size of all the objects on the system was the same, and another where the object sizes differed.
5.6.2.1 Uniform Object Size
For the case of uniform object sizes, we present results from two experiments, one where the system had spare storage space in the form of unallocated partitions, and another where the system had no spare space. While the former would invoke the displace step, the latter would invoke the swap step. In both experiments, the size of any LV on an array was 4 GB.
In the first experiment, three arrays were configured with 14 LVs each, and the fourth array was configured with seven LVs, thus leaving half the array empty with seven allocatable partitions. For the first 100 seconds, all the LVs in the system were accessed by a workload with a concurrency of two; the mean think time was fixed at 400ms, 1000ms, 1000ms and 500ms for the workloads accessing LVs on the four arrays, respectively. Figure 5.7(a) shows the estimated average bandwidth utilization as well as the cumulative average
[Figure 5.7. Uniform object size. Panel (a), spare storage space, and panel (b), no spare storage space, each plot the percentage bandwidth utilization (with the overload threshold marked) and the cumulative array IOPS (IOs per second) against time in seconds for Arrays 0-3.]
IOPs across all the LVs for each array as a function of time. The average values reported are over 10-second intervals.
As can be seen, the array utilizations are dictated by the mean think time values for the workloads accessing the component LVs; lower mean think times mean higher utilization values. Also note that the average bandwidth utilization estimated using kernel measurements, and the average IOPs on each array based on measurements at the application level, follow the same trend. This indicates that our kernel measurements correctly track the application behavior.
At t = 100s we increased the workload on Array 0 by increasing the concurrency of half the clients to nine and that of the other half to four. This results in an increase in the bandwidth utilization on the array. Note that the bandwidth utilization on the remaining arrays remains unchanged. At t = 200s the reconfiguration module detects that Array 0 is overloaded, identifies a new balanced configuration, and triggers the appropriate reconfiguration. The reconfiguration involved moving two LVs from Array 0 to Array 3.
The reconfiguration module correctly identified Array 3 as the destination for the LVs, even though Arrays 1 and 2 had lower bandwidth utilization, as it was the only array with spare storage space. Moreover, it correctly chose two of the seven logical volumes being accessed by the workload with a concurrency of nine. This choice minimizes the amount of data displaced.
The graph from t = 200s to t = 300s shows the utilization of the arrays after the reconfiguration. As can be seen, the utilization of Array 0 has decreased to a value close to the overload threshold. The utilization of Array 3, which now holds two additional logical volumes, has increased appropriately. In our experiments, we allow for a soft threshold of 2% around the bandwidth violation threshold, and consequently, no more reconfigurations are triggered. This is done in order to avoid a reconfiguration for minor bandwidth violations.
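The soft-threshold check can be sketched in a line. The function and parameter names here are illustrative; only the 50% threshold and 2% band come from the experiments:

```python
def needs_reconfiguration(utilization_pct, threshold_pct=50.0, soft_band_pct=2.0):
    """Flag an array as a hotspot only when its utilization exceeds the
    overload threshold by more than the soft band, so that minor
    violations do not trigger a reconfiguration. Names are illustrative.
    """
    return utilization_pct > threshold_pct + soft_band_pct
```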
For the second experiment, we configured the storage system with no spare space; each array comprised 14 LVs. The workload for the LVs on Array 0 was the same as in the previous experiment. The mean think times for the workloads on Arrays 1 through 3, however, were chosen to be 500ms, 500ms and 1000ms, respectively. The client concurrency of the workload was fixed at two. Figure 5.7(b) plots the bandwidth utilization and cumulative average IOPs for each array.
In this case, a reconfiguration is again triggered at t = 200s as in the case above. The reconfiguration involved swapping three heavily accessed logical volumes with three logical volumes on Array 3, the array with the least load in the system. Note that despite the
[Figure 5.8. Variable object size; no spare storage space. The percentage bandwidth utilization (with the overload threshold marked) and the cumulative array IOPS (IOs per second) are plotted against time in seconds for Arrays 0-3.]
workload on Array 0 being similar to that in the first experiment, a slightly different observed utilization, due to peculiarities of a real system, results in three logical volumes being swapped, as opposed to two in the first experiment. Consequently, the drop in bandwidth utilization is greater in this run. The reconfiguration results in a reduction in the bandwidth utilization on the array to a value below the overload threshold.
5.6.2.2 Variable Object Size
For the case of the storage system configured with LVs of variable size, we ran experiments both when the system had spare space and when there was no spare space. The results for the case where the system had spare space were similar to those in the previous experiment; when a hotspot occurred, the heavily accessed volumes were chosen and moved to an array with spare space in order to minimize the amount of data displaced. To avoid repetition, we do not present the results from that experiment.
For the case where the storage system had no spare space, the system configuration was as follows. Arrays 0 and 1 were configured with six LVs each: two LVs each of size 4 GB, 8 GB and 16 GB, respectively. Arrays 2 and 3 were configured with 14 LVs each, each of size 4 GB. The mean think times for the workloads accessing the LVs on the four arrays were 300ms, 500ms, 500ms and 1000ms, respectively. For the first 100s of the experiment, the
concurrency for the workload for all the LVs on the system was fixed at two. Figure 5.8 shows the bandwidth utilization and cumulative IOPs as a function of time.
At t = 100s we increased the client concurrency for the workload accessing the LVs on Arrays 0 and 1 to seven and four, respectively. As can be seen in the figure, this results in an increase in the bandwidth utilization on both arrays. However, only Array 0 observes a violation of the bandwidth threshold. At t = 200s the reconfiguration module detects a hotspot and triggers a reconfiguration. The reconfiguration involved swapping two 4 GB LVs with two LVs of the same size from Array 3. So, the reconfiguration module correctly identifies Array 3 as the array with the least load. Also, since all six LVs are configured with the same workload, the two LVs of size 4 GB are the ones with the maximum BSR.
The graph from t = 200s to t = 300s shows that after the reconfiguration the utilization on Array 0 has decreased to a value below the threshold. Array 3, with two new LVs carrying a heavier load, observes an increase in utilization.
5.6.2.3 Implementation Overheads
Our final experiment aimed to study any overhead the kernel enhancements impose on application performance. While the computation involved in maintaining statistics was minimal, the synchronization primitives employed (Section 5.5) may introduce some overhead. So, in this experiment, we vary the number of logical volumes actively being accessed on each array and compare application performance with statistics collection enabled in the kernel to that with statistics collection disabled. The storage system was configured with 14 LVs, each of size 4 GB, per array. The workload accessing each LV had a client concurrency of two, and the mean think time was set to zero. Note that with no think time, each client issues a new IO request immediately after the previously issued request completes. Consequently, the storage system is saturated.
[Figure 5.9. Impact on application performance. The cumulative array IOPS (IOs per second) is plotted against the number of active logical volumes, with statistics collection enabled and disabled.]
Figure 5.9 shows the cumulative average IOPs for one of the arrays as a function of the number of LVs being actively accessed; each point corresponds to an average value for a two-minute run. The graphs for the other arrays were the same. As can be seen, the average IOPs in both cases are similar; there was no noticeable overhead on application performance. Since the storage system is saturated, the average IOPs value does not change significantly as the number of active logical volumes is varied. The slight drop in the value in both cases, with an increase in the number of active LVs, arises because with a larger number of LVs being accessed, the average seek latency increases: for each additional contiguous partition being accessed on an array, the disk heads on the component disks have to seek over a larger disk surface when servicing requests.
5.6.3 Summary of Experimental Results
Our experiments show that, for a variety of system configurations, our novel approach reduces the amount of data moved to remove the hotspot by a factor of two as compared to other approaches. Moreover, the larger the system size or the magnitude of overload, the greater the performance gap. Results from our prototype implementation suggested that our kernel measurement techniques correctly track application behavior and identify hotspots. For the simple overload configurations considered, our techniques correctly remove the hotspot
while minimizing the amount of data displaced. Finally, the kernel enhancements do not result in any noticeable degradation in application performance.
5.7 Related Work
Algorithms for moving data objects from one configuration to another in as few time steps as possible have been presented in [25, 28, 35]. There, it is assumed that the new final configuration is known. In our work, we seek to identify a new final configuration which requires minimal data movement.
Techniques for initial storage system configuration have been presented in [7, 9]. Our work assumes that the storage system is online, and presents techniques to reconfigure the system with minimum data movement when workload hotspots occur.
Load balancing at the granularity of files has been considered in [51]. That work assumes contiguous storage space is available on lightly loaded disks to migrate file extents from heavily loaded disks. Our work seeks to achieve load balancing at the granularity of logical volumes and makes no assumptions about the distribution of spare space in the storage system.
Techniques for moving data chunks between mirrored and RAID5 configurations within an array, based on their load, for improving storage system performance have been proposed in [61]. Our work seeks to achieve improved performance across the storage system by moving logical volumes between arrays.
Disk load balancing schemes for video objects have been presented in [62]. Video objects are assumed to be replicated, and load balancing is achieved by changing the mapping of video clients to replicas. In our work, logical volumes are assumed to have no replicas across arrays, and load balancing requires identifying a new mapping of data objects to arrays.
Request throttling techniques to isolate the performance of applications accessing volumes on a shared storage infrastructure have been explored in [33, 40, 57]. We present
algorithms to improve storage system performance by migrating entire logical volumes
between arrays.
Finally, while [39] presents techniques for controlling the rate of data migration to mitigate the instantaneous performance impact on foreground applications during online reconfiguration, our work seeks to optimize for the scale of the reconfiguration, which dictates the duration of the performance impact.
5.8 Concluding Remarks
In this chapter, we argued that techniques employed to load-balance large-scale storage systems do not optimize for the scale of the reconfiguration—the amount of data displaced to realize the new configuration.
Reconfiguring the system from scratch can incur significant data movement overhead. Our novel approach uses the current object configuration as a hint, the goal being to retain most of the objects in place and thus limit the scale of the reconfiguration. We also described a simple measurement-based technique for identifying hotspots and for approximating per-object bandwidth requirements.
Finally, we evaluated our techniques using a combination of simulation studies and an evaluation of an implementation in the Linux kernel. Results from the simulation study showed that for a variety of system configurations our novel approach reduces the amount of data moved to remove the hotspot by a factor of two as compared to other approaches. The gains increase with larger system size and magnitude of overload. Experimental results from a prototype evaluation suggested that the measurement techniques correctly identify workload hotspots. For some simple overload configurations considered in the prototype, our approach identified a load-balanced configuration which minimizes the amount of data moved.
CHAPTER 6
SUMMARY AND FUTURE WORK
In this thesis, we argued that improved manageability of a storage system is key to ensuring its availability. The sheer size of these systems, coupled with the complexity and variability of the application workloads that access them and the slew of storage management tasks, makes storage management non-trivial. Traditionally, storage management tasks have been performed manually by administrators using a combination of experience, rules of thumb, and trial and error. However, such an approach increases the chances of a misconfigured or sub-optimally configured system. This motivates the need for an automated, seamless and intelligent way to manage the storage resource.
Although high-level planning decisions do require human involvement, tasks such as storage resource allocation are amenable to software automation, akin to a self-managing system which executes important operations without the need for human intervention. Moreover, storage management tasks may need to be executed at multiple time scales. Based on the time period at which management tasks need to be instantiated, they can be classified into three categories: initial configuration, short-term reconfiguration and long-term reconfiguration.
In this thesis, we considered problems in each category with a focus on techniques for automating storage resource management. In particular, we considered two storage allocation tasks: storage bandwidth allocation and storage space allocation. We now present a summary of the contributions of this dissertation.
6.1 Thesis Contributions
In this dissertation we made the following contributions.
6.1.1 Initial Storage System Configuration: Placement Techniques in a Self-managing
Storage System
The first step in storage management is deciding on a mapping of storage objects to disk arrays. Object placement decisions are integral in determining application performance and thus are crucial to the success of a storage system. For a self-managing storage system, a suitable placement technique is one that has low management overhead and delivers agreeable performance.
Object placement techniques are based on striping—a technique that interleaves the placement of objects onto disks—and can be classified into two categories: narrow and wide striping. From the perspective of management complexity, these two techniques have fundamentally different implications. Whereas wide striping stripes each object across all the disks in the system and needs very little workload information for making placement decisions, narrow striping techniques stripe an object across a subset of the disks and employ detailed workload information to optimize the placement.
In this work, we performed a systematic study of the tradeoffs of narrow and wide striping to determine their suitability for large-scale storage systems. The work involved (i) simulations driven by OLTP traces and synthetic workloads, and (ii) experiments on a 40-disk storage system testbed.
The results showed that an idealized narrow-striped system can outperform a comparable wide-striped system for small requests. However, wide striping outperforms narrow striping in the presence of the workload skews that occur in real I/O workloads; the two systems perform comparably in a variety of other real-world scenarios. The experiments indicated that the additional workload information needed by narrow placement techniques
may not necessarily translate to significantly better performance and, more specifically, does not outweigh the benefits of the management simplicity innate to a wide-striped system.
6.1.2 Short-term Storage System Reconfiguration: Bandwidth Allocation
In the context of dynamic bandwidth allocation, we developed two techniques: one a measurement-based inference technique, and another based on learning.
Self-managing Bandwidth Allocation in a Multimedia File Server:
Large-scale storage systems host data objects of multiple types which are accessed by applications with diverse service requirements. For instance, a multimedia file server services a heterogeneous mix of soft real-time streaming media and traditional best-effort requests. To provide QoS to both application types, a reservation-based approach, where the storage space is shared but a certain fraction of the bandwidth is reserved for each class, has certain advantages. By sharing storage resources, the file server can extract statistical multiplexing gains; by reserving bandwidth, it can prevent interference among classes and meet the performance guarantees of the soft real-time class. Thus, a reservation-based approach has inherent advantages and flexibility which make it suitable for a large-scale storage system.
Dynamic workload variations, as seen in modern file servers, may mean that one set of reservations is not suitable all the time. To address this limitation, in this thesis we developed techniques for self-managing bandwidth allocation in a multimedia file server. In our scheme, we used online measurements to infer bandwidth requirements and guide allocation decisions. A workload monitoring module tracked several parameters representative of the load within each class using a moving histogram. It tracked various aspects of resource usage from the time a request arrives to the time it is serviced by the disk. Monitored parameters include request arrival rates, request waiting times and disk utilizations within each class.
Requests within the best-effort class desire low average response times, while those
within the real-time class have associated deadlines that must be met. We instrumented an
existing disk scheduling algorithm which takes into account these disparate performance
requirement specifications while enforcing allocations and making scheduling decisions.
A simulation study using NFS file-server traces as well as synthetic workloads demonstrated that our techniques (i) provide control over the time-scale of allocation via tunable parameters, (ii) have stable behavior during overload, and (iii) provide significant advantages over static bandwidth allocation.
Learning-based Approach for Dynamic Bandwidth Allocation:
An alternative to a measurement-based inference technique for bandwidth allocation is reinforcement learning. An advantage of using reinforcement learning is that no prior training of the system is required; the technique allows the system to learn online. Moreover, a learning-based approach can also handle complex non-linearity in system behavior. In this problem, we assume multiple application classes, each of which specifies its QoS requirement in the form of an average response time goal.
A simple learning approach is one that systematically tries out all possible allocations for each system state, computes a cost function, and stores these values to guide future allocations. Although such a scheme is simple to design and implement, it has prohibitive memory and search space requirements; this is because the number of possible allocations increases exponentially with the number of classes.
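One illustrative way to see this blow-up, under the assumption that bandwidth is allocated in discrete increments (the 5% granularity below is our choice for the sketch, not the thesis's): the number of candidate allocations per state is the number of compositions of the allocation units among the classes.

```python
from math import comb

def num_allocations(num_classes, granularity_pct=5):
    """Number of ways to divide 100% of bandwidth among num_classes classes
    in steps of granularity_pct -- the allocations a naive learner would
    have to track per system state. The granularity is an illustrative
    assumption.
    """
    steps = 100 // granularity_pct
    # compositions of `steps` units into num_classes non-negative parts
    return comb(steps + num_classes - 1, num_classes - 1)
```

With 5% steps, two classes give 21 candidate allocations per state, but four classes already give 1771, which is why the naive table becomes impractical.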
A key contribution of our work was the design of an enhanced learning-based approach that uses the semantics of the problem to overcome the drawbacks of the naive learning approach. The technique takes the current system state into account while making allocation decisions and thereby avoids allocations that are clearly inappropriate for a particular state; in other words, the optimized technique intelligently guides and restricts the allocation space explored. These design decisions result in a substantial reduction in memory and search space requirements, making a practical implementation feasible.
We implemented these techniques in the Linux kernel and used the software RAID driver in Linux to configure the disk array. The results showed that (i) the use of learning enables the storage system to reduce the impact of QoS violations by over a factor of two, and (ii) the implementation overheads of employing such techniques in operating system kernels are small.
6.1.3 Long-term Storage System Reconfiguration: Automated Object Remapping
Suitable initial placement obviates the need for frequent reconfiguration, and automated bandwidth allocation, which uses controlled request throttling, helps extract good performance from the system in the face of transient workload changes. Persistent workload changes that stress the storage system and result in hotspots, however, make it necessary to tune the mapping of storage objects to arrays to ensure agreeable performance.
Moving the system to a new configuration involves executing a migration plan, which is a sequence of object moves. The reconfiguration itself could be carried out either online or offline. In both cases, the scale of the reconfiguration, i.e., the amount of data that needs to be displaced, is of consequence. While for an offline reconfiguration the scale of the reconfiguration determines its duration and hence the downtime, for an online reconfiguration it determines the duration of the performance impact on foreground applications. Existing approaches do not optimize for the scale of the reconfiguration, possibly moving much more data than required to remove the hotspot.
To address this limitation, we developed algorithms to minimize the amount of data displaced during a reconfiguration to remove hotspots in large-scale storage systems. Rather than identifying a new configuration from scratch, which may entail significant data movement, our novel approach uses the current object configuration as a hint, the goal being to retain most of the objects in place and thus limit the scale of the reconfiguration. To minimize the amount of data that needs to be moved, we used a greedy approach guided by the bandwidth-to-space ratio (BSR): by greedily selecting high-BSR objects for reassignment, one can displace more bandwidth per unit of data moved. Finally, we used various optimizations, including searching for multiple solutions, to counter some of the pitfalls of a greedy approach.
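The greedy BSR-driven selection step can be sketched as follows. This is a minimal sketch under our own naming; the full algorithm described in the chapter also chooses destination arrays (displace or swap) and searches multiple candidate solutions.

```python
def greedy_bsr_selection(objects, excess_bw):
    """Greedily pick objects to move off an overloaded array, highest
    bandwidth-to-space ratio (BSR) first, until the excess bandwidth has
    been shed. `objects` is a list of (name, bandwidth, size) tuples;
    the tuple layout and function name are illustrative.
    """
    chosen, shed = [], 0.0
    ranked = sorted(objects, key=lambda o: o[1] / o[2], reverse=True)
    for name, bandwidth, _size in ranked:
        if shed >= excess_bw:
            break
        chosen.append(name)
        shed += bandwidth
    return chosen, shed
```

Ranking by BSR rather than by raw bandwidth is what keeps the amount of data displaced small: a small, hot object sheds as much load as a large, lukewarm one at a fraction of the migration cost.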
We evaluated our techniques using a combination of simulation studies and an evaluation of an implementation in the Linux kernel. Results from the simulation study suggest that for a variety of system configurations our novel approach reduces the amount of data moved to remove the hotspot by a factor of two as compared to other approaches. The gains increased with larger system size and magnitude of overload. Experimental results from the prototype evaluation suggested that our measurement techniques correctly identify workload hotspots. For some simple overload configurations considered in the prototype, our approach identifies a load-balanced configuration which minimizes the amount of data moved. Moreover, the kernel enhancements do not result in any noticeable degradation in application performance.
6.2 Future Work
In this section, we discuss some future research directions.

• Dynamic Bandwidth Allocation: In this thesis, we addressed the problem of dynamic bandwidth allocation for the case when application classes specify their QoS requirement as an average response time goal. A useful extension to this work would be to allow application classes to specify their QoS requirements in dissimilar ways. An additional enhancement would involve understanding and developing a way of identifying a QoS specification which is suitable and realistic for each class and for the given storage system.

• Automated Object Remapping (extensions): In the prototype for our automated object remapping work, we made the simplifying assumption that all arrays in the storage system are similar. As future work, we would like to develop techniques for
quantifying the bandwidth requirements of objects, as wellas the bandwidth capaci-
ties of arrays, for a storage system comprising heterogeneous arrays.� Distributed Resource Management:Traditionally storage systems have been ei-
ther NAS-based (Network-attached Storage) or SAN-based (Storage Area Network).
While NAS offers ease of management, a SAN offers high throughput. An object-based
storage architecture [41] offers a middle ground between the NAS and SAN
architectures, suitably blending the advantages of both. Active disks [4] tout the
benefits of moving computation closer to the data. Recent work [32] explores the
confluence of these two paradigms and presents techniques for leveraging the com-
putational capability at the storage device for interactive search of indexed data. In
this thesis, we focused on management of the storage resource. In the context of
active disks there are interesting problems in distributed resource management for
multiple resources. In particular, finding the right balance between computing at the
device and computing at the host with a knowledge of the network interconnect and
bandwidth capacity of the storage device would be key. Moreover, such an approach
should be self-managing, identifying the right tradeoff for diverse applications.
APPENDIX
COMPARISON USING HOMOGENEOUS WORKLOADS
In this appendix we present the detailed results of our homogeneous workload simulation
experiments. We experiment with large requests that have a mean request size of 1
MB and a stripe unit size of 512 KB. We repeat each experiment with small requests that
have a mean size of 4 KB and a stripe unit size of 4 KB. Unless specified otherwise, we
choose request rates that yield a utilization of around 60-65%; this corresponds to a mean
inter-arrival time of 17 ms for large requests and 4 ms for small requests, respectively.
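These utilization figures follow from the single-server relation utilization = mean service time / mean inter-arrival time. A minimal sketch; the per-request service times below are assumed values chosen to reproduce the quoted 60-65% range, not numbers taken from the simulator:

```python
# Array utilization modeled as a single-server queue:
# rho = mean_service_time / mean_interarrival_time.
# The service times used here are hypothetical illustrations.

def utilization(service_ms: float, interarrival_ms: float) -> float:
    """Fraction of time the array is busy serving requests."""
    return service_ms / interarrival_ms

# Large requests: ~11 ms assumed service time, 17 ms inter-arrival time.
print(round(utilization(11.0, 17.0), 2))  # 0.65
# Small requests: ~2.5 ms assumed service time, 4 ms inter-arrival time.
print(round(utilization(2.5, 4.0), 2))    # 0.62
```

Both points land in the 60-65% band the experiments target.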
Effect of System Size: We vary the number of arrays in the system from 1 to 10 and
measure the response times of requests in the narrow- and wide-striped systems. Each array
in the system is accessed by a single stream in narrow striping and all streams access all
arrays in wide striping. Figure A.1 plots the results.
The figure shows that the performance of the two systems is similar over a range of
system sizes for both large and small requests. Increasing the system size results in
interference between streams in wide striping, since all stores span all arrays. However,
spanning all arrays also leads to better load balancing across arrays. As we increase
the system size, the benefits of load balancing offset the impact of interference, and the
response times remain almost unchanged.
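The contrast between the two layouts can be sketched with a toy block-to-array mapping; the function names and the round-robin offset are our illustration, not code from the simulator:

```python
def narrow_array(store_id: int, block: int, num_arrays: int) -> int:
    # Narrow striping: every block of a store lives on that store's own array.
    return store_id % num_arrays

def wide_array(store_id: int, block: int, num_arrays: int) -> int:
    # Wide striping: consecutive stripe units round-robin across all arrays,
    # offset by store id so stores do not all start on the same array.
    return (store_id + block) % num_arrays

# With 4 arrays, the first four stripe units of store 2:
print([narrow_array(2, b, 4) for b in range(4)])  # [2, 2, 2, 2]
print([wide_array(2, b, 4) for b in range(4)])    # [2, 3, 0, 1]
```

Narrow striping confines a store (and its stream) to one array; wide striping spreads every store over every array, which is what produces both the interference and the load balancing discussed above.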
Effect of Stripe Unit Size: In this experiment we study the impact of changing the
stripe unit size. Varying the stripe unit size of small requests did not have much impact, so
we omit the results. The stripe unit size of large requests was varied from 128 KB to 2 MB.
The average request size of the large requests was kept fixed at 1 MB. Figure A.2 plots the
results.
[Figure residue omitted: two plots of mean response time (ms) versus system size (1-10 arrays), for (a) large requests and (b) small requests.]

Figure A.1. Homogeneous Workload: Effect of System Size
For large requests, when the stripe unit size is small compared to the average request
size, wide striping gives higher response times than narrow striping. This
is because, although a smaller stripe unit size results in increased parallelism, it also
increases the sequentiality breakdown and the probability of interference with requests from
streams accessing other stores. To wit, an average request size of 1 MB would result in 8
disk accesses for a stripe unit size of 128 KB, as compared to 2 disk accesses for a stripe
unit size of 512 KB. The sequentiality of access is maintained in narrow striping since all
requests for a store access the same array. An increase in the stripe unit size reduces the
extent of sequentiality breakdown, and narrow and wide striping give comparable performance.
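The disk-access counts quoted above are simply the request size divided by the stripe unit size, assuming stripe-aligned requests (an unaligned request can touch one additional unit):

```python
import math

def disk_accesses(request_bytes: int, stripe_unit_bytes: int) -> int:
    # Number of stripe units (hence disks) a stripe-aligned request touches.
    return math.ceil(request_bytes / stripe_unit_bytes)

MB = 1024 * 1024
KB = 1024
print(disk_accesses(1 * MB, 128 * KB))  # 8
print(disk_accesses(1 * MB, 512 * KB))  # 2
```

A smaller stripe unit thus buys parallelism at the cost of breaking one sequential transfer into many smaller, potentially interfering ones.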
Effect of Utilization Level: In this experiment, we study the impact of utilization level
by varying the mean inter-arrival times (IA) of requests. The IA time for large (small)
requests is varied from 14 ms to 20 ms (3 ms to 7 ms) in steps of 1 ms. Figure A.3 shows
the results for the large and the small case, respectively.
Figure A.3 (a) shows that for large requests, as one decreases the IA times, the relative
performance of narrow striping improves slightly. This is because, at low IA times, the
request rate is higher, and streams see increased interference from other streams in wide
[Figure residue omitted: mean response time (ms) for large requests versus stripe-unit size (128 KB to 2 MB), with curves for system sizes 1, 2, 3, 5, 10, and 15.]

Figure A.2. Homogeneous Workload: Effect of Stripe-unit Size
striping. For larger IA times, narrow and wide striping give comparable performance. Varying
the IA times for smaller requests results in similar behavior (see Figure A.3 (b)); the
difference in response times between narrow and wide striping in this case, however, is
smaller than that observed for large requests, because of the smaller transfer time of small
requests.
[Figure residue omitted: mean response time (ms) versus mean inter-arrival time, for (a) large requests (11-20 ms) and (b) small requests (3-7 ms), with curves for system sizes 1, 2, 3, 5, 10, and 15.]

Figure A.3. Homogeneous Workload: Effect of Utilization Level
Effect of Request Size: Next we study the effect of changing the request size. Varying
the request size of small requests did not have much impact, so we omit the results. The
[Figure residue omitted: mean response time (ms) for large requests versus mean request size (64 KB to 2 MB), with curves for system sizes 1, 2, 3, 5, 10, and 15.]

Figure A.4. Homogeneous Workload: Effect of Request Size
request size of large requests is varied from 64 KB to 2 MB. The stripe unit size was
chosen to be half the average request size. Figure A.4 plots the results. Narrow and wide
striping give similar performance for most request sizes. For very large request sizes (2
MB), the interference between request streams results in wide striping giving slightly
higher response times.
Effect of Percentage of Writes: In this experiment we study the effect of varying the
percentage of writes. The percentage of writes was varied from 0% to 90%. We chose
inter-arrival times of 20 ms and 6 ms for large and small requests, respectively. Figure A.5
plots the results.
For large requests (see Figure A.5 (a)) we observe that as we increase the percentage
of writes, the performance difference between narrow and wide striping increases, with
wide striping giving higher response times. This is because increasing the percentage
of writes increases the background load due to dirty cache flushes, which in turn increases
the interference seen by request streams in wide striping. Small requests (see Figure A.5 (b))
show similar behavior; the impact of interference from the background load due to dirty
cache flushes, however, is less pronounced due to the smaller size of requests.
[Figure residue omitted: mean response time (ms) versus percentage of write requests (0-90%), for (a) large requests and (b) small requests, with curves for system sizes 1, 2, 3, 5, 10, and 15.]

Figure A.5. Homogeneous Workload: Effect of Percentage of Writes
BIBLIOGRAPHY
[1] Configuring the Oracle database with Veritas software and EMC storage. Tech. rep., Oracle Corporation. Available from http://otn.oracle.com/deploy/availability/pdf/oracbook1.pdf.

[2] EMC Symmetrix Optimizer. Available from http://www.emc.com/products/storagemanagement/symmoptimizer.jsp.

[3] Abdelzaher, T., Shin, K. G., and Bhatti, N. Performance guarantees for web server end-systems: A control-theoretical approach. IEEE Transactions on Parallel and Distributed Systems 13, 1 (Jan. 2002).

[4] Acharya, Anurag, Uysal, Mustafa, and Saltz, Joel H. Active disks: Programming model, algorithms and evaluation. In Architectural Support for Programming Languages and Operating Systems (1998), pp. 81–91.

[5] Agrawal, S., Chaudhuri, S., Das, A., and Narasayya, V. Automating layout of relational databases. In Proceedings of the 19th International Conference on Data Engineering, Bangalore, India (2003).

[6] Allen, N. Don't waste your storage dollars. Research Report, Gartner Group, March 2001.

[7] Alvarez, G., Borowsky, E., Go, S., Romer, T., Becker-Szendy, R., Golding, R., Merchant, A., Spasojevic, M., Veitch, A., and Wilkes, J. Minerva: An automated resource provisioning tool for large-scale storage systems. ACM Transactions on Computer Systems (to appear) (2002).

[8] Alvarez, G., Keeton, K., Merchant, A., Riedel, E., and Wilkes, J. Storage systems management. Tutorial presented at ACM Sigmetrics 2000, Santa Clara, CA, June 2000.

[9] Anderson, E., Hobbs, M., Keeton, K., Spence, S., Uysal, M., and Veitch, A. Hippodrome: Running circles around storage administration. In Proceedings of the Usenix Conference on File and Storage Technology (FAST'02), Monterey, CA (January 2002), pp. 175–188.

[10] Anderson, E., Kallahalla, M., Spence, S., Swaminathan, R., and Wang, Q. Ergastulum: An approach to solving the workload and device configuration problem. Tech. Rep. HPL-SSP-2001-05, HP Laboratories SSP, May 2001.
[11] Anderson, E., Swaminathan, R., Veitch, A., Alvarez, G., and Wilkes, J. Selecting RAID levels for disk arrays. In Proceedings of the Conference on File and Storage Technology (FAST'02), Monterey, CA (January 2002), pp. 189–201.

[12] Aron, M., Sanders, D., Druschel, P., and Zwaenepoel, W. Scalable content-aware request distribution in cluster-based network servers. In Proceedings of the USENIX 2000 Annual Technical Conference, San Diego, CA (June 2000).

[13] Barham, P. A fresh approach to file system quality of service. In Proceedings of NOSSDAV'97, St. Louis, Missouri (May 1997), pp. 119–128.

[14] Borowsky, E., Golding, R., Jacobson, P., Merchant, A., Schreier, L., Spasojevic, M., and Wilkes, J. Capacity planning with phased workloads. In Proceedings of WOSP'98, Santa Fe, NM (October 1998).

[15] Borowsky, E., Golding, R., Merchant, A., Shriver, E., Spasojevic, M., and Wilkes, J. Eliminating storage headaches through self-management. In Proc. of the First Symposium on Operating System Design and Implementation (OSDI), Seattle, WA (October 1996).

[16] Breslau, L., Cao, P., Fan, L., Phillips, G., and Shenker, S. Web caching and Zipf-like distributions: Evidence and implications. In Proceedings of Infocom'99, New York, NY (March 1999).

[17] Brown, A., Oppenheimer, D., Keeton, K., Thomas, R., Kubiatowicz, J., and Patterson, D. A. ISTORE: Introspective storage for data-intensive network services. In Proceedings of the 7th Workshop on Hot Topics in Operating Systems (HotOS-VII), Rio Rico, AZ (March 1999).

[18] Brown, A., and Patterson, D. A. Towards maintainability, availability, and growth benchmarks: A case study of software RAID systems. In Proceedings of the USENIX Annual Technical Conference, San Diego, CA (June 2000).

[19] Chase, J., Anderson, D., Thakar, P., Vahdat, A., and Doyle, R. Managing energy and server resources in hosting centers. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (SOSP) (October 2001), pp. 103–116.

[20] Chen, P., and Patterson, D. Maximizing performance in a striped disk array. In Proceedings of ACM SIGARCH Conference on Computer Architecture, Seattle, WA (May 1990), pp. 322–331.

[21] Chen, P. M., and Lee, E. K. Striping in a RAID level 5 disk array. In Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (May 1995).

[22] Chesire, M., Wolman, A., Voelker, G., and Levy, H. Measurement and analysis of a streaming workload. In Proceedings of the USENIX Symposium on Internet Technology and Systems (USEITS), San Francisco, CA (March 2001).
[23] Dahlin, M., Mather, C., Wang, R., Anderson, T., and Patterson, D. A quantitative analysis of cache policies for scalable network file systems. In Proceedings of ACM SIGMETRICS'94 (May 1994).

[24] Dan, A., and Sitaram, D. An online video placement policy based on bandwidth and space ratio. In Proceedings of SIGMOD (May 1995), pp. 376–385.

[25] Anderson, Eric, et al. An experimental study of data migration algorithms. In WAE: International Workshop on Algorithm Engineering (2001), LNCS.

[26] Flynn, R., and Tetzlaff, W. H. Disk striping and block replication algorithms for video file servers. In Proceedings of IEEE International Conference on Multimedia Computing Systems (ICMCS) (1996), pp. 590–597.

[27] Gribble, S. D., Manku, G., Roselli, D., Brewer, E., Gibson, T., and Miller, E. Self-similarity in file systems. In Proceedings of ACM SIGMETRICS '98, Madison, WI (June 1998).

[28] Hall, Joseph, Hartline, Jason D., Karlin, Anna R., Saia, Jared, and Wilkes, John. On algorithms for efficient data migration. In Symposium on Discrete Algorithms (2001), pp. 620–629.

[29] Haskin, R. Tiger Shark–a scalable file system for multimedia. IBM Journal of Research and Development 42, 2 (March 1998), 185–197.

[30] Hennessy, J. The future of systems research. IEEE Computer (August 1999), 27–33.

[31] Holton, M., and Das, R. XFS: A next generation journalled 64-bit file system with guaranteed rate I/O. Tech. rep., Silicon Graphics, Inc. Available online as http://www.sgi.com/Technology/xfs-whitepaper.html, 1996.

[32] Huston, Larry, Sukthankar, Rahul, Wickremesinghe, Rajiv, Satyanarayanan, Mahadev, Ganger, Gregory R., Riedel, Erik, and Ailamaki, Anastassia. Diamond: A storage architecture for early discard in interactive search. In FAST (2004), pp. 73–86.

[33] Karlsson, Magnus, Karamanolis, Christos, and Zhu, Xiaoyun. Triage: Performance isolation and differentiation for storage systems. In Proceedings of the International Workshop on Quality of Service (IWQoS 2004), Montreal, Canada (June 2004), pp. 67–74.

[34] Keeton, K., Patterson, D. A., and Hellerstein, J. The case for intelligent disks (IDISKs). In Proceedings of the 24th Conference on Very Large Databases (VLDB) (August 1998).

[35] Khuller, S., Kim, Y., and Wan, Y. Algorithms for data migration with cloning. In ACM Symp. on Principles of Database Systems (2003).

[36] Lamb, E. Hardware spending matters. Red Herring (June 2001), 32–22.
[37] Lee, E. K., and Katz, R. H. An analytic performance model for disk arrays. In Proceedings of the 1993 ACM SIGMETRICS (May 1993), pp. 98–109.

[38] Loaiza, J. Optimal storage configuration made easy. Tech. rep., Oracle Corporation. Available from http://otn.oracle.com/deploy/performance/pdf/optstorageconf.pdf.

[39] Lu, C., Alvarez, G., and Wilkes, J. Aqueduct: Online data migration with performance guarantees. In Proceedings of the Usenix Conference on File and Storage Technology (FAST'02), Monterey, CA (January 2002), pp. 219–230.

[40] Lumb, C., Merchant, A., and Alvarez, G. Facade: Virtual storage devices with performance guarantees. In FAST'03 (2003).

[41] Mesnier, M., Ganger, G., and Riedel, E. Object-based storage. IEEE Communications Magazine 41, 8 (August 2003), 84–90.

[42] Molano, A., Juvva, K., and Rajkumar, R. Real-time file systems: Guaranteeing timing constraints for disk accesses in RT-Mach. In Proceedings of IEEE Real-time Systems Symposium (December 1997).

[43] Nerjes, G., Muth, P., Paterakis, M., Romboyannakis, Y., Triantafillou, P., and Weikum, G. Scheduling strategies for mixed workloads in multimedia information servers. In Proceedings of the 8th International Workshop on Research Issues in Data Engineering (RIDE'98), Orlando, Florida (February 1998).

[44] Nordstrom, E., and Carlstrom, J. A reinforcement learning scheme for adaptive link allocation in ATM networks. In Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications 2, IWANNT'95 (1995).

[45] Patterson, D., Gibson, G., and Katz, R. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of ACM SIGMOD'88 (June 1988), pp. 109–116.

[46] Patterson, D. A., Brown, A., Broadwell, P., Candea, G., Chen, M., Cutler, J., Enriquez, P., Fox, A., Kiciman, E., Merzbacher, M., Oppenheimer, D., Sastry, N., Tetzlaff, W., Traupman, J., and Treuhaft, N. Recovery-oriented computing (ROC): Motivation, definition, techniques, and case studies. UC Berkeley Computer Science Technical Report UCB//CSD-02-1175 (March 2002).

[47] Pradhan, P., Tewari, R., Sahu, S., Chandra, A., and Shenoy, P. An observation-based approach towards self-managing web servers. In Proceedings of ACM/IEEE Intl Workshop on Quality of Service (IWQoS), Miami Beach, FL (May 2002).

[48] Revel, D., McNamee, D., Pu, C., Steere, D., and Walpole, J. Feedback based dynamic proportion allocation for disk I/O. Tech. Rep. CSE-99-001, OGI CSE, January 1999.

[49] Riedel, E., Gibson, G. A., and Faloutsos, C. Active storage for large-scale data mining and multimedia. In Proceedings of the 24th International Conference on Very Large Databases (VLDB '98), New York, NY (August 1998).
[50] Scheuermann, P., Weikum, G., and Zabback, P. Data partitioning and load balancing in parallel disk systems. VLDB Journal 7, 1 (1998), 48–66.

[51] Scheuermann, Peter, Weikum, Gerhard, and Zabback, Peter. Data partitioning and load balancing in parallel disk systems. VLDB Journal: Very Large Data Bases 7, 1 (1998), 48–66.

[52] Seltzer, M., and Small, C. Self-monitoring and self-adapting systems. In Proceedings of the 1997 Workshop on Hot Topics on Operating Systems, Chatham, MA (May 1997).

[53] Shenoy, P., Goyal, P., and Vin, H. M. Architectural considerations for next generation file systems. In Proceedings of the Seventh ACM Multimedia Conference, Orlando, FL (November 1999).

[54] Shenoy, P., and Vin, H. M. Cello: A disk scheduling framework for next generation operating systems. In Proceedings of ACM SIGMETRICS Conference, Madison, WI (June 1998), pp. 44–55.

[55] Singh, S., and Bertsekas, D. Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems 9 (NIPS) (1997), pp. 974–980.

[56] Sundaram, V., and Shenoy, P. Bandwidth allocation in a self-managing multimedia file server. In Proceedings of the Ninth ACM Conference on Multimedia, Ottawa, Canada (October 2001).

[57] Sundaram, V., and Shenoy, P. A practical learning-based approach for dynamic storage bandwidth allocation. In Proceedings of ACM/IEEE Intl Workshop on Quality of Service (IWQoS), Monterey, CA (June 2003), pp. 479–497.

[58] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[59] Ward, J., O'Sullivan, M., Shahoumian, T., and Wilkes, J. Hippodrome: Running circles around storage administration. In Proceedings of the Usenix Conference on File and Storage Technology (FAST'02), Monterey, CA (January 2002), pp. 203–217.

[60] Wijayaratne, R., and Reddy, A. L. N. Providing QoS guarantees for disk I/O. Tech. Rep. TAMU-ECE97-02, Department of Electrical Engineering, Texas A&M University, 1997.

[61] Wilkes, J., Golding, R., Staelin, C., and Sullivan, T. The HP AutoRAID hierarchical storage system. In Proceedings of the Fifteenth ACM Symposium on Operating System Principles, Copper Mountain Resort, Colorado (December 1995), pp. 96–108.

[62] Wolf, J., Yu, P. S., and Shachnai, H. DASD dancing: A disk load balancing optimization scheme for video-on-demand computer systems. In Proceedings of ACM SIGMETRICS'95 (1995), pp. 157–166.