
SELF-MANAGING TECHNIQUES FOR STORAGE RESOURCE MANAGEMENT

A Dissertation Presented

by

VIJAY SUNDARAM

Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment

of the requirements for the degree of

DOCTOR OF PHILOSOPHY

February 2006

Computer Science


© Copyright by Vijay Sundaram 2006

All Rights Reserved


SELF-MANAGING TECHNIQUES FOR STORAGE RESOURCE MANAGEMENT

A Dissertation Presented

by

VIJAY SUNDARAM

Approved as to style and content by:

Prashant Shenoy, Chair

Mark Corner, Member

C. Mani Krishna, Member

James Kurose, Member

W. Bruce Croft, Department Chair
Computer Science


Good design comes from experience. Experience comes from bad design.


ACKNOWLEDGMENTS

The years spent in Amherst doing my PhD have been a fulfilling and illuminating experience. Many people have contributed in significant ways and helped me see this through. First and foremost I would like to thank my advisor, Professor Prashant Shenoy, for his expert guidance over the years. I would like to thank the members of my thesis committee—Professors Mark Corner, Mani Krishna and Jim Kurose. I would also like to thank Dr. Sumit Roy and Dr. Pawan Goyal for collaborating with me in my research and for their excellent mentoring during my internships at HP Labs and IBM Almaden. Also, thanks are due to Sumit for helping me with valuable career advice.

Tyler Trafford has been most helpful with configuring the Linux cluster and the storage testbed in the context of my research. My heartfelt thanks go to Sharon Mallory, Pauline Hollister, Betty Hardy and Karren Sacco, who made things simpler by helping eagerly with various administrative issues.

My friends, fellow students and colleagues have played a significant role in this journey. I would like to thank Atul Maharshi and Upendra Sharma, friends from IIT, who have kept me laughing over the years. I would like to thank Ramesh Nallapatti for his eager help whenever I was in need. Abhishek Chandra and Bhuvan Urgaonkar have been great friends and labmates. Purushottam Kulkarni and Peter Desonyers have been very helpful with practice talks and comments on paper drafts. In Rahul Gupta, Subhrangshu Nandi and Pranesh Venugopal I found great housemates who made my stay in Amherst a pleasant one. My sincere thanks to my friends Harpal Singh Bassalli, Swati Birla, Yu Gu, Kishore Indukuri, Pallika Kanani, Anoop George Ninan, Hema Raghavan and Aparna Vemuri. Atul Sheel and Rashmi Sheel provided me with a home away from home in Amherst.


The constant encouragement, belief and support of my parents, Col. M.M. Sundaram and Harsha Sundaram, and my brother Ajay Sundaram, have been instrumental in my achievements. Last but not least, I thank my wife Kavita Jaswal for her support and confidence, urging me to go on and never cave in, no matter what.


ABSTRACT

SELF-MANAGING TECHNIQUES FOR STORAGE RESOURCE MANAGEMENT

FEBRUARY 2006

VIJAY SUNDARAM

B.Tech., INDIAN INSTITUTE OF TECHNOLOGY BOMBAY

M.S., UNIVERSITY OF MASSACHUSETTS AMHERST

Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST

Directed by: Professor Prashant Shenoy

The increasing reliance on online information in our daily lives has called for a rethinking of how people manage and maintain computer systems. As information has become more valuable and computing environments more complex, improved manageability has become key to ensuring availability. The sheer size of enterprise-scale storage systems coupled with the diversity and variability of application workloads makes their management non-trivial. Not surprisingly, numerous studies have shown that management costs have become a significant fraction of the total cost of ownership of large storage systems. Traditionally, storage management tasks have been performed manually by administrators using a combination of experience, rules of thumb and trial and error. This increases the chance of a misconfigured or sub-optimally configured system. The cost of such misconfigurations can be high since even a short downtime can result in substantial revenue losses. So, although storage is cheap, storage management is costly and storage mismanagement costlier. This argues the need for an automated, seamless and intelligent way to manage the storage resource.

In this thesis, I propose self-managing techniques, specifically for resource management, to improve the manageability of large-scale storage systems. I have focused on techniques for automating two common storage allocation tasks: storage bandwidth allocation and storage space allocation. Large-scale storage systems host data objects of multiple types which are accessed by applications with diverse service requirements. I have developed an online measurement-based technique as well as one based on learning to dynamically partition bandwidth between application classes. Storage allocation algorithms that determine object placement, and thus performance, are crucial to the success of a storage system. For a self-managing storage system a suitable placement technique is one that has low management overhead and delivers agreeable performance. In this context, I empirically compare different placement techniques to determine their suitability for large-scale storage systems. Finally, I also present techniques to minimize the amount of data displaced when remapping objects to eliminate hotspots.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES

CHAPTER

1. INTRODUCTION

   1.1 Motivation
   1.2 Automating Storage Resource Management
   1.3 Thesis Contributions

       1.3.1 Initial Storage System Configuration: Placement Techniques in a Self-managing Storage System
       1.3.2 Short-term Storage System Reconfiguration: Bandwidth Allocation
       1.3.3 Long-term Storage System Reconfiguration: Automated Object Remapping

   1.4 Dissertation Outline

2. PLACEMENT TECHNIQUES IN A SELF-MANAGING STORAGE SYSTEM

   2.1 Background and Problem Description
   2.2 Experimental Evaluation

       2.2.1 Experimental Methodology
       2.2.2 Ideal Narrow Striping versus Wide Striping

             2.2.2.1 Comparison using Homogeneous Workloads
             2.2.2.2 Comparison using Heterogeneous Workloads
             2.2.2.3 Summary

       2.2.3 Impact of Inter-Stream Interference
       2.2.4 Impact of Load Skews: Trace-driven Simulations
       2.2.5 Experiments on a Storage System Testbed

             2.2.5.1 Synthetic Workload
             2.2.5.2 TPC-H Workload
             2.2.5.3 TPC-C Workload

   2.3 Summary and Implications of our Experimental Results
   2.4 Related Work
   2.5 Concluding Remarks

3. SELF-MANAGING BANDWIDTH ALLOCATION IN A MULTIMEDIA FILE SERVER

   3.1 Self-Managing Bandwidth Allocation: Problem Definition
   3.2 Self-Managing Bandwidth Allocation in a Single Disk Server

       3.2.1 System Model
       3.2.2 Requirements
       3.2.3 Monitoring the Workload in the Two Classes
       3.2.4 Adapting the Allocation of Each Class

             3.2.4.1 Estimating Bandwidth Requirement based on Disk Utilizations
             3.2.4.2 Estimating Bandwidth Requirement based on the Arrival Rate
             3.2.4.3 Computing the Reservations of Each Class

   3.3 Self-Managing Bandwidth Allocation in a Multi-disk Server
   3.4 Experimental Methodology

       3.4.1 Simulation Environment
       3.4.2 Workload Characteristics

             3.4.2.1 Best-effort Text Clients
             3.4.2.2 Soft Real-time Video Clients

   3.5 Experimental Evaluation

       3.5.1 Ability to Adapt to Changing Workloads
       3.5.2 Bandwidth Allocation in a Single-disk Server
       3.5.3 Bandwidth Allocation in a Multi-disk Server
       3.5.4 Impact of Tunable Parameters
       3.5.5 Comparison with Static Allocation

   3.6 Related Work
   3.7 Concluding Remarks

4. LEARNING-BASED APPROACH FOR DYNAMIC BANDWIDTH ALLOCATION

   4.1 Problem Definition

       4.1.1 Background and System Model
       4.1.2 Key Requirements
       4.1.3 Problem Formulation

   4.2 A Learning-based Approach

       4.2.1 Reinforcement Learning Background
       4.2.2 System State
       4.2.3 Allocation Space
       4.2.4 Cost and State Action Values
       4.2.5 A Simple Learning-based Approach
       4.2.6 An Enhanced Learning-based Approach

   4.3 Implementation in Linux
   4.4 Experimental Evaluation

       4.4.1 Simulation Methodology and Workload
       4.4.2 Effectiveness of Dynamic Bandwidth Allocation
       4.4.3 Comparison with Alternative Approaches
       4.4.4 Effect of Tunable Parameters
       4.4.5 Implementation Experiments
       4.4.6 Implementation Overheads

   4.5 Related Work
   4.6 Concluding Remarks

5. AUTOMATED OBJECT REMAPPING FOR LOAD BALANCING LARGE SCALE STORAGE SYSTEMS

   5.1 Introduction

       5.1.1 Motivation
       5.1.2 Research Contributions

   5.2 Problem Definition

       5.2.1 System Model
       5.2.2 Problem Formulation

   5.3 Object Remapping Techniques

       5.3.1 Cost Oblivious Object Remapping

             5.3.1.1 Randomized Packing
             5.3.1.2 BSR-based Approach

       5.3.2 Cost-aware Object Remapping

             5.3.2.1 Randomized Object Reassignment
             5.3.2.2 Displace and Swap

   5.4 Measuring Bandwidth Requirements and Detecting Hotspots
   5.5 Implementation Considerations
   5.6 Experimental Evaluation

       5.6.1 Simulation Results

             5.6.1.1 Impact of System Size
             5.6.1.2 Impact of System Bandwidth Utilization
             5.6.1.3 Impact of System Space Utilization
             5.6.1.4 Impact of Optimizations

       5.6.2 Prototype Evaluation

             5.6.2.1 Uniform Object Size
             5.6.2.2 Variable Object Size
             5.6.2.3 Implementation Overheads

       5.6.3 Summary of Experimental Results

   5.7 Related Work
   5.8 Concluding Remarks

6. SUMMARY AND FUTURE WORK

   6.1 Thesis Contributions

       6.1.1 Initial Storage System Configuration: Placement Techniques in a Self-managing Storage System
       6.1.2 Short-term Storage System Reconfiguration: Bandwidth Allocation
       6.1.3 Long-term Storage System Reconfiguration: Automated Object Remapping

   6.2 Future Work

APPENDIX: COMPARISON USING HOMOGENEOUS WORKLOADS

BIBLIOGRAPHY

LIST OF TABLES

2.1 Characteristics of the Fujitsu Disk

2.2 Summary of the Traces. IOPS denotes the number of I/O operations per second.

2.3 TPC-C and Sequential Workload Throughput in Narrow and Wide Striping

3.1 Characteristics of the Auspex NFS trace

3.2 Characteristics of Video traces

LIST OF FIGURES

2.1 Narrow and wide striping in an enterprise storage system.

2.2 Effect of system size for homogeneous closed-loop workloads. System size of 1 depicts narrow striping.

2.3 Homogeneous Workload: Closed-loop Testbed Experiments

2.4 Effect of system size for heterogeneous Poisson workloads. System size of 1 depicts narrow striping.

2.5 Effect of varying the stripe unit size of large requests. System size of 1 depicts narrow striping.

2.6 Effect of the inter-arrival times of large requests. System size of 1 depicts narrow striping.

2.7 Effect of inter-arrival times of small requests. System size of 1 depicts narrow striping.

2.8 Effect of request size of large requests. System size of 1 depicts narrow striping.

2.9 Effect of percentage of large write requests. System size of 1 depicts narrow striping.

2.10 Effect of percentage of small write requests. System size of 1 depicts narrow striping.

2.11 Impact of inter-stream interference. System size of 1 depicts narrow striping.

2.12 Trace Driven Simulations

2.13 Trace Driven Simulations with Load Imbalance

2.14 Heterogeneous Workload: Closed-loop Testbed Experiments

2.15 Comparison using the TPC-H Benchmark

3.1 Three techniques for supporting multiple application classes at a file server.

3.2 A Moving Histogram

3.3 Parameters tracked by the monitoring module

3.4 Bursty nature of the NFS trace workload.

3.5 Adaptive allocation of disk bandwidth

3.6 Bandwidth allocation in a single-disk server.

3.7 Bandwidth allocation in a multi-disk server.

3.8 Effect of various tunable parameters on the granularity of bandwidth allocations.

3.9 Comparison with Static Partitioning

4.1 Relationship between application classes, logical volumes and logical units.

4.2 Discretizing the State Space.

4.3 Steps involved in learning

4.4 Algorithm flowchart

4.5 Behavior of the learning-based dynamic bandwidth allocation technique.

4.6 Comparison with Alternative Approaches

4.7 Impact of Tunable Parameters

4.8 Results from our prototype implementation.

4.9 Memory overheads of the bandwidth allocator.

5.1 System model.

5.2 Illustration of Displace and Swap.

5.3 Impact of system size.

5.4 Impact of bandwidth utilization.

5.5 Impact of space utilization.

5.6 Impact of optimizations.

5.7 Uniform object size

5.8 Variable object size; no spare storage space

5.9 Impact on application performance.

A.1 Homogeneous Workload: Effect of System Size

A.2 Homogeneous Workload: Effect of Stripe-unit Size

A.3 Homogeneous Workload: Effect of Utilization Level

A.4 Homogeneous Workload: Effect of Request Size

A.5 Homogeneous Workload: Effect of Percentage of Writes

CHAPTER 1

INTRODUCTION

1.1 Motivation

Enterprise-scale storage systems are complex systems consisting of tens or hundreds of storage devices. Due to the sheer size of these systems coupled with the complexity of the application workloads that access them, storage systems are becoming increasingly difficult to design, configure and manage. Storage system management comprises a slew of administration tasks ranging from how much and what storage to buy, to how to map storage objects to disk arrays. Moreover, reconfiguration and tuning is required on a continual basis to deal with changes in workload or incremental growth. Not surprisingly, numerous studies have shown that management costs far outstrip equipment costs and have become the dominant fraction (75-90%) of the total cost of ownership of large computing systems [46, 6, 36]. Overprovisioning in such large-scale systems to alleviate the management complexity can be expensive and may not pay off even in a well-configured system, especially since diverse workloads and changing requirements undermine the notion of a single flawless configuration.

Traditionally, storage management tasks have been performed manually by administrators who use a combination of experience, rules of thumb and trial and error. This increases the chances of a misconfigured or sub-optimally configured system. In an age where information is increasingly available online, the cost of such misconfigurations can be high since even a short downtime can result in substantial revenue losses. So, although storage is cheap, storage management is costly and storage mismanagement costlier still. In fact, it has been argued that the problems of maintainability, availability and growth of computing systems have overshadowed that of performance and that the traditional focus on performance is less important in today's environments [18, 30]. These arguments motivate the need for an automated, seamless and intelligent way to manage the storage resource.

Although high-level planning decisions do require human involvement, tasks such as storage resource allocation are amenable to software automation akin to a self-managing system which executes important operations without the need for human intervention. The primary research challenge is to ensure that the system provides performance that is comparable to a human-managed system, but at a lower cost.

How often a management task needs to be instantiated depends on a number of factors:

• The specifics of the task. What is the inherent nature of the task?

• The initial configuration. Is the system ill-configured or well-configured?

• Changing workload patterns. What is the time-scale over which the workload changes?

Whereas some management tasks require attention in the short term, say over a period of hours to days, others need to be dealt with only over longer time periods ranging from months to years. For example, backups could be carried out on a daily basis on critical data and may be required only once a week on less important data. Adding new and faster storage devices to the storage system may be required less often, over periods of months to years. So, we see that the same management task may require attention over multiple time scales.

The initial configuration of the system may also play a role in how often administration is required. If the system has been configured with an eye to future growth and the expected workload requirements, as is necessary, it may ease the task of the administrator. However, an ill-configured system, where for example logical volumes run out of space at frequent intervals, or heavily accessed logical volumes have been collocated on a storage device resulting in hotspots and performance degradation, may require frequent reconfiguration.


Changing workloads may also force an immediate reconfiguration. For example, if we see a sudden increase in workload for a class of applications which does not have sufficient bandwidth resources to absorb the burst, the system resources may need to be reallocated; the time-period of this task again depends on the burstiness of the workload.

Finally, an interplay of these factors may also guide the time-period of the management task. For example, changing workload patterns in an ill-configured system versus a well-configured system may require widely different amounts of reconfiguration effort.

A well-designed self-managing technique should take into account all of these factors, together with the associated anomalies, and trigger the requisite reconfiguration as and when necessary.

1.2 Automating Storage Resource Management

In the previous section, we argued that automating storage management is crucial and that a multitude of factors make the task of automating storage management challenging. The impetus for automating storage management came from [15, 8]. Autonomic computing is another term often used to refer to the notion of self-managing computing systems. Such a system is self-configuring, self-optimizing, self-healing and self-securing. In the context of computing systems the high-level goal is to improve system management and reliability. The eventual goal is to have a system that does not need anyone to manage and maintain it once it has been installed. In this thesis, we focus on the storage management component of autonomic computing.

Storage management tasks can broadly be classified into three categories: initial configuration, short-term reconfiguration and long-term reconfiguration. Initial configuration refers to tasks performed when the storage system is first set up; tasks such as object placement, RAID-level tagging and configuring the network connectivity of the storage system fall into this category. Short-term reconfiguration refers to tasks which require attention on a continual basis and include bandwidth allocation between application classes, extending logical volumes, and so on. Finally, long-term reconfiguration refers to tasks which need to be invoked when a short-term reconfiguration is insufficient to ensure acceptable storage system performance. These include migrating the system to new devices, data migration to remove long-term workload hotspots, and the like.

In this thesis, we consider problems in each category. In particular, the problems addressed are from the perspective of resource management. Resource management in a storage system aims at ensuring that the storage system gives agreeable performance and that the storage resources are used efficiently. The storage resource comprises two components: storage space and storage bandwidth. We have developed techniques for automating the allocation of both resources.¹

¹Note that here by storage space allocation we refer to space allocation at the granularity of logical volumes. Space internal to a logical volume is managed by a file system or a database manager, as the case may be.

1.3 Thesis Contributions

In this section, we elaborate on the contributions of the thesis and discuss the challenges involved in automating storage resource management. We classify these contributions based on the time scale of the management task.

1.3.1 Initial Storage System Configuration: Placement Techniques in a Self-managing Storage System

The first step in storage management is deciding on a mapping of storage objects to disk arrays. Object placement decisions are integral in determining application performance and thus are crucial to the success of a storage system. For a self-managing storage system a suitable placement technique is one that has low management overhead and delivers agreeable performance.

Object placement techniques are based on striping—a technique that interleaves the placement of objects onto disks—and can be classified into two different categories: narrow and wide striping. From the perspective of management complexity, these two techniques have fundamentally different implications. Whereas wide striping stripes each object across all the disks in the system and needs very little workload information for making placement decisions, narrow striping techniques stripe an object across a subset of the disks and employ detailed workload information to optimize the placement.

In this work, we perform a systematic study of the tradeoffs of narrow and wide striping to determine their suitability for large-scale storage systems. The work involved (i) simulations driven by OLTP traces and synthetic workloads, and (ii) experiments on a 40-disk storage system testbed.

The results show that an idealized narrow striped system can outperform a comparable wide-striped system for small requests. However, wide striping outperforms narrow striped systems in the presence of workload skews that occur in real I/O workloads; the two systems perform comparably for a variety of other real-world scenarios. The experiments indicate that the additional workload information needed by narrow placement techniques may not necessarily translate to significantly better performance, and more specifically does not outweigh the benefits of management simplicity innate to a wide-striped system.

1.3.2 Short-term Storage System Reconfiguration: Bandwidth Allocation

In the context of dynamic bandwidth allocation we develop two techniques, one a measurement-based inference technique and another based on learning.

Self-managing Bandwidth Allocation in a Multimedia File Server:

Large-scale storage systems host data objects of multiple types which are accessed by applications with diverse service requirements. For instance, a multimedia file server services a heterogeneous mix of soft real-time streaming media and traditional best-effort requests. To provide QoS to both application types, employing a reservation-based approach, where the storage space is shared but a certain fraction of the bandwidth is reserved for each class, has certain advantages. By sharing storage resources, the file server can extract statistical multiplexing gains; by reserving bandwidth, it can prevent interference among classes and meet the performance guarantees of the soft real-time class. Thus, a reservation-based approach has inherent advantages and flexibility which make it suitable for a large-scale storage system.

Dynamic workload variations, as seen in modern file servers, mean that one set of reservations may not be suitable all the time. To address this limitation, in this thesis we develop techniques for self-managing bandwidth allocation in a multimedia file server. In our scheme, we use online measurements to infer bandwidth requirements and guide allocation decisions. A workload monitoring module tracks several parameters representative of the load within each class using a moving histogram. It tracks various aspects of resource usage from the time a request arrives to the time it is serviced by the disk. Monitored parameters include request arrival rates, request waiting times and disk utilizations within each class.
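As a concrete illustration of this style of monitoring, the sketch below keeps a sliding-window histogram per tracked parameter and derives a conservative demand estimate from a high percentile. It is only a minimal sketch in Python: the class structure, window size and the 95th-percentile rule are assumptions made here for illustration, not details taken from the thesis.

```python
from collections import deque

class MovingHistogram:
    """Sliding-window histogram over recent observations of one parameter
    (e.g., per-interval arrival rate, request waiting time, disk utilization)."""
    def __init__(self, window=1000):
        self.samples = deque(maxlen=window)  # only the most recent samples count

    def observe(self, value):
        self.samples.append(value)

    def percentile(self, p):
        """Return the p-th percentile (0 <= p <= 100) of the recent samples."""
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, round(p / 100.0 * (len(ordered) - 1)))
        return ordered[idx]

class ClassMonitor:
    """Tracks load indicators for one application class (best-effort or real-time)."""
    def __init__(self):
        self.arrival_rate = MovingHistogram()
        self.waiting_time = MovingHistogram()
        self.disk_utilization = MovingHistogram()

    def estimate_demand(self, p=95):
        # A high percentile of recent per-class utilization acts as a
        # conservative estimate of the bandwidth fraction the class needs.
        return self.disk_utilization.percentile(p)

# At the end of each measurement interval, record that interval's statistics.
best_effort = ClassMonitor()
best_effort.arrival_rate.observe(120.0)     # requests/second in this interval
best_effort.waiting_time.observe(8.5)       # mean waiting time in ms
best_effort.disk_utilization.observe(0.42)  # fraction of disk time consumed
print(best_effort.estimate_demand())
```

Because the histogram forgets old samples, the demand estimate naturally adapts as the workload of a class ramps up or drains away.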

Requests within the best-effort class desire low average response times, while those within the real-time class have associated deadlines that must be met. We instrument an existing disk scheduling algorithm that takes these disparate performance requirements into account while enforcing allocations and making scheduling decisions.

A simulation study using NFS file-server traces as well as synthetic workloads demonstrates that our techniques (i) provide control over the time-scale of allocation via tunable parameters, (ii) have stable behavior during overload, and (iii) provide significant advantages over static bandwidth allocation.

Learning-based Approach for Dynamic Bandwidth Allocation:

An alternative to a measurement-based inference technique for bandwidth allocation is reinforcement learning. An advantage of using reinforcement learning is that no prior training of the system is required; the technique allows the system to learn online. Moreover, a learning-based approach can also handle complex non-linearity in system behavior. In this problem, we assume multiple application classes, each of which specifies its QoS requirement in the form of an average response time goal.

A simple learning approach is one that systematically tries out all possible allocations for each system state, computes a cost function and stores these values to guide future allocations. Although such a scheme is simple to design and implement, it has prohibitive memory and search space requirements; this is because the number of possible allocations increases exponentially with the number of classes.
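To get a feel for that growth (the granularity and class counts below are illustrative, not values used in the thesis), one can count how many ways a fixed number of discrete bandwidth units can be split among the classes:

```python
from math import comb

def num_allocations(num_classes, steps):
    """Ways to split `steps` discrete bandwidth units among `num_classes`
    classes (stars and bars): C(steps + num_classes - 1, num_classes - 1)."""
    return comb(steps + num_classes - 1, num_classes - 1)

# With a 5% allocation granularity (20 units of bandwidth):
for c in (2, 4, 8):
    print(c, num_allocations(c, 20))   # 2 -> 21, 4 -> 1771, 8 -> 888030
# A naive learner would have to explore and store a table of this size
# for every system state it encounters.
```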

A key contribution of our work is the design of an enhanced learning-based approach that uses the semantics of the problem to overcome the drawbacks of the naive learning approach. The technique takes the current system state into account while making allocation decisions and thereby avoids allocations that are clearly inappropriate for a particular state; in other words, the optimized technique intelligently guides and restricts the allocation space explored. These design decisions result in a substantial reduction in memory and search space requirements, making a practical implementation feasible.
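The sketch below conveys the flavor of such a state-aware learner; it is not the thesis implementation. The state discretization, the cost update, the exploration rate and the pruning rule (only shift bandwidth from a class meeting its goal to one missing it) are simplifying assumptions chosen here for illustration.

```python
import random
from collections import defaultdict

class BandwidthLearner:
    """Learns, per discretized system state, which bandwidth split keeps
    per-class response times near their targets."""
    def __init__(self, classes, step=0.1, alpha=0.3, epsilon=0.1):
        self.classes = classes      # class names, e.g., ["oltp", "batch"] (hypothetical)
        self.step = step            # allocation granularity (fraction of bandwidth)
        self.alpha = alpha          # learning rate for the cost estimates
        self.epsilon = epsilon      # probability of exploring a random candidate
        self.q = defaultdict(dict)  # state -> {allocation: estimated cost}

    def discretize(self, response_times, targets):
        # State: per-class ratio of observed response time to its target,
        # bucketed coarsely (an illustrative discretization).
        return tuple(min(3, int(response_times[c] / targets[c])) for c in self.classes)

    def candidates(self, state, current):
        # Enhanced idea (in spirit): only consider shifting one step of bandwidth
        # from a class meeting its goal to a class missing it, rather than
        # enumerating every possible split for every state.
        moves = [current]
        for i in range(len(self.classes)):          # i: class missing its goal
            for j in range(len(self.classes)):      # j: class meeting its goal
                if i != j and state[i] > 1 and state[j] <= 1 and current[j] >= self.step:
                    alloc = list(current)
                    alloc[i] += self.step
                    alloc[j] -= self.step
                    moves.append(tuple(round(a, 2) for a in alloc))
        return moves

    def choose(self, state, current):
        options = self.candidates(state, current)
        if random.random() < self.epsilon:
            return random.choice(options)                             # explore
        return min(options, key=lambda a: self.q[state].get(a, 0.0))  # exploit

    def update(self, state, allocation, observed_cost):
        old = self.q[state].get(allocation, 0.0)
        self.q[state][allocation] = old + self.alpha * (observed_cost - old)
```

In each measurement interval such an allocator would discretize the observed response times, pick an allocation with choose(), apply it, and feed the resulting cost (for example, a weighted sum of target violations) back through update().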

We implement these techniques in the Linux kernel and use the software RAID driver in Linux to configure the disk array. The results show that (i) the use of learning enables the storage system to reduce the impact of QoS violations by over a factor of two, and (ii) the implementation overheads of employing such techniques in operating system kernels are small.

1.3.3 Long-term Storage System Reconfiguration: Automated Object Remapping

Suitable initial placement obviates the need for frequent reconfiguration, and automated bandwidth allocation, which uses controlled request throttling, helps extract good performance from the system in the face of transient workload changes. Persistent workload changes, which stress the storage system and result in hotspots, make it necessary to re-tune the mapping of storage objects to arrays to ensure agreeable performance.

Moving the system to a new configuration involves executing a migration plan, which is a sequence of object moves. The reconfiguration itself could be carried out either online or offline. In both cases, the scale of the reconfiguration, i.e., the amount of data that needs to be displaced, is of consequence. While for an offline reconfiguration the scale determines the duration of the reconfiguration and hence the downtime, for an online reconfiguration it determines the duration of the performance impact on foreground applications. Existing approaches do not optimize for the scale of the reconfiguration, possibly moving much more data than required to remove the hotspot.

To address this limitation, we develop algorithms to minimize the amount of data displaced during a reconfiguration to remove hotspots in large-scale storage systems. Rather than identifying a new configuration from scratch, which may entail significant data movement, our novel approach uses the current object configuration as a hint, the goal being to retain most of the objects in place and thus limit the scale of the reconfiguration. To minimize the amount of data that needs to be moved we use a greedy approach that uses the bandwidth-to-space ratio (BSR) as a guiding metric: by greedily selecting high-BSR objects for reassignment, one can displace more bandwidth per unit of data moved. Finally, we use various optimizations, including searching for multiple solutions, to counter some of the pitfalls of a greedy approach.
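As a minimal sketch of the greedy intuition (illustrative only; the thesis develops more refined algorithms such as Displace and Swap, which this simplified version omits): when an array is overloaded, repeatedly move its highest-BSR object to a lightly loaded array with enough bandwidth and space headroom, so that each byte moved sheds as much load as possible. All names and data structures below are assumptions made for the example.

```python
def relieve_hotspot(objects, arrays, hot, capacity_bw, capacity_space):
    """objects: {name: (bandwidth, space, array)}; arrays: list of array ids.
    Greedily remap high-BSR objects off the overloaded array `hot` until its
    bandwidth fits, returning the migration plan as a list of object moves."""
    def load(a):
        return sum(bw for bw, _, arr in objects.values() if arr == a)

    def used_space(a):
        return sum(sp for _, sp, arr in objects.values() if arr == a)

    # Highest bandwidth-to-space ratio first: most load shed per byte moved.
    candidates = sorted((o for o, (_, _, arr) in objects.items() if arr == hot),
                        key=lambda o: objects[o][0] / objects[o][1], reverse=True)
    plan = []
    for obj in candidates:
        if load(hot) <= capacity_bw[hot]:
            break                       # hotspot resolved; stop moving data
        bw, sp, _ = objects[obj]
        # Destinations with both bandwidth and space headroom for this object.
        targets = [a for a in arrays if a != hot
                   and load(a) + bw <= capacity_bw[a]
                   and used_space(a) + sp <= capacity_space[a]]
        if targets:
            dest = min(targets, key=load)           # least-loaded feasible array
            objects[obj] = (bw, sp, dest)
            plan.append((obj, hot, dest))
    return plan
```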

We evaluate our techniques using a combination of simulation studies and an evaluation of an implementation in the Linux kernel. Results from the simulation study suggest that for a variety of system configurations our approach reduces the amount of data moved to remove the hotspot by a factor of two as compared to other approaches. The gains increase with larger system size and magnitude of overload. Experimental results from the prototype evaluation suggest that our measurement techniques correctly identify workload hotspots. For some simple overload configurations considered in the prototype, our approach identifies a load-balanced configuration which minimizes the amount of data moved. Moreover, the kernel enhancements do not result in any noticeable degradation in application performance.

1.4 Dissertation Outline

We now present a brief outline of the dissertation. In Chapter 2, we present an evaluation of object placement techniques in storage systems. In Chapter 3, we consider the problem of self-managing bandwidth allocation in the context of a multimedia file server. Chapter 4 discusses a learning-based approach for dynamic bandwidth allocation to meet the QoS requirements of multiple application classes. In Chapter 5, we address the problem of automated object remapping to load balance large-scale storage systems. We conclude with a brief summary of the research contributions and future research directions in Chapter 6.


CHAPTER 2

PLACEMENT TECHNIQUES IN A SELF-MANAGING STORAGE SYSTEM

In Chapter 1 we argued that an ill-configured storage system can be detrimental to application performance and may increase the burden on the administrator. A well-configured system, on the other hand, obviates the need for frequent reconfiguration. So, the initial configuration of a storage system is particularly significant.

In this chapter, we focus on the initial configuration task of object placement. Storage allocation algorithms that determine object placement, and thus performance, are crucial to the success of a storage system. For a self-managing storage system a suitable placement technique is one that has low management overhead and delivers agreeable performance.

Object placement techniques for large storage systems have been extensively studied in the last decade, most notably in the context of disk arrays such as RAID [9, 20, 21, 26, 50]. Most of these approaches are based on striping—a technique that interleaves the placement of objects onto disks—and can be classified into two fundamentally different categories. Techniques in the first category require a priori knowledge of the workload and use either analytical or empirically derived models to determine an optimal placement of objects onto the storage system [9, 20, 50]. An optimal placement is one that balances the load across disks, minimizes the response time of individual requests and maximizes the throughput of the system. Since requests accessing independent stores can interfere with one another, these placement techniques often employ narrow striping—where each object is interleaved across a subset of the disks in the storage system—to minimize such interference and provide isolation. An alternate approach is to assume that detailed knowledge of the workload is difficult to obtain a priori and to use wide striping—where all objects are interleaved across all disks in the storage system. The premise behind these techniques is that storage workloads vary at multiple time-scales and often in an unpredictable fashion, making the task of characterizing these workloads complex. In the absence of precise knowledge, striping all objects across all disks yields good load balancing properties. A potential limitation, though, is the interference between independent requests that access the same set of disks.

Although narrow striping is advocated by the research literature and widely used in practice, at least one major database vendor has recently advocated the use of wide striping to simplify storage administration [1, 38]. However, no systematic study of the two techniques exists in the literature.

From the perspective of management complexity, these two techniques have fundamentally different implications. A storage system that employs narrow striping will require each allocation request to specify detailed workload parameters so that the system can determine an optimal placement for the allocated store. In contrast, systems employing wide striping will require little, if any, knowledge about the workload for making storage allocation decisions. Thus, wide-striped systems are easier to design and use, while narrow-striped systems can potentially make better storage decisions. This results in a simplicity versus performance tradeoff—wide-striped systems advocate simplicity by requiring less workload information, which can potentially result in worse performance; the opposite is true for a narrow-striped system. Narrow striping can extract performance gains only if the workload specification is precise. It is not a priori evident whether narrow striping can make better storage decisions when the workload specification is imprecise or incorrect (the accuracy of the workload information is not an issue in wide striping, since no such information is required for placement decisions). Although placement of objects in large storage systems has been extensively studied [7, 9, 10, 50], surprisingly, no systematic study of these tradeoffs of wide and narrow striping exists in the literature. Our work seeks to address this issue by answering the following questions:

• Is narrow or wide striping better suited for large scale storage systems? Specifically, does the additional workload information required by narrow striping translate into significant performance gains?

• From a performance standpoint, how do narrow and wide striping compare against one another? What is the impact of interference between requests accessing the same set of disks in wide striping? Similarly, what is the impact of imprecise workload knowledge and the resulting load skews in narrow striping?

[Figure 2.1 here: stores (Store 1 through Store K) mapped onto disk arrays (Array 1 through Array N), with each store confined to a single array under narrow striping and spread across all arrays under wide striping.]

Figure 2.1. Narrow and wide striping in an enterprise storage system.

2.1 Background and Problem Description

An enterprise storage system consists of a large number of disk arrays. A disk array is essentially a collection of physical disks that presents an abstraction of one or more logical disks to the rest of the system. Disk arrays map objects onto disks by interleaving data from each object (e.g., a file) onto successive disks in a round-robin manner—a process referred to as striping. The unit of interleaving, referred to as a stripe unit, denotes the maximum amount of logically contiguous data stored on a single disk; the number of disks across which each data object is striped is referred to as its stripe width. As a result of striping, each read or write request potentially accesses multiple disks, which enables applications to extract I/O parallelism across disks and, to an extent, prevents hot spots by dispersing the application load across multiple disks. Disk arrays can also provide fault tolerance by guarding against data loss due to a disk failure. Depending on the exact fault tolerance technique employed, disk arrays are classified into different RAID levels [45]. A RAID level 0 (or simply, RAID-0) array is non-redundant and cannot tolerate disk failures; it does, however, employ striping to enhance I/O throughput. A RAID-1 array employs mirroring, where data on a disk is replicated on another disk for fault-tolerance purposes. A RAID-1+0 array combines mirroring with striping, essentially by mirroring an entire RAID-0 array. A RAID-5 array uses parity blocks for redundancy—each parity block guards a certain number of data blocks—and distributes parity blocks across disks in the array.
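To make the stripe unit and stripe width terminology concrete, the following is a minimal sketch of a plain RAID-0 style mapping from a logical block number to a disk and an offset; the block size, stripe unit and disk count in the example are illustrative assumptions, and parity and mirroring are ignored.

```python
def locate_block(logical_block, stripe_unit_blocks, stripe_width):
    """Map a logical block number to (disk index, block offset on that disk)
    under plain round-robin striping (RAID-0 style; no parity, no mirroring)."""
    stripe_number = logical_block // stripe_unit_blocks   # which stripe unit overall
    within_unit = logical_block % stripe_unit_blocks      # offset inside that unit
    disk = stripe_number % stripe_width                   # round-robin across disks
    units_on_disk = stripe_number // stripe_width         # full units already on this disk
    return disk, units_on_disk * stripe_unit_blocks + within_unit

# Example: a 64 KB stripe unit (128 blocks of 512 bytes) across 4 disks.
print(locate_block(0, 128, 4))    # (0, 0)
print(locate_block(128, 128, 4))  # (1, 0)   next stripe unit lands on the next disk
print(locate_block(512, 128, 4))  # (0, 128) wraps back to disk 0, its second unit
```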

With the above background, consider a storage system that consists of a certain number of RAID arrays. In general, RAID arrays in large storage systems may be heterogeneous—they may consist of different numbers of disks and may be configured using different RAID levels. For simplicity we assume that all arrays in the system are homogeneous. The primary goal in such systems is to allocate storage to applications such that their performance needs are met. The storage allocated on one or more arrays is referred to as a store [7]; the data on a store is collectively referred to as a data object (e.g., a tablespace, a file system). The sequence of requests accessing a store is referred to as a request stream. Thus, we are concerned with the storage allocation problem at the granularity of stores and data objects; we are less concerned about how each application manages its allocated store to map individual data items such as files and database tables to disks.


We need to make two decisions when allocating a store to a dataobject: (1)RAID level

selection:The RAID level chosen for the store depends on the fault-tolerance needs of the

application and the workload characteristics. From the workload perspective, RAID-1+0

(mirroring combined with striping) may be appropriate for workloads with small writes,

while RAID-5 is appropriate for workloads with large writes.1 (2) Mapping of stores onto

arrays: One can map each store onto one or more disk arrays. If narrow striping is em-

ployed, each store is mapped onto a single array (and the data object is striped across disks

in that array). Alternatively, one may construct a store by logically concatenating storage

from multiple disk arrays and stripe the object across these arrays (a logical volume man-

ager can be used to construct such a store). In the extreme case where wide striping is used,

each store spans all arrays in the system and the corresponding data object is striped across

all arrays (Figure 2.1 pictorially depicts narrow and wide striping). Since the RAID-level

selection problem has been studied in the literature [11, 61], we focus only on the problem

of mapping stores onto arrays.
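To make the two mapping options concrete, the sketch below maps a byte offset within a store to the (array, disk, disk offset) it would land on under narrow and wide striping. The array counts, stripe unit, and store-to-array assignment are hypothetical parameters chosen only for illustration, and parity placement is ignored.

```python
def locate_block(byte_offset, stripe_unit, disks_per_array, arrays_used, first_array=0):
    """Map a byte offset within a store to (array index, disk index, offset on disk).

    Data is interleaved round-robin in stripe_unit chunks across all disks of the
    arrays the store uses: a single array for narrow striping, all arrays for wide.
    Parity placement is ignored for simplicity.
    """
    chunk = byte_offset // stripe_unit            # which stripe unit of the store
    total_disks = disks_per_array * arrays_used   # disks the store is spread over
    disk_global = chunk % total_disks             # round-robin disk within the store
    row = chunk // total_disks                    # full rounds preceding this chunk
    array = first_array + disk_global // disks_per_array
    disk = disk_global % disks_per_array
    return array, disk, row * stripe_unit + byte_offset % stripe_unit

if __name__ == "__main__":
    offset = 3 * 1024 * 1024       # hypothetical offset 3 MB into the store
    unit = 512 * 1024              # 512 KB stripe unit, as used later for large requests
    # Narrow striping: the store lives entirely on array 2 of the system.
    print("narrow:", locate_block(offset, unit, disks_per_array=4, arrays_used=1, first_array=2))
    # Wide striping: the same store spans all 10 arrays in the system.
    print("wide:  ", locate_block(offset, unit, disks_per_array=4, arrays_used=10))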

The choice of narrow or wide striping for mapping stores onto arrays results in different

tradeoffs. Wide striping can result in interference when streams accessing different stores

have correlated access patterns. Such interference occurs when a request arrives at the disk

array and sees requests accessing other stores queued up at the array; this increases queuing

delays and can affect store throughput. Observe that, such interference is possible even in

narrow striping when multiple stores are mapped onto a single array. However, one can

reduce the impact of interference in narrow striping by mapping stores with anti-correlated

access patterns onto a single array. The effectiveness of such optimizations depends on the

degree to which the workload can be characterized precisely at storage allocation time, and

the degree to which request streams are actually anti-correlated. No such optimizations can

be performed in wide striping, since all stores are mapped onto all arrays. An orthogonal

1 Small writes in RAID-5 require a read-modify-write process, making them inefficient. In contrast, large (full-stripe) writes are efficient since no reads are necessary prior to a write.


issue is the inability of wide striping to exploit the sequentiality of I/O requests. In wide

striping, sequential requests from an application get mapped to data blocks on consecutive

arrays. Consequently, sequentiality at the application level is not preserved at the storage

system level. In contrast, large sequential accesses in narrow striped systems result in

sequential block accesses at the disk level, enabling thesearrays to reduce disk overhead

and improve throughput.

Despite the above advantages, a potential limitation of narrow striping is its suscepti-

bility to load imbalances. Recall that, narrow striping requires a priori information about

the application workload to map stores onto arrays such that the arrays are load-balanced.

In the event that the actual workload deviates from the expected workload, load imbalances

will result in the system. Such load skews may require reorganization of data across arrays

to re-balance the load, which can be expensive. In contrast, wide striping is more resilient

to load imbalances, since all stores are striped across all arrays, causing load increases to

be dispersed across arrays in the system.

Finally, narrow and wide striping require varying amounts of information to be spec-

ified at storage allocation time. In particular, narrow striping requires detailed workload

information for load balancing purposes and to minimize interference from overlapping

request streams. In contrast, wide striping requires only minimal workload information to

determine parameters such as the stripe unit size and the RAID level.

The objective of our study is to quantify the above tradeoffs and to determine the suit-

ability of narrow and wide striping for large storage systems.

2.2 Experimental Evaluation

We evaluate the tradeoffs of narrow and wide striping using simulations and experi-

ments on a storage system testbed.

Our storage system simulator simulates a system with multiple RAID-5 arrays; each

RAID-5 array is assumed to consist of five disks (four disks and a parity disk, referred to as


Minimum Seek            0.6 ms
Average Seek            4.7 ms
Maximum Seek            11.0 ms
Rotational Latency      5.98 ms
Rotational Speed        10,000 RPM
Maximum Transfer Rate   39.16 MB/s

Table 2.1. Characteristics of the Fujitsu Disk

a 4+p configuration). The data layout in RAID-5 arrays is left-symmetric. Each disk in the

system is modeled as an 18 GB Fujitsu MAJ3182MC disk; the characteristics of this disk

are shown in Table 2.1. The disk head movement is modeled as in [21]. We also incorporate

a write-back LRU cache to capture the effect of the storage controller cache. The cache size

is varied linearly with the number of arrays in the storage system, with 64 MB of cache per

array. The cache also employs an early destage policy to evict dirty buffers.
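For intuition about what the parameters in Table 2.1 imply, the following back-of-the-envelope sketch estimates a single random-request service time as average seek plus half a rotation plus transfer time. This is not the head-movement model of [21] used by the simulator; it is only a rough, hypothetical estimate for sanity-checking the response times reported later.

```python
# Rough per-request disk service time from the Table 2.1 parameters (Fujitsu MAJ3182MC).
AVG_SEEK_MS = 4.7
FULL_ROTATION_MS = 60_000.0 / 10_000       # 10,000 RPM -> 6 ms per revolution
MAX_TRANSFER_MB_S = 39.16

def service_time_ms(request_bytes):
    """Estimate service time for one random request: seek + half rotation + transfer."""
    transfer_ms = request_bytes / (MAX_TRANSFER_MB_S * 1e6) * 1000.0
    return AVG_SEEK_MS + FULL_ROTATION_MS / 2 + transfer_ms

if __name__ == "__main__":
    # A 4 KB OLTP-style request is dominated by positioning time (~7.8 ms),
    # while a 512 KB stripe unit adds roughly 13 ms of transfer time on top of it.
    for size in (4 * 1024, 512 * 1024):
        print(f"{size // 1024:4d} KB -> ~{service_time_ms(size):.1f} ms")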

Our storage system testbed consists of an IBM TotalStorage FAStT-700 storage subsys-

tem equipped with forty 18 GB disks. The storage subsystem is connected to a 1.6 GHz Pen-

tium 4 server with 512 MB RAM running Linux 2.4.18 over Fibre Channel. The specific

RAID configurations used in our experiments are described in the corresponding experi-

mental sections.

Depending on whether narrow or wide striping is used, each object (and the correspond-

ing store) is either placed on a single array or striped across all arrays in the system. We

assume each store is allocated a contiguous amount of space on each disk. Each data object

in the system is accessed by a request stream; a request stream is essentially an aggregation

of requests sent by different applications to the same store. For example, a request stream

for an OLTP application is the aggregation of I/O requests triggered by various transac-

tions. We use a combination of synthetic and trace-driven workloads to generate request

streams in our simulations; the characteristics of these workloads are described in the next

section.


2.2.1 Experimental Methodology

Recall that, narrow striping algorithms optimize storage system throughput by (i) collo-

cating objects that are not accessed together (i.e., collocating objects with low or zero ac-

cess correlation so as to reduce interference), and (ii) balancing the load on various arrays.

The actual system performance depends on the degree to which the system can exploit each

dimension. Consequently, we compare narrow and wide striping by systematically study-

ing each dimension—we first vary the interference between request streams and then the

load imbalance.

Our baseline experiment compares a perfect narrow striped system with the correspond-

ing wide striped system. In case of narrow striping, we assume that all arrays are load

balanced (have the same average load) and that there is no interference between streams

accessing an array. However, these streams will interfere when wide striped and our ex-

periment quantifies the resulting performance degradation. Observe that, the difference

between narrow and wide striping in this case represents the upper bound on the perfor-

mance gains that can be accrued due to intelligent narrow striping. Our experiment also

quantifies how this bound varies with system parameters such as request rates, request size,

system size, stripe unit size, and the fraction of read and write requests.

Next, we compare a narrow striped system with varying degrees of interference to a

wide striped system with the same workload. To introduce interference in narrow striping,

we assume that each array stores two independent objects. We keep the arrays load bal-

anced and vary the degree of correlation between streams accessing the two objects (and

thereby introduce varying amounts of interference). We compare this system to a wide

striped system that sees an identical workload. The objective of our experiment is to quan-

tify the performance gains due to narrow striping, if any, in the presence of inter-stream

interference. Note that, narrow striped systems will encounter such interference in prac-

tice, since (i) it is difficult to find perfectly anti-correlated streams when collocating stores,


or (ii) imprecise workload information at storage allocation time may result in inter-stream

interference at run-time.

We then study the impact of load imbalances on the relative performance of wide and

narrow striping. Specifically, we consider a narrow striped system where the load on arrays

is balanced using the average load of each stream. We then study how dynamic varia-

tions in the workload can cause load skews even when the arrays are load balanced based

on the mean load. We also study the effectiveness of wide striping in countering such load

skews due to its ability to disperse load across all arrays in the system.

Our final set of experiments compares the performance of narrow and wide striping using

two well-known database benchmarks—TPC-C and TPC-H. We also study the effects of

interference and load variations on the two systems.

Together, these scenarios enable us to quantify the tradeoffs of the two approaches along

various dimensions. We now discuss the characteristics of the workloads used in our study.

Workload characteristics: We use a combination of synthetic workloads, real-world

traces and database benchmarks to generate the request streams in our study. Whereas

trace workloads are useful for understanding the behavior of wide and narrow striping in

real-world scenarios, synthetic workloads allow us to systematically explore the param-

eter space and quantify the behavior of the two techniques over a wide range of system

parameters. Database benchmarks, on the other hand, allow for comparisons based on

“standardized” workloads. Consequently, we use a combination of these workloads for our

study.

Our synthetic workloads are generated using two types of processes: (1) Poisson ON-

OFF process: The ON and OFF periods of such a process are exponentially distributed. Re-

quest arrivals during the ON period are assumed to be Poisson. Successive requests are

assumed to access random locations on the store. The use of an ON-OFF process al-

lows us to carefully control the amount of interference between streams. Two streams are

anti-correlated when they have mutually exclusive ON periods; they are perfectly corre-


Name          Read req. rate   Write req. rate   Mean req. size   Request
              (IOPS)           (IOPS)            (bytes/req)      Streams
OLTP 1        28.27            93.79             3466             24
OLTP 2        74.31            15.93             2449             19
Web Search 1  334.91           0.07              15509            6
Web Search 2  297.48           0.06              15432            6
Web Search 3  188.01           0.06              15772            6

Table 2.2. Summary of the Traces. IOPS denotes the number of I/O operations per second.

lated when their ON periods are synchronized. The degree of correlation can be varied

by varying the amount of overlap in the ON periods of streams. (2) Closed-loop process:

A closed-loop process with concurrency N consists of N concurrent clients that issue re-

quests continuously, i.e., each client issues a new request as soon as the previous request

completes. The request sizes are assumed to be exponentially distributed and successive

requests access random locations on the store.
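As an illustration of how such an ON-OFF stream might be generated, the sketch below draws exponential ON/OFF period lengths and Poisson (exponentially spaced) arrivals within each ON period, producing a list of request arrival times and store offsets. The parameter values are hypothetical, and the actual generator used by the simulator may differ in its details.

```python
import random

def on_off_stream(mean_on_s, mean_off_s, arrival_rate, store_bytes, req_bytes, duration_s, seed=0):
    """Generate (arrival_time, byte_offset) pairs for a Poisson ON-OFF request stream.

    ON and OFF period lengths are exponential; within an ON period, arrivals are
    Poisson with the given rate and each request targets a random store location.
    """
    rng = random.Random(seed)
    requests, t, on = [], 0.0, True
    while t < duration_s:
        period = rng.expovariate(1.0 / (mean_on_s if on else mean_off_s))
        if on:
            a = t + rng.expovariate(arrival_rate)
            while a < min(t + period, duration_s):
                offset = rng.randrange(0, store_bytes - req_bytes)
                requests.append((a, offset))
                a += rng.expovariate(arrival_rate)
        t += period
        on = not on
    return requests

if __name__ == "__main__":
    # Hypothetical small-request stream: ~250 req/s (4 ms inter-arrival) while ON.
    reqs = on_off_stream(mean_on_s=1.0, mean_off_s=1.0, arrival_rate=250.0,
                         store_bytes=2 * 1024**3, req_bytes=4096, duration_s=10.0)
    print(len(reqs), "requests generated")
    if reqs:
        print("first arrival at %.3f s, offset %d" % reqs[0])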

Both the Poisson ON-OFF and closed-loop processes can generate two types of request

streams—those that issue small requests and those that issue large requests. Streams with

large requests are representative of decision support systems (DSS), while those with small

requests represent OLTP applications. Since DSS workloads access large amounts of data,

we assume a mean request size of 1MB for large requests. On the other hand, since OLTP

applications generate small requests, we use 4KB for small requests; the request sizes are

assumed to be exponentially distributed. Prior studies have used similar parameters [7].

The stripe unit sizes of the stores being accessed by large and small requests were set to be

512KB and 4KB, respectively.

We also use a collection of block-level I/O trace workloads for our study; these com-

prise (i) traces of the I/O workloads from OLTP applications of two large financial institutions,

with different mixes of read and write requests, and (ii) traces from a popular web

search engine, consisting mostly of read requests. The characteristics of these traces are


listed in Table 2.2. Traces labeled OLTP 1 and OLTP 2 are the I/O workloads from the two

financial institutions and have different mixes of read and write requests. Traces labeled

Web Search 1 through 3 are the I/O traces from the web search engine and consist mostly

of read requests. Thus, the traces represent different storage environments and, as shown

in Table 2.2, have different characteristics.

2.2.2 Ideal Narrow Striping versus Wide Striping

We first compare a load-balanced, interference-free narrow striped system with a wide

striped system using homogeneous and heterogeneous workloads. In case of homogeneous

workload, all streams generate requests of similar sizes. In case of heterogeneous workload,

streams generate requests of different sizes.

2.2.2.1 Comparison using Homogeneous Workloads

We compare narrow and wide striping, first for small request sizes and then for large

requests. Our simulations assume that each narrow striped array consists of a single store

(and a single request stream), while all stores are striped across all arrays in wide striping.

We use closed-loop workloads to generate request streams; the concurrency factors for the

large and small closed-loop workloads were assumed to be 2 and 4, respectively. We vary

the number of arrays in the system, i.e., the system size, and measure the average response

time in the two systems. Figures 2.2(a) and 2.2(b) depict the response times for large and

small request sizes, respectively, in the two systems. When the system size is 1 (i.e., a

single array accessed by a single stream), narrow and wide striping are identical. Further,

since each request stream accesses a different array in narrow striping, the system size

has no impact on the response time. In other words, the performance of narrow striping is

represented by the system size of 1 (and remains unchanged). In contrast, the response time

of wide striping degrades with increasing system sizes. This is primarily due to increased

interference between request streams. However, as shown in Figure 2.2, the impact of such


[Figure 2.2 plots mean response time (ms) against system size (# of arrays) for (a) large requests and (b) small requests under the homogeneous workload.]

Figure 2.2. Effect of system size for homogeneous closed-loop workloads. System size of 1 depicts narrow striping.

interference increases slowly with system size. Overall, we find that wide striping sees

response times that are 10-20% worse than narrow striping.

We validate the results of the above simulation experiment using our FAStT-700 storage

testbed. We configure the FAStT with two RAID-5 arrays (4+p configuration). We create

two stores, each 2 GB in size, on the storage system. For large requests, the stripe unit

size of the store is 256 KB and for small requests it is configured to be 8 KB. We used the

Linux Logical Volume Manager (LVM) for wide striping the stores. The mean request size

for large and small requests is chosen to be 512 KB and 8 KB, respectively. We compare

narrow and wide striping using closed-loop workloads with different concurrency factors

(see Figure 2.3). As Figure 2.3 demonstrates, the response time in the wide-striped system

is about 10-15% higher than in the narrow-striped system, which is consistent with the

results of our simulations.

In addition to the above experiments, we compared narrow and wide striping by varying

a variety of system parameters such as the stripe unit size, the request size, the utilization

level, and the percentage of write requests. Our experiments were carried out for both

closed-loop and the open-loop Poisson ON-OFF workloads. In each case, we found that, if


[Figure 2.3 plots mean response time (ms) against the client concurrency factor for narrow and wide striping; panels: (a) Large Requests, (b) Small Requests.]

Figure 2.3. Homogeneous Workload: Closed-loop Testbed Experiments

the stripe unit size is chosen carefully, the performance of narrow and wide striped systems

is comparable and within 10-15% of one another. To avoid repetition, we present the

results from these experiments in the Appendix.

2.2.2.2 Comparison using Heterogeneous Workloads

To introduce heterogeneity into the system, we assume that each narrow striped array

consists of two stores, one accessed by large requests and the other by small requests (we

denote these request streams as Li and Si, respectively, where i denotes the ith array in the

system). In case of narrow striping, we ensure that only one of these streams is active at any

given time. This is achieved by assuming that Li and Si are anti-correlated (have mutually

exclusive ON periods). We do not assume any correlations between streams accessing

independent arrays (i.e., between streams Li and Lj or Si and Sj). Consequently, like

before, the narrow striped system is load-balanced and free of inter-stream interference.


The wide striped system, however, sees a heterogeneous workload due to the simultaneous

presence of small and large requests.

We use Poisson ON-OFF processes to understand the effect of various parameters such

as system size, stripe unit size, utilization level, etc. As before, we assume a mean request

size of 1MB for large requests and 4KB for small requests. The default stripe unit size is

chosen to be 512KB and 4KB for the corresponding stores. Unless specified otherwise,

we chose request rates that yield utilization of around 60-65%; this corresponds to a mean

inter-arrival (IA) time of 17 ms for large requests and 4 ms for small requests, respectively.

Effect of System Size: We vary the number of arrays in the system from 1 to 10 and

measure the average response time of the requests for wide and narrow striping. Since each

array is independent in narrow striping, the system size has no impact on the performance

of an individual array. Hence, like before, the performance of narrow striping is represented

by a system size of 1 (and remains unchanged, regardless of the system size). Figures 2.4(a)

and 2.4(b) show the response times on large and small requests, respectively, for varying

system sizes. The figure shows that while large requests see comparable response times in

wide striping, small requests see worse performance. To understand this behavior, we note

that two counteracting effects come into play in a wide-striped system. First, since stores

span all arrays, there is better load balancing across arrays, yielding smaller response time.

Second, requests see additional queues that they would not have seen in a narrow striped

system, which increases the response time. This is because wide-striped streams access

all arrays and interfere with one another. Hence, a small request might see another large

request ahead of it, or a large request might see another large request from an independent

stream, neither of which can happen in a narrow striped system. Our experiment shows that,

for large requests, as one increases the system size, the benefits of better load-balancing

offset the slight degradation due to the interference; this is primarily due to the large size

of the requests. For small requests, the interference effect dominates (since a large request

can substantially slow down a small request), leading to a higher response time in wide


[Figure 2.4 plots mean response time (ms) against system size (# of arrays) for (a) large requests and (b) small requests under the heterogeneous workload.]

Figure 2.4. Effect of system size for heterogeneous Poisson workloads. System size of 1 depicts narrow striping.

striping. Observe that the response time is higher by approximately the transfer time of a stripe

unit of a large request.
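For reference, at the disk's maximum transfer rate of 39.16 MB/s (Table 2.1), transferring a 512 KB stripe unit takes roughly 13 ms, which gives a sense of the magnitude of this gap.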

Effect of Stripe Unit Size: In this experiment, we evaluate the impact of the stripe unit

size in wide and narrow striping. We vary the stripe unit size from 64KB to 2MB for large

requests, and fix the stripe unit size for small requests at 4KB. Since the stripe-unit size of

small requests did not have much impact on performance, we omit these results.

First, consider the impact of varying the large stripe unit size on large requests (see

Figure 2.5(a)). When the large stripe unit is 64KB, a request of 1MB size causes an average

of 16 blocks to be accessed per request. In case of narrow striping, since each stream is

striped across a 4+p array, multiple blocks are accessed from each disk by a request. Since

these blocks are stored contiguously, the disks benefit from sequential accesses to large

chunks of data, which reduces disk overheads. In wide striping, each 1MB request accesses

a larger number of disks, which reduces the number of sequential accesses on a disk and

also increases the queue interference for both large and small requests. Consequently,

narrow striping outperforms wide striping for a 64KB stripe unit size. As we increase the

stripe unit size to 512 KB and beyond, the impact of loss in sequential access goes down.


[Figure 2.5 plots mean response time (ms) against the stripe-unit size (KB) of large requests for system sizes 1, 2, 3, 5, 10, and 15; panels: (a) Large Requests, (b) Small Requests.]

Figure 2.5. Effect of varying the stripe unit size of large requests. System size of 1 depicts narrow striping.

This, coupled with the larger number of disk heads that are available for each request in

wide striping, leads to better performance for wide striping. Since the stripe unit is not varied

for small requests, their performance is impacted mainly by the utilization levels resulting from the different

stripe unit choices for large requests (see Figure 2.5(b)). For small requests, due to the

interference from large requests, wide striping leads to higher response time. Since disk

overhead, and consequently utilization, is higher in wide striping at smaller stripe unit sizes,

small requests see worse response times. As the stripe unit size is increased, the disk overhead

decreases and hence the relative response time performance of wide striping improves.

But, beyond 512 KB, the transfer times of the large stripe units become significant, and

the response times of the small requests increase in wide striping.
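To see why sequentiality degrades, the following sketch computes how many contiguous stripe-unit chunks of a single large request land on each disk under narrow striping (one 4+p array) versus wide striping (all arrays); fewer chunks per disk means less sequential access and more per-disk overhead. The array counts are hypothetical values chosen to match the simulation configurations above.

```python
def chunks_per_disk(request_bytes, stripe_unit, data_disks):
    """Average number of stripe-unit chunks of one request that fall on a single disk."""
    chunks = request_bytes // stripe_unit        # stripe units touched by the request
    return chunks / min(chunks, data_disks)      # spread over at most that many disks

if __name__ == "__main__":
    REQ = 1024 * 1024                            # 1 MB large request
    for unit_kb in (64, 512):
        narrow = chunks_per_disk(REQ, unit_kb * 1024, data_disks=4)       # one 4+p array
        wide = chunks_per_disk(REQ, unit_kb * 1024, data_disks=4 * 10)    # ten arrays
        print(f"{unit_kb:4d} KB stripe unit: narrow ~{narrow:.0f} chunks/disk, wide ~{wide:.0f} chunk(s)/disk")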

Effect of the Utilization Level: In this experiment, we study the impact of the utiliza-

tion level on the response times of wide and narrow striping. We vary the utilization level

by varying the inter-arrival (IA) times of requests. We first vary the IA times of large re-

quests from 11ms to 20ms with the IA time of small requests fixed at 4ms (see Figure 2.6).

We then vary the IA times of small requests from 2ms to 7ms with the IA time for large


[Figure 2.6 plots mean response time (ms) against the mean inter-arrival time (ms) of large requests for system sizes 1, 2, 3, 5, 10, and 15; panels: (a) Large Requests, (b) Small Requests.]

Figure 2.6. Effect of the inter-arrival times of large requests. System size of 1 depicts narrow striping.

requests fixed at 17ms (see Figure 2.7). The various combinations of inter-arrival times and

background loads result in utilizations ranging from 40% to 80%.

Figure 2.6(a) shows that, for large requests, wide striping outperforms narrow striping

at high utilization levels and has slightly worse performance at low utilization levels. This is

because, at higher utilization levels, the effects of striping across a larger number of arrays

dominate the effects of interference, yielding better response times in wide striping (i.e., the

larger number of arrays yield better statistical multiplexing gains and better load balancing

in wide striping). Small requests, on the other hand, see uniformly worse performance due

to the interference from large requests (see Figure 2.6(b)). The interference decreases at

lower request rates and reduces the performance gap between the two systems.

The behavior is reversed when we vary the IA time of small requests (see Figure 2.7).

At low inter-arrival times, large requests see maximum interference from small requests,

and wide striping yields worse response times as a result. As the IA time is increased, the

interference decreases, and the load balancing effect dominates leading to better response

time in wide striping. For small requests, the response time difference in narrow and wide

striping is always in the range of the transfer time for one stripe unit of a large request.


[Figure 2.7 plots mean response time (ms) against the mean inter-arrival time (ms) of small requests for system sizes 1, 2, 3, 5, 10, and 15; panels: (a) Large Requests, (b) Small Requests.]

Figure 2.7. Effect of inter-arrival times of small requests. System size of 1 depicts narrow striping.

Effect of Request Size: In this experiment, we study the impact of the request size of

large requests on the performance of wide and narrow striping. Varying the request size of

small requests (in the range 2KB-16KB) did not have much impact, so we omit the results.

We vary the average request size for large requests from 64KB to 2 MB (see Figure 2.8).

The stripe unit size was chosen to be half the average request size for large requests; the

average request size as well as the stripe unit size was fixed at 4 KB for small requests.

Figure 2.8(a) demonstrates that for large streams, initially (i.e., at small request sizes),

queue interference results in a slightly higher response time (by approximately the average seek

time) in wide-striping. However, as the request size increases, the utilization increases and

wide-striping leads to lower response times due to better load balancing. On the other

hand, for small requests, wide-striping leads to larger response times, and the performance

difference increases as we increase the large request size due to the increased transfer times

of the large requests.

Effect of Writes: The above experiments have focused solely on read requests. In this

experiment, we study the impact of write requests by varying the fraction of write requests

in the workload. We vary the fraction of write requests from 10% to 90% and measure


[Figure 2.8 plots mean response time (ms) against the mean request size (KB) of large requests for system sizes 1, 2, 3, 5, 10, and 15; panels: (a) Large Requests, (b) Small Requests.]

Figure 2.8. Effect of request size of large requests. System size of 1 depicts narrow striping.

their impact on the response times in the wide and narrow striped systems. Recall that we

simulate a write-back LRU cache.

We first vary the percentage of writes of the large requests with the small requests set to

be read only (see Figure 2.9). Due to the write-back nature of the cache, all write requests

return immediately after updating the cache. Consequently, the response times of write re-

quests are identical in both narrow and wide striping. Hence, the overall response times (for

both reads and writes) are governed mostly by read response times and the relative fraction

of reads. In general, increasing the percentage of write requests increases the background

load due to dirty cache flushes as well as the effective utilization (since the parity block also

needs to be updated on a write2). Both of these factors interfere with read requests. For

large requests, the increased interference is offset by the better load dispersion capability

of wide striping, causing wide striping to outperform narrow striping—this performance

advantage improves for larger system sizes (see Figure 2.9(a)). For small requests, on the

2 Instead of reading the rest of the parity group, an intelligent array controller can read just the data block(s) being overwritten and the parity block to reconstruct the parity for the remaining data blocks. We assume that the array dynamically chooses between a read-modify-write and this reconstruction write strategy depending on which of the two requires fewer reads.


[Figure 2.9 plots mean response time (ms) against the percentage of large write requests for system sizes 1, 2, 3, 5, 10, and 15; panels: (a) Large Requests, (b) Small Requests.]

Figure 2.9. Effect of percentage of large write requests. System size of 1 depicts narrow striping.

other hand, the interference effect dominates at low utilization, causing wide striping to

yield worse response times (see Figure 2.9(b)). As the percentage of writes is increased

beyond 70%, wide striping outperforms narrow striping. This is because the interference

from background cache flushes and parity updates becomes dominant in write-intensive

workloads, and wide striping yields better load balancing properties in the presence of

such interference.

Next we vary the percentage of writes for the small requests (see Figure 2.10). The

large request streams issue only read requests. As we increase the percentage of small

write requests, we see that large requests face queue interference in a wide-striped system;

consequently, narrow striping gives better performance. For small requests, as the write

percentage is increased and the utilization goes up, the role of load balancing becomes

significant and the performance of wide-striping improves, giving comparable performance

at high write percentages.


[Figure 2.10 plots mean response time (ms) against the percentage of small write requests for system sizes 1, 2, 3, 5, 10, and 15; panels: (a) Large Requests, (b) Small Requests.]

Figure 2.10. Effect of percentage of small write requests. System size of 1 depicts narrow striping.

2.2.2.3 Summary

The above experiments compared a load-balanced, interference-free narrow striped sys-

tem with a wide striped system using homogeneous and heterogeneous workloads. Our ex-

periments demonstrate that, in the case of homogeneous workloads, narrow striping yields 10-15% better response times for some scenarios, while the two systems yield compara-

ble performance for most other scenarios. In case of heterogeneous workloads, our exper-

iments demonstrated that if the stripe unit size is chosen appropriately, then wide striping

yields better response times for large requests in most scenarios. In some cases, wide strip-

ing yields higher response times (in the range of an average seek time). For small requests,

on the other hand, wide striping yields worse performance in most scenarios (the perfor-

mance difference is in the range of the transfer time of a large stripe unit). In general, we find that

as utilization increases (for instance, by increasing write percentage), wide-striping leads

to better performance.


2.2.3 Impact of Inter-Stream Interference

While our experiments thus far have assumed an ideal (interference-free, load-balanced)

narrow-striped system, in practice, storage systems are neither perfectly load balanced nor

interference-free. In this section, we examine the impact of one of these dimensions—

inter-stream interference—on the performance of narrow and wide striping.

To introduce interference systematically into the system, we assume a narrow striped

system with two request streams, Li and Si, on each array. Each stream is an ON-OFF

Poisson process and we vary the amount of overlap in the ON periods of each (Li, Si) pair.

Doing so introduces different amounts of correlations (and interference) into the workload

accessing each array. Initially, streams accessing different arrays are assumed to be uncor-

related (thus, Si and Sj as well as Li and Lj are uncorrelated for all i, j). Like before, all

streams access all arrays in wide striping. We vary the correlation between each (Li, Si)

pair from 0 to 1 and measure its impact on the response times of large and small requests

(correlation of 0 implies that Li and Si are never ON simultaneously, while 1 implies that

they are always ON simultaneously). We control the correlation by varying the overlap frac-

tion, i.e., the mean time for which the two streams are ON simultaneously. For simplicity, we

assume that the correlated streams have the same ON periods; also, we assume the OFF pe-

riod to have the same duration as the ON period. This gives us a high degree of control on

stream correlations. For a correlation of x, 0 ≤ x ≤ 0.5, the overlap fraction is uniformly

distributed between 0 and 2x. For correlations between 0.5 and 1, the overlap fraction is

uniformly distributed between 2x-1 and 1.0.
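A minimal sketch of this sampling rule is shown below: given a target correlation x, it draws the fraction of an ON period during which the two streams overlap. This is only an interpretation of the rule as stated above; the thesis simulator may implement it differently.

```python
import random

def sample_overlap_fraction(x, rng=random):
    """Draw an ON-period overlap fraction for a target correlation x in [0, 1].

    For 0 <= x <= 0.5 the fraction is uniform on [0, 2x];
    for 0.5 < x <= 1 it is uniform on [2x - 1, 1].
    Either way, the expected overlap fraction equals x.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("correlation must lie in [0, 1]")
    if x <= 0.5:
        return rng.uniform(0.0, 2.0 * x)
    return rng.uniform(2.0 * x - 1.0, 1.0)

if __name__ == "__main__":
    rng = random.Random(1)
    for target in (0.0, 0.25, 0.5, 0.75, 1.0):
        draws = [sample_overlap_fraction(target, rng) for _ in range(10000)]
        print(f"target {target:.2f}: mean overlap {sum(draws) / len(draws):.3f}")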

Figure 2.11 plots the impact of correlation on response time in narrow and wide striped

systems. As the figure demonstrates, the performance of wide-striping improves with in-

crease in correlation, with wide-striping performing better for both small and large request

sizes for correlation values higher than 0.25. Observe that as correlation increases, the

probability of temporary load-imbalance in the narrow striped system increases. Since


[Figure 2.11 plots mean response time (ms) against the mean overlap fraction (correlation) for system sizes 1, 2, 3, 5, 10, and 15; panels: (a) Large Requests, (b) Small Requests.]

Figure 2.11. Impact of inter-stream interference. System size of 1 depicts narrow striping.

wide-striping yields better load-balancing, it leads to better performance as correlation in-

creases.

2.2.4 Impact of Load Skews: Trace-driven Simulations

We use the trace workloads listed in Table 2.2 to evaluate the impact of load imbalance

on the performance of narrow and wide striping. The traces have a mix of read and write

I/Os and small and large I/Os. To illustrate, the OLTP-1 trace has a large fraction of small

writes (mean request size is 2.5KB), while the Web-Search-1 trace consists of large reads

(mean request size is 15.5KB). Our simulation setup is the same as in the previous sections, ex-

cept that each request stream is driven by traces instead of a synthetic ON-OFF process.

Due to the high percentage of writes in the OLTP streams, a cache of sufficient size re-

sulted in similar performance for both narrow and wide striping when operated in write-back

mode; so, in the following, we operate the cache in write-through mode.

To compare the performance of narrow and wide-striping using these traces, we sepa-

rate each independent stream from the trace (each stream consists of all requests accessing

a volume). This pre-processing step yields 61 streams. We then eliminate 9 streams from

the search engine traces, since these collectively contained less than 1000 requests (and are


effectively inactive). We further eliminate 4 streams from the OLTP traces as these were

found to be capacity bound. We then partition the remaining 48 streams into four sets such

that each set is load-balanced. We use the write-weighted average IOPS3 of each stream as

our load balancing metric—in this metric, each write request is counted as four I/Os (since

each write could, in the worst case, trigger a read-modify-write operation involving four

I/O operations). Since the size of each I/O operation is relatively small, we did not consider

stream bandwidth as a criterion for load balancing.

We employ a greedy algorithm for partitioning the streams. The algorithm creates a

random permutation of the streams and assigns them to partitions one at a time, so that

each stream is mapped to the partition that results in the least imbalance (the imbalance is

defined as the difference between the load of the most heavily loaded and the most lightly

loaded partitions). We repeat the process (by starting with another random permutation)

until we find a partitioning that yields an imbalance of less than 1%.
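A minimal sketch of this greedy partitioning procedure is shown below, using the write-weighted IOPS metric (reads + 4 × writes) as the load of each stream. The per-stream values are made-up placeholders; the imbalance definition and the 1% stopping rule follow the description above, but the actual implementation used in the thesis may differ.

```python
import random

def write_weighted_iops(read_iops, write_iops):
    """Load metric: each write counts as four I/Os (worst-case read-modify-write)."""
    return read_iops + 4.0 * write_iops

def greedy_partition(loads, num_partitions=4, max_imbalance=0.01, max_tries=10000, rng=random):
    """Repeatedly shuffle the streams and place each on the currently lightest partition,
    stopping once the relative imbalance (heaviest minus lightest, over heaviest) drops
    below max_imbalance (or returning the best assignment found after max_tries)."""
    best = None
    for _ in range(max_tries):
        order = list(loads.items())
        rng.shuffle(order)
        bins = [0.0] * num_partitions
        assignment = {}
        for stream, load in order:
            target = min(range(num_partitions), key=lambda i: bins[i])
            bins[target] += load
            assignment[stream] = target
        imbalance = (max(bins) - min(bins)) / max(bins)
        if best is None or imbalance < best[0]:
            best = (imbalance, assignment, bins)
        if imbalance < max_imbalance:
            break
    return best

if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical per-stream loads (not taken from Table 2.2).
    streams = {f"stream-{i}": write_weighted_iops(rng.uniform(5, 80), rng.uniform(0, 40))
               for i in range(48)}
    imbalance, assignment, bins = greedy_partition(streams, rng=rng)
    print("imbalance:", round(imbalance, 4), "per-partition load:", [round(b, 1) for b in bins])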

Assuming a system with four RAID-5 arrays, each configured with 4+p disks, we map

each partition to an array in narrow striping. All partitions are striped across all four arrays

in wide striping. We computed the average response time as well as the 95th percentile of

the response times for each stream in the two systems. Figure 2.12 plots the average response

time and the 95th percentile of the response time for the various streams (the X axis is

the stream id). As shown in the figure, wide striping yields average response times that

are comparable to those of a narrow striped system. Figure 2.12(c) shows the mean disk

utilizations for the disks in the system (the X axis is the disk id). Observe that the variance in

the mean disk utilizations across the disks in the system is lower in a wide-striped system

due to better load balancing. Also, even in the case of narrow striping, the variance in

disk utilizations is low since the partitions are load balanced (a partition comprises five

consecutive disks).

3 I/O Operations Per Second


[Figure 2.12 plots, for the load-balanced trace workload, (a) the mean response time (ms) and (b) the 95th percentile of the response time (ms) per stream id, and (c) the mean utilization of each disk, for narrow and wide striping.]

Figure 2.12. Trace Driven Simulations


[Figure 2.13 plots, for the imbalanced trace workload, (a) the mean response time (ms) and (b) the 95th percentile of the response time (ms) per stream id, and (c) the mean utilization of each disk, for narrow and wide striping.]

Figure 2.13. Trace Driven Simulations with Load Imbalance

Next we introduce load imbalance across the partitions and compare the performance

of narrow and wide striping. To introduce imbalance we simply scale the inter-arrival times

for all streams on the first two partitions by a factor of 0.75 (streams 0-25). In the narrow

striped system, this increases the load on the first two partitions (disks 0-9) while the load

on the other two partitions remains unchanged; for a wide striped system, however, the load

across all the partitions goes up. Figure 2.13 plots the results. As can be seen in Figure 2.13

(c) the mean utilization across the first two partitions (first ten disks) has gone up in narrow

striping, and the utilization across all the partitions has gone up for wide striping (compare

with Figure 2.12 (c)). A look at the plots for the average response time as well as the plot


for the 95th percentile of the response time shows that wide striping outperforms narrow

striping for streams on the first two partitions, and their performance on the remaining

two partitions is similar. Thus we see that, due to better load balancing, wide striping

outperforms narrow striping in the presence of load imbalances.

2.2.5 Experiments on a Storage System Testbed

In this section, we compare the performance of narrow and wide striping on our storage

testbed using a synthetic workload and two database benchmark workloads—TPC-C and

TPC-H. For the synthetic and TPC-H workloads, we use a FAStT-700 storage

subsystem, and for the TPC-C workload, we use an SSA-based RAID subsystem.

2.2.5.1 Synthetic Workload

The workload consists of two closed-loop streams, one large and one small, accessing

two independent stores on a RAID-5 array simultaneously. Each store was of size 2GB

and was created on a 4+p RAID-5 array on the FAStT. For large requests the stripe unit

size of the store was 256 KB and for small requests it was configured to be 8 KB. We

used the Linux Logical Volume Manager (LVM) for wide striping the stores. The mean

request size for large and small requests was chosen to be 512KB and 8 KB, respectively.

Figure 2.14 shows the response time performance of narrow and wide-striping, for various

combinations of concurrency factors of the clients accessing the large and small stores,

respectively. As the experiments demonstrate, the performance of wide striping is within 10-15% of the narrow striped system.

2.2.5.2 TPC-H Workload

TPC-H is a decision support benchmark. It was used in [5] to illustrate the benefit of

narrow striping. We use a setup similar to the one in [5] with IBM DB2 UDB instead of MS

SQL server. We set up the TPC-H database on a 1.6 GHz Pentium 4 with 512 MB RAM

running Linux 2.4.18. This was connected to the FAStT-700 storage system using Fibre


[Figure 2.14 plots mean response time (ms) for narrow and wide striping against the client concurrency factor pair (large, small); panels: (a) Large Requests, (b) Small Requests.]

Figure 2.14. Heterogeneous Workload: Closed-loop Testbed Experiments


Channel. The page size and the extent size for the database were chosen to be 4 KB and 32

KB, respectively. The scale factor for the database was set to 1 (a 1 GB database).

For narrow striping, we used the placement described in [5]. The table lineitem was

spread uniformly over five disk drives, orders was spread uniformly over three other disk

drives, and all the other tables and indexes (including the indexes for the tables lineitem

and orders) were placed in a third logical volume (called rest), which was striped across all

the 8 disk drives. In the wide-striped case, the tables lineitem and orders were striped across

all the 8 disk drives, as was the logical volume rest. In both cases, the system temporary

tables were placed on a ninth disk drive, also on the FAStT-700. The stripe unit size was

chosen to be 32 KB in all cases.

Figure 2.15(a) shows the query execution times for narrow and wide striping for a single

stream run (power run) of TPC-H. Since this is an unaudited run, the query execution times

are normalized. As the figure demonstrates, most of the queries have similar execution

times. Only for queries 20 and 21 do we see a 5-10% performance difference; narrow

striping outperforms wide striping for query 20, and vice versa for query 21.

Figures 2.15(b) and 2.15(c) plot the I/O profiles for the lineitem, orders, and rest volumes

for narrow and wide striping, respectively. The figure demonstrates that the I/O profiles are

indeed very similar in both cases. It also demonstrates that lineitem and orders are indeed

the two important tables and that the narrow placement algorithm suggested in [5] appears to

be valid for DB2 as well. Overall, we find that when the tables are carefully mapped to

arrays in narrow striping, the two systems perform comparably (note that, no placement

optimizations are necessary for wide striping).

2.2.5.3 TPC-C Workload

Our final experiment involves a comparison of narrow and wide striping using the TPC-

C workload. Our testbed consists of a four-processor IBM RS6000 machine with 512

MB RAM and AIX 4.3.3. The machine contains an SSA RAID adapter card with two


[Figure 2.15 shows (a) the normalized TPC-H query execution times per query id for narrow and wide striping, and the normalized I/O rate over time for the lineitem, orders, and rest volumes under (b) narrow striping and (c) wide striping.]

Figure 2.15. Comparison using the TPC-H Benchmark

channels (also called SSA loops) and sixteen 9 GB disks on each channel (total of 32 disks).

We configured four RAID-5 arrays, two arrays per channel, each in a 7+p configuration.

Whereas two of these arrays are used for our narrow striping experiment, the other two are

used for wide striping (thus, each experiment uses 16 disks in the system). The SSA RAID

card uses a stripe unit size of 64 KB on all arrays; the value is chosen by the array controller

and can not be changed by the system. However, as explained below, we use large requests

to emulate the behavior of larger stripe unit sizes in the system. We use two workloads in

our experiments:


• TPC-C benchmark: The TPC-C benchmark is an On-Line Transaction Processing

(OLTP) benchmark and results in mostly small size random I/Os. The benchmark

consists of a mix of reads and writes (approximately two-thirds reads and one-third

writes [7]). We use a TPC-C setup with 300 warehouses and 30 clients.

• Large sequential: This is an application that reads a raw volume sequentially using

large requests. The process has an I/O loop that issues requests using the read() system

call. Since we cannot control the array stripe unit size, we emulate the effect

of large stripe units by issuing large read requests. We use two request sizes in our

experiments. To emulate a 128 KB stripe unit size, we issue 896 KB requests (since

64 KB × 7 disks = 448 KB, an 896 KB request will access two 64 KB chunks on

each disk). We also find experimentally that the throughput of the array is maximized

when requests of 448 KB × 16 = 7 MB are issued in a single read() call. Hence, we

use 7 MB as the second request size (which effectively requests sixteen 64 KB blocks

from each disk); a short sketch of this emulation arithmetic follows the list.
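The sketch below restates the emulation arithmetic: given the controller's fixed 64 KB stripe unit and 7 data disks per array, it computes the read() request size needed so that each disk serves a desired number of consecutive stripe units. It is simply a restatement of the calculation above, not code from the thesis.

```python
STRIPE_UNIT_KB = 64       # fixed by the SSA RAID controller
DATA_DISKS = 7            # 7+p RAID-5 configuration

def emulation_request_kb(chunks_per_disk):
    """Request size (KB) so that each data disk serves chunks_per_disk consecutive 64 KB units."""
    return STRIPE_UNIT_KB * DATA_DISKS * chunks_per_disk

if __name__ == "__main__":
    print(emulation_request_kb(2), "KB")            # 896 KB: emulates a 128 KB stripe unit
    print(emulation_request_kb(16) // 1024, "MB")   # 7168 KB = 7 MB: sixteen 64 KB chunks per disk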

We first experiment with a narrow striped system by running the TPC-C benchmark on

one array and the sequential application on the other array. We find the TPC-C through-

put to be N TpmC (the exact number is withheld since this is an unaudited run), while the

throughput of the sequential application is 25.43 MB/s for 896 KB requests and 29.45 MB/s

for 7 MB requests (see Table 2.3).

We then experiment with a wide striped system. To do so, we create three logical

volumes on the two arrays using the AIX volume manager. Two ofthese volumes are

used for the TPC-C data, index, and temp space, while the third volume is used for the

sequential workload. As shown in Table 2.3, the TPC-C throughput is 1.33N TpmC when

the sequential workload uses 896 KB requests and is 0.82N TpmC for 7MB requests. The

corresponding sequential workload throughput is 20.09 MB/s and 36.86 Mb/s, respectively.

Thus, we find that for the sequential workload, small requests favor narrow striping,

while large requests favor wide striping. For TPC-C workload, the reverse is true, i.e., small

40

Page 58: SELF-MANAGING TECHNIQUES FOR STORAGE RESOURCE …lass.cs.umass.edu/theses/vijay.pdf · Tyler Trafford has been most helpful with configuring the Lin ux cluster and the storage testbed

Striping Sequential Sequential Normalized TPC-CI/O Size Throughput Throughput

Narrow 896 KB 25.43 MB/s N TpmCWide 896 KB 20.09 MB/s 1.33 N TpmC

Narrow 7 MB 29.45 MB/s N TpmCWide 7 MB 36.86 MB/s 0.82 N TpmC

Table 2.3.TPC-C and Sequential Workload Throughput in Narrow and WideStriping

requests favor wide striping and large requests favor narrow striping. This is because the

performance of TPC-C is governed by the interference from the sequential workload. The

interference is greater when the sequential application issues large 7MB requests, resulting

in lower throughput for TPC-C. There is less interference when the sequential application

issues 898 KB (small) requests; further, TPC-C benefits fromthe larger number of arrays in

the wide striped system, resulting in a higher throughput. This behavior is consistent with

the experiments presented in previous sections. Furthermore, the performance difference

(i.e., improvement/degradation) between the two systems is around 20%, which is again

consistent with the results presented earlier.

2.3 Summary and Implications of our Experimental Results

Our experiments show that narrow striping yields better performance for small requests when the streams can be ideally partitioned such that the partitions are load-balanced and there is very little interference between streams within a partition. However, in the presence of workload skews that occur in real I/O workloads, wide striping outperforms narrow striping. In our trace-driven experiments, we found that when the average load was balanced, wide striping performed comparably to narrow striping. However, when we introduced load imbalance by increasing the load on some partitions, wide striping outperformed narrow striping for the streams on the heavily loaded partitions while performing comparably for the remaining streams. With a TPC-C workload, we found that if the stripe unit is chosen appropriately, then narrow and wide striping have comparable performance even though there are no workload skews due to the “constant-on” nature of the benchmark. In our closed-loop testbed experiments and the TPC-H experiments, we found the performance of narrow and wide striping to be comparable.

In situations where it is beneficial to do narrow striping, significant efforts are required to extract those benefits. First, the workload has to be determined either from an initial specification or by system measurement. Since narrow placement derives benefits from exploiting the correlation structure between streams, the characteristics of the streams as well as the correlations between the streams need to be determined. It is not known whether stream characteristics or the inter-stream correlations are stable over time. Hence, if the assumptions made by the narrow placement technique change, then load imbalances and hot-spots may occur. These hot-spots have to be detected and the system re-optimized using techniques such as [9]. This entails moving stores between arrays to achieve a new layout [39]. The process of data movement itself has overheads that can affect the performance. Furthermore, data migration techniques are only useful for long-term or persistent workload changes; short-time-scale hot-spots that occur in modern systems can not be effectively resolved by such techniques. Thus, it is not apparent that it is possible to extract the benefits of narrow striping for dynamically changing (non-stationary) workloads. Storage systems that employ narrow striping [7, 9] have only been compared against manually-tuned narrow striped systems. While these studies have shown that such systems can perform comparably to or outperform human-managed narrow striped systems, no comprehensive comparison with wide striping was undertaken in these efforts.

In contrast to narrow striping, which requires detailed workload knowledge, the only

critical parameter in wide striping seems to be the stripe unit size. Our experiments high-

light the importance of choosing an appropriate stripe unit for each store in a wide striping

system (for example, large stripe units for streams with large requests). While an optimal

stripe unit size may itself depend on several workload parameters, our preliminary experi-


ments indicate that choosing the stripe unit size based on the average request size is a good

rule of thumb. For example, in our experiments, we chose the stripe unit to be half the

average request size. Detailed analytical and empirical models for determining the optimal

stripe unit size also exist in the literature [20, 21, 50].
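To make the rule of thumb concrete, the following sketch derives a stripe unit from a measured average request size; the 4 KB floor is an assumption added only for illustration, not a recommendation from the studies cited above.

    # Illustrative only: stripe unit chosen as half the average request size.
    def suggest_stripe_unit(avg_request_size_bytes):
        return max(avg_request_size_bytes // 2, 4 * 1024)

    # Example: a workload with 128 KB average requests would be assigned a 64 KB stripe unit.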

For storage management one must also consider issues unrelated to performance when

choosing an appropriate object placement technique. For example, system growth has dif-

ferent consequences for narrow and wide striping. In case of narrow striping, when additional storage is added, data does not necessarily have to be moved; data needs to move only to ensure optimal placement. In case of wide striping, data on all stores needs to be reorganized to accommodate the new storage. Although this functionality can be automated and implemented in the file system, volume manager, or RAID controllers without requiring application down-time, the impact of this issue depends on the frequency of system growth. In enterprise environments, system growth is usually governed by purchasing cycles that

are long. Hence, we expect this to be an infrequent event and not be a significant issue

for wide-striping. In environments where system growth is frequent, however, such data

reorganizations can impose a large overhead.

A storage system may also be required to provide different response time or throughput

guarantees to different applications. The choice between narrow and wide striping in such

a case would depend on the Quality of Service (QoS) control mechanisms that are available

in the storage system. For example, if appropriate QoS-aware disk scheduling mechanisms

exist in the storage system [54], then it may be desirable to do wide striping. If no QoS

control mechanisms exist, a system can either isolate stores using narrow striping, or group

stores with similar QoS requirements, partition the system based on storage requirements

of each group, and wide-stripe each group within the partition.

A final issue is system reliability. In narrow striping, when multiple disks fail on a

RAID array, only stores mapped onto that array are rendered unavailable. In contrast, all

stores are impacted by the failure of any one RAID array in wide striping. The overall


choice between wide and narrow striping will be dictated by a combination of the above

factors.

2.4 Related Work

The design of self-managing storage systems was pioneered by [7, 10, 9, 39], where

techniques for automatically determining storage system configuration were studied. This

work determines: (1) the number and types of storage systems that are necessary to support

a given workload, (2) the RAID levels for the various objects, and (3) the placement of the

objects on the various arrays. The placement technique is based on narrow striping. It

exploits access correlation between streams, and collocates bandwidth-bound and space-

bound objects to determine an efficient placement. The focus of our work is different; we

assume that the number of storage arrays as well as the RAID levels are predetermined and

study the suitability of wide and narrow striping for storage systems.

Analytical and empirical techniques for determining file-specific stripe unit, placing

files on disk arrays, and cooling hot-spots have been studied in [20, 21, 37, 50]. Our work addresses a related but largely orthogonal question of the benefits of wide and narrow

striping for storage systems.

While much of the research literature has implicitly assumed narrow striping, at least

one database vendor has recently advocated wide striping due to its inherent simplicity [38].

A cursory evaluation of wide striping combined with mirroring, referred to as Stripe and

Mirror Everything Everywhere (SAME), has been presented in [1]; the work uses a simple

storage system configuration to demonstrate that wide striping can perform comparably to

narrow striping. To the best of our knowledge, ours is the first work that systematically

evaluates the tradeoffs of wide and narrow striping.


2.5 Concluding Remarks

Storage management cost is a significant fraction of the total cost of ownership of enter-

prise systems. Consequently, software automation of common storage management tasks

so as to reduce the total cost of ownership is an active area of research. In this chapter,

we focused on the problem of storage space allocation. We studied two fundamentally

different storage allocation techniques: narrow and wide striping. Whereas wide striping

techniques need very little workload information for making placement decisions, narrow

striping techniques employ detailed information about the workload to optimize the place-

ment and achieve better performance. We systematically evaluated this trade-off between

simplicity and performance. Using synthetic and real I/O workloads, we found that an ide-

alized narrow striped system can outperform a comparable wide-striped system for small

requests. However, wide striping outperforms narrow striped systems in the presence of

workload skews that occur in real systems; the two systems perform comparably for a va-

riety of other real-world scenarios. Our experiments demonstrate that the additional work-

load information needed by narrow placement techniques may not necessarily translate to

better performance. Based on our results, we advocate narrow striping only when (i) the

workload can be characterized precisely a priori, and (ii) it is feasible to use data migration

to handle workload skews and workload interference. In general, we argue for simplicity

and recommend that (i) storage systems use wide striping for object placement, and (ii) suf-

ficient information be specified at storage allocation time to enable appropriate selection of

the stripe unit size.


CHAPTER 3

SELF-MANAGING BANDWIDTH ALLOCATION IN A MULTIMEDIA FILE SERVER

In Chapter 2 we evaluated different placement techniques to determine their suitability for a self-managing storage system. Some management tasks, however, require attention on a continual basis. In this chapter, we focus on automating one such short-term reconfig-

uration task, namely bandwidth allocation.

Placement of data objects is the first task faced by a storage system administrator. Large

scale storage systems host data objects of multiple types which are accessed by applica-

tions with diverse service requirements. By partitioning disk bandwidth between appli-

cation classes one can (i) align the service provided with the application requirements,

and (ii) protect application classes from one another. A number of rate-based schedulers

that support class-based bandwidth reservations have been proposed [13, 42, 43, 54, 60]. However, since the workload changes dynamically, a static reservation may not be appropriate. For better application performance, it is desirable that the bandwidth be dynamically reallocated to the various classes as the workload changes. In this chapter, we focus on the specific

problem of self-managing bandwidth allocation in a multimedia file-server. By a multime-

dia file-server, we mean one that services a heterogeneous mix of conventional best-effort

and soft real-time streaming media workloads (as opposed to continuous media servers

that solely service streaming media workloads). By self-managing bandwidth allocation,

we mean techniques to monitor the file server workload and dynamically reallocate band-

width to various classes for improved application performance. We develop and evaluate

a measurement-based inference technique to address the problem. Note that since such a

technique requires continual workload monitoring and bandwidth reallocation, it classifies as a short-term reconfiguration task.

[Figure 3.1. Three techniques for supporting multiple application classes at a file server: (a) best-effort service; (b) mutually-exclusive storage; (c) reservations.]

3.1 Self-Managing Bandwidth Allocation: Problem Definition

Consider a file server that services both streaming media and traditional best-effort requests. Most modern file servers belong to this category—they service requests for a mix of streaming media, image, and textual data (as anecdotal evidence, consider users who store MP3 audio files and digital images along with traditional textual/numeric documents in their home directories). The workload serviced by such a file server can be broadly classified into two categories: best-effort and soft real-time. The best-effort class comprises requests for traditional text/numeric and image data. Applications in this class need low average response times or high aggregate throughput, but do not require any performance guarantees. In contrast, the soft real-time class comprises requests for streaming media data; applications in this class impose deadlines that must be met but can tolerate an occasional violation of these deadlines. Since the two classes have different characteristics and performance requirements, modern file servers must address the challenge of reconciling this heterogeneity.


A file server can employ one of three different techniques for managing these two classes (see Figure 3.1).

• Best-effort service: In the simplest case, the file server does not employ any specialized techniques for managing the two classes and provides a simple best-effort service to both textual and streaming media requests. In such a scenario, the performance requirements of soft real-time requests can be met only by over-engineering the capacity of the server and running the server at low utilization levels. Since file server workloads are often bursty [27], performance guarantees of real-time requests are violated if a transient increase in the workload causes saturation. Another limitation is that requests from the two classes can interfere with one another—a burst of real-time requests can starve best-effort requests and vice versa. Due to these limitations, the overall utility of this approach to streaming media applications is often

unsatisfactory.

• Mutually exclusive storage: An alternate approach is to store files from the two application classes on a mutually exclusive set of disks. Such a static partitioning of storage resources precludes the possibility of interference between the two classes. Moreover, guarantees of soft real-time requests can be met by employing simple admission control algorithms. Although conceptually simple, this approach has certain limitations. In particular, this approach is feasible only so long as the placement of files on disks can be carefully controlled (to ensure mutually exclusive storage of files). Unless the mapping of files to disks can be transparently handled by the file system, placing restrictions on end-users that dictate where to store each type of file is cumbersome, since users are used to the simplicity of creating and grouping arbitrary files in their directories. A more serious problem is that of performance—studies have shown that the static partitioning of storage space and disk bandwidth required by this approach results in up to a factor of six loss in performance (due to the lack of statistical multiplexing) [53].


• Reservation-based approach: A third approach is to share storage space among the two classes but reserve a certain fraction of the bandwidth on each disk for each class (i.e., store files from both classes on all the disks but reserve disk bandwidth for each class). By sharing storage resources, the file server can extract statistical multiplexing gains; by reserving bandwidth, it can prevent interference among classes and meet the performance guarantees of the soft real-time class. Thus, a reservation-based approach overcomes the limitations of the previous two approaches. Let $R_{rt}$ denote the fraction of the bandwidth reserved for the soft real-time class; the remaining fraction $R_{be} = 1 - R_{rt}$ is used (reserved) for the best-effort class. The challenge in designing a reservation-based approach lies in determining an appropriate partitioning $R_{rt}$ and $R_{be}$ such that both classes see acceptable performance (i.e., meet the deadlines of real-time requests while providing low average response times for best-effort requests). Modern file systems such as SGI's XFS [31] and IBM's Tiger Shark [29] support the notion of reservations. XFS, for instance, does so using its guaranteed-rate I/O feature [31].

Due to the inherent advantages and flexibility of the reservation-based approach, we

assume a file server that supports bandwidth reservations for each class.

There are several approaches for determining the aggregate bandwidth reservation for each class. In the simplest case, the partitioning of bandwidth among the two classes can be done manually. This can be done using past observations or future estimates of the load to determine the long-term usage in each class. Whereas this approach is feasible on the time-scale of days, short-term variations on the time-scale of tens of minutes or hours cannot be handled by the approach (since this would involve frequent manual intervention). Further, since the partitioning must be recomputed every so often to account for long-term variations in the load within each class, the possibility of human error can not be completely

eliminated.


An alternate approach is to automate the monitoring of the workload within each class and dynamically partition the bandwidth among the two classes. We refer to such an approach as self-managing bandwidth allocation. By actively monitoring the load, the approach can react to workload changes on the time scale of minutes or hours. Furthermore, the approach can also handle transient overloads in the system and ensure stable overload behavior. A limitation of the approach, however, is that it increases the complexity of the file server. The design of such a self-managing bandwidth allocator involves two key challenges: (i) the design of efficient workload monitoring techniques that have a minimal impact on overall system performance, and (ii) the design of adaptive techniques that use

past workload statistics to dynamically determine bandwidth allocation for the two classes.

We first address the simpler problem of self-managing bandwidth allocation in a single

disk file server and then use these insights to design a self-managing bandwidth allocator

for multi-disk servers.

3.2 Self-Managing Bandwidth Allocation in a Single Disk Server

In this section, we first present the system model assumed in our research. We then out-

line the requirements that must be met by a self-managing bandwidth allocator and finally

present an outline of the bandwidth allocation technique that meets these requirements.

3.2.1 System Model

Consider a single disk file server that services two classes of applications—best-effort and soft real-time. Let us assume that the server reserves a certain fraction of the disk bandwidth for each application class. Let $R_{be}$ and $R_{rt}$ denote the reserved fractions, respectively, with $0 \le R_{be}, R_{rt} \le 1$ and $R_{be} = 1 - R_{rt}$. Given the reservations $R_{be}$ and $R_{rt}$, we assume that the file server employs a disk scheduling algorithm that can enforce these allocations. A number of rate-based schedulers that support class-based bandwidth reservation have been proposed [13, 42, 43, 54, 60]. Any such scheduler is suitable for our purpose


(since our bandwidth allocator does not make any specific assumptions about the schedul-

ing algorithm). It is possible that the scheduler may itself further partition the bandwidth

allocated to a class among individual applications. We are only concerned about the ag-

gregate bandwidth needs of each class; the partitioning of this aggregate among individual

applications is an orthogonal issue.

3.2.2 Requirements

Assuming the above system model, consider a bandwidth allocation technique that dynamically determines the fractions $R_{be}$ and $R_{rt}$ based on the load in each class. Such a self-managing bandwidth allocator should meet four key requirements.

• Time-scale of allocation and monitoring: Depending on the environment, bandwidth allocation can be performed on time-scales ranging from a few minutes to tens of hours. Allocating bandwidth on (small) time-scales of minutes allows the server to respond to short-term variations in the load but can result in frequent fluctuations in the allocations. In contrast, allocating bandwidth on large time-scales (e.g., hours or days) allows the server to focus on long-term trends in the workload while effectively ignoring short-term variations. Depending on the environment, small time-scale or large time-scale allocation or both may be necessary. A bandwidth allocator should allow a server administrator to specify the time-scale(s) of interest and recompute allocations based on this specification.

• Control over allocations: In addition to control over the time-scale of allocations, the bandwidth allocator should allow control over the allocation itself. Allocating bandwidth solely based on past usage can be problematic. For instance, if applications in a certain class are idle, the allocation of that class can shrink to zero, resulting in starvation for future applications. To avoid such situations, the bandwidth allocator should permit the server administrator to specify constraints on the allocations. This could be done, for instance, by specifying a set of rules that govern the actual allocations.


[Figure 3.2. A moving histogram: measurements of the monitored parameter are recorded every I time units over a sliding window of size W.]

• Stable overload behavior: A bandwidth allocator should exhibit stable behavior

even in the presence of transient overloads. Since the capacity of the server is ex-

ceeded during an overload, bandwidth allocation by itself can not remedy the situa-

tion. However, the allocator can (and should) make intelligent allocation decisions

that prevent unstable system behavior during overloads.� Exploit the semantics of each class:Requests within the best-effort class desire

low average response times, while those within the real-time class have associated

deadlines that must be met. Since the two classes have different performance re-

quirements, the allocator should exploit the semantics of each class and use different

criteria to allocate bandwidth to these classes. This can beachieved, for instance, by

using the average load to determine the allocation of the best-effort class and the tail

of the load distribution to determine the allocation of the real-time class.

Next we present our workload monitoring module and our adaptive bandwidth manager

that meet these requirements.

3.2.3 Monitoring the Workload in the Two Classes

The workload monitoring module tracks several parameters (listed below) that are rep-

resentative of the load within each class; the bandwidth manager then uses these parameters

to compute the allocation of each class. For each such parameter, the monitoring module

computes a probability distribution using the concept of a moving histogram. A moving his-


togram is simply a histogram computed over a moving time window. A moving histogram is characterized by two parameters: the window size $W$ and the measurement interval $I$ (see Figure 3.2). The window size $W$ determines the interval of time over which the histogram is computed. Data values are recorded into the histogram every $I$ time units. Thus, the parameter of interest is monitored over the measurement interval $I$ and the mean value of that parameter over that interval is recorded into the histogram. The least recent value is then dropped from the histogram, effectively sliding the window by $I$ time units. Thus, each histogram has $\lfloor W/I \rfloor$ data samples. By carefully choosing $W$ and $I$, it is possible to exercise control over the time-scale over which the load is monitored.
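A minimal Python sketch of such a moving histogram is shown below; the class name and interface are illustrative assumptions rather than the actual monitoring module.

    from collections import deque
    import statistics

    class MovingHistogram:
        """Keep the most recent floor(W/I) per-interval measurements of a parameter."""

        def __init__(self, window_w, interval_i):
            self.capacity = int(window_w // interval_i)   # floor(W / I) samples
            self.samples = deque(maxlen=self.capacity)    # oldest sample drops automatically

        def record(self, interval_values):
            # Record the mean value of the parameter observed during one interval I.
            if interval_values:
                self.samples.append(statistics.mean(interval_values))

        def median(self):
            return statistics.median(self.samples) if self.samples else 0.0

        def percentile(self, p):
            # p-th percentile of the recorded samples (e.g., p = 90 for the real-time class).
            if not self.samples:
                return 0.0
            ordered = sorted(self.samples)
            idx = min(len(ordered) - 1, round(p / 100.0 * (len(ordered) - 1)))
            return ordered[idx]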

The monitoring module tracks various aspects of resource usage from the time a request arrives to the time it is serviced by the disk. Monitored parameters include request arrival rates, request waiting times, and disk utilizations within each class (see Figure 3.3).

• Request arrival rates: Over each interval $I$, the module monitors the number of request arrivals in each class (denoted by $N_{be}$ and $N_{rt}$) and the request sizes ($S_{be}$ and $S_{rt}$). The number of arrivals and the mean request size in that interval are then recorded into moving histograms.

• Request waiting times: Rather than monitoring the actual request waiting times, our monitoring module uses queue lengths as an indicator of the time each request waits in the system before it is serviced—the larger the queue of outstanding requests, the greater the waiting time. This is achieved by recording the instantaneous queue lengths of the two classes (denoted by $q_{be}$ and $q_{rt}$) at the end of each interval $I$.

• Disk utilizations: The module uses the disk utilizations as a measure of the actual bandwidth consumed by each class. The utilization of a class is defined to be the fraction of the time spent by the disk in servicing requests from that class. It is computed as $U_{be} = \frac{\sum_j \tau^j_{be}}{I}$ and $U_{rt} = \frac{\sum_j \tau^j_{rt}}{I}$, where $\tau^j_{be}$ and $\tau^j_{rt}$ denote the time spent by the disk in servicing an individual best-effort and soft real-time request, respectively.

The utilizations within each class are then recorded into moving histograms at the end of each interval $I$.

[Figure 3.3. Parameters tracked by the monitoring module: the number of requests (N) and request sizes (S) at request arrival, the instantaneous queue lengths (q) while requests wait, and the disk utilizations (U) at request service.]

3.2.4 Adapting the Allocation of Each Class

The bandwidth manager uses the histograms computed by the monitoring module to periodically recompute the bandwidth allocation (reservation) of each class. The manager provides control over the time-scale of allocation using a parameter $P$ that defines the period of these recomputations. Recall that the monitoring module uses a window size $W$ for each moving histogram. In general, the recomputation period $P$ can be smaller or larger than $W$. If allocations are recomputed more frequently than $W$ (i.e., $P < W$), then some measurements used in the previous computations are reused to compute the new allocations (since those measurements would still be contained in the window $W$ of the histogram). In contrast, if $P > W$, then some load measurements are never taken into account for computing the allocations. Consequently, using $P = W$ is a good rule of thumb to ensure a responsive file server. In the rest of this chapter, we assume $P = W$.


The bandwidth manager uses a rule-based system to provide control over the allocation

to each class. Such a rule-based system supports a set of user-defined rules that govern

these allocations. Our bandwidth manager currently supports rules that specify upper and

lower bounds for each class. That is, a server administrator can specify bounds (denoted by $[R^{min}_{be}, R^{max}_{be}]$ and $[R^{min}_{rt}, R^{max}_{rt}]$) on the bandwidth allocated to each class. Bounds on allocations are useful to prevent scenarios where a class receives either too little or too much bandwidth (without such bounds, the allocation of a class could shrink to zero if

the class is idle, causing starvation for newly arriving requests).

Given the recomputation period $P$ and bounds on the allocation of each class, the band-

width manager estimates the bandwidth needs of each class using two metrics: (i) disk

utilizations and (ii) request arrival rates.

3.2.4.1 Estimating Bandwidth Requirement based on Disk Utilizations

The bandwidth manager uses the moving histograms of the disk utilizations to estimate the bandwidth needs of each class. Since the two classes have different performance characteristics, a different metric is used to compute these estimates. In case of the best-effort class, the bandwidth manager uses the median of the utilization distribution, denoted by $Median(U_{be})$, as an estimate of the bandwidth requirement (this is because requests in this class desire low average response times). In contrast, a high percentile of the utilization, denoted by $Per(U_{rt})$, is used to estimate the requirements of the real-time class (since the tail of the distribution better reflects the needs of real-time requests). The exact percentile used to estimate the bandwidth requirements can be chosen statically or dynamically. In the latter case, the percentile could be a function of the variance in the load—the greater the variance, the higher the percentile used to estimate the bandwidth requirements. To illustrate, the percentile can be chosen as $base\_percentile + \log(C_v)$, where $C_v$ is the coefficient of variation and is computed as $C_v = \sigma(U_{rt})/E(U_{rt})$; $E$ and $\sigma$ are the mean and the standard deviation of the distribution.


After computing these utilizations, the bandwidth manager uses an exponential smoothing function to weigh the current estimate with past estimates. That is,

$Median^{*}(U_{be}) = \alpha \cdot Median(U_{be}) + (1 - \alpha) \cdot Median^{*}(U_{be})$ (3.1)

and

$Per^{*}(U_{rt}) = \alpha \cdot Per(U_{rt}) + (1 - \alpha) \cdot Per^{*}(U_{rt})$ (3.2)

where $\alpha$ is the exponential smoothing parameter, $0 \le \alpha \le 1$. A large value of $\alpha$ biases the estimates towards the immediate past measurements, whereas a small $\alpha$ reduces the contribution of recent measurements.
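The smoothing step can be sketched as follows; the smoothing parameter and percentile below are illustrative values (the experiments later in this chapter use a smoothing parameter of 0.75 and the 90th percentile), and the helper builds on the MovingHistogram sketch above.

    # Exponentially smoothed estimates of per-class bandwidth needs (Equations 3.1, 3.2).
    ALPHA = 0.75        # smoothing parameter, 0 <= alpha <= 1
    PERCENTILE = 90     # high percentile used for the soft real-time class

    def smooth(current, previous, alpha=ALPHA):
        # Weigh the current measurement against the running (smoothed) estimate.
        return alpha * current + (1.0 - alpha) * previous

    # Usage with per-class utilization histograms (names assumed for illustration):
    # median_u_be = smooth(util_be_hist.median(), median_u_be)
    # perc_u_rt   = smooth(util_rt_hist.percentile(PERCENTILE), perc_u_rt)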

3.2.4.2 Estimating Bandwidth Requirement based on the Arrival Rate

Whereas the actual disk utilization is a good indicator of the needs of each class when the disk is not saturated (no overload), a different metric is needed during periods of transient overloads. This is because the total disk utilization is always 100% during an overload and no longer reflects the relative needs of each class. Consequently, the bandwidth manager uses request arrival rates to estimate the bandwidth needs of each class during transient overloads. In general, a class with larger arrival rates should be allocated a larger proportion of the disk bandwidth. Observe that since the capacity of the disk is exceeded during an overload, no allocation can actually satisfy the total bandwidth needs of the two classes. In such a scenario, the goal of the bandwidth manager should be to ensure stable overload behavior and ensure that the allocations reflect the relative needs of the two classes.

To estimate the bandwidth needs based on arrival rates, the bandwidth manager first computes the number of requests arriving in each class and the request sizes, and uses a simple disk model to estimate the bandwidth needs. As in the case of disk utilization, exponentially smoothed values of the median and a high percentile of these distributions


are used for the best-effort and real-time classes, respectively. Thus, the bandwidth needs of the best-effort class are computed as

$B_{be} = Median^{*}(N_{be}) \cdot \left( t_{seek} + t_{rot} + \frac{Median^{*}(S_{be})}{t_{xfr}} \right)$ (3.3)

and those of the soft real-time class are computed as

$B_{rt} = Per^{*}(N_{rt}) \cdot \left( t_{seek} + t_{rot} + \frac{Per^{*}(S_{rt})}{t_{xfr}} \right)$ (3.4)

where $t_{seek}$, $t_{rot}$, and $t_{xfr}$ denote the average seek overhead, the average rotational latency, and the data transfer rate of the disk, respectively. Note that the first term in the above expressions represents the number of disk requests, while the second term represents the time to service each disk request.
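A sketch of this arrival-rate-based estimate follows. The disk parameters are the Seagate Elite figures quoted later in this chapter (11 ms seek, 5.55 ms rotational latency, 4.6 MB/s transfer rate); since the estimates are only used as relative quantities during overload, their absolute scale is unimportant.

    # Bandwidth needs from arrival rates and request sizes (Equations 3.3, 3.4).
    T_SEEK = 0.011                 # average seek overhead (s)
    T_ROT = 0.00555                # average rotational latency (s)
    T_XFR = 4.6 * 1024 * 1024      # data transfer rate (bytes/s), assuming MB = 2^20 bytes

    def bandwidth_need(num_requests, avg_request_size_bytes):
        # number of requests x per-request service time
        per_request_time = T_SEEK + T_ROT + avg_request_size_bytes / T_XFR
        return num_requests * per_request_time

    # b_be = bandwidth_need(smoothed median of N_be, smoothed median of S_be)
    # b_rt = bandwidth_need(smoothed percentile of N_rt, smoothed percentile of S_rt)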

3.2.4.3 Computing the Reservations of Each Class

The bandwidth manager begins by initializing the allocation of each class to a user-specified value ($R^{init}_{be}$ and $R^{init}_{rt}$). After each interval of $P$ time units, the bandwidth manager estimates the bandwidth needs of each class, using the techniques described above, and then computes the new allocations using the following algorithm.

• Case 1: Neither class utilizes its entire allocation. This scenario occurs when $Median^{*}(U_{be}) < R_{be}$ and $Per^{*}(U_{rt}) < R_{rt}$. Since neither class is utilizing its entire allocation, no action is necessary. Hence, the allocations of the two classes remain unchanged.

• Case 2: The best-effort class utilizes its entire allocation. This scenario occurs when $Median^{*}(U_{be}) \ge R_{be}$ and $Per^{*}(U_{rt}) < R_{rt}$. Since the best-effort class utilizes or exceeds its allocated share¹ and the real-time class is under-utilized, the bandwidth manager should increase the allocation of the best-effort class (and correspondingly decrease the allocation of the real-time class). This is achieved by setting

$R^{new}_{be} = Median^{*}(U_{be})$ (3.5)

The allocation of the real-time class is then set to $R^{new}_{rt} = 1 - R^{new}_{be}$.

• Case 3: The real-time class utilizes its entire allocation. In this scenario, $Median^{*}(U_{be}) < R_{be}$ and $Per^{*}(U_{rt}) \ge R_{rt}$. Since the load in the real-time class equals or exceeds its allocation, the allocation of this class should be increased appropriately. Consequently, the bandwidth manager sets the new allocation of the class to

$R^{new}_{rt} = Per^{*}(U_{rt})$ (3.6)

The allocation of the best-effort class is set to $R^{new}_{be} = 1 - R^{new}_{rt}$.

• Case 4: Overload. An overload is said to occur when both classes use up their entire allocations (resulting in saturation) or the queue of pending requests exceeds a threshold. That is, (i) $Median^{*}(U_{be}) \ge R_{be}$ and $Per^{*}(U_{rt}) \ge R_{rt}$; or (ii) $q_{be} \ge Q$ or $q_{rt} \ge Q$, where $Q$ is a large threshold. Since disk utilizations are not representative of the relative requirements of the two classes during an overload, the bandwidth manager uses the request arrival rates to compute the allocation of each class. Given the bandwidth estimates $B_{be}$ and $B_{rt}$ based on arrival rates, the new allocations are computed as

$R^{new}_{be} = \frac{B_{be}}{B_{be} + B_{rt}}$ (3.7)

and

$R^{new}_{rt} = \frac{B_{rt}}{B_{be} + B_{rt}}$ (3.8)

As explained earlier, the use of the relative bandwidth needs of the two classes to compute allocations results in more stable overload behavior.

¹ Depending on the scheduling algorithm, an application class might use more bandwidth than its reserved share. This happens when the other class is under-utilized and the scheduler reallocates unused bandwidth to needy applications in the first class.

The above allocations are then constrained (if necessary) using the user-specified bounds $[R^{min}_{be}, R^{max}_{be}]$ and $[R^{min}_{rt}, R^{max}_{rt}]$.

Our adaptive algorithm has the following salient features: (1) it provides control over the time-scale of monitoring and allocation via two tunable parameters, $P$ ($= W$) and $\alpha$ (in general, larger recomputation periods and smaller values of $\alpha$ bias the allocator towards long-term variations in the load), (2) it allows control over the allocation via a set of rules that constrain the allocation, (3) it employs techniques to provide stable overload behavior, and (4) it exploits the semantics of each class by using different metrics (the median and a high percentile of the distribution) to estimate bandwidth needs. Thus, the bandwidth allocator meets all of the requirements outlined in Section 3.2.2.
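The four cases can be combined into a short allocation routine, sketched below. The bounds and queue threshold are illustrative assumptions; the utilization and bandwidth estimates are the smoothed quantities defined above.

    # Sketch of the per-period reallocation (Cases 1-4); not the simulator's actual code.
    R_MIN, R_MAX = 0.1, 0.9   # assumed administrator-specified bounds
    Q = 1000                  # assumed queue-length threshold signalling overload

    def recompute_allocation(r_be, r_rt, median_u_be, perc_u_rt, b_be, b_rt, q_be, q_rt):
        overloaded = (median_u_be >= r_be and perc_u_rt >= r_rt) or q_be >= Q or q_rt >= Q
        if overloaded and (b_be + b_rt) > 0:
            r_be_new = b_be / (b_be + b_rt)          # Case 4: relative arrival-rate needs
        elif median_u_be >= r_be and perc_u_rt < r_rt:
            r_be_new = median_u_be                   # Case 2: grow the best-effort share
        elif perc_u_rt >= r_rt and median_u_be < r_be:
            r_be_new = 1.0 - perc_u_rt               # Case 3: grow the real-time share
        else:
            r_be_new = r_be                          # Case 1: leave allocations unchanged
        r_be_new = min(max(r_be_new, R_MIN), R_MAX)  # apply the user-specified bounds
        return r_be_new, 1.0 - r_be_new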

In what follows, we show how to enhance this technique to allocate bandwidth in multi-

disk servers.

3.3 Self-Managing Bandwidth Allocation in a Multi-disk Server

Due to the sheer volume of data stored on servers, modern file servers employ multiple

disks or disk arrays as their underlying storage medium. A multi-disk server can employ

one of two placement techniques to store files—each file can be mapped to a single disk or the server can employ striping to interleave the storage of a file across multiple disks. In the

former case, the load on each disk is independent of the load on remaining disks, whereas

in the latter case the load on disks are related to one another. It is trivial to extend our

self-managing bandwidth allocation technique to multi-disk servers where each file maps

onto a single disk—since the disk loads are independent, the allocator can monitor a disk

and allocate bandwidth independently of other disks. A different technique is needed when


files are striped across multiple disks or when it is desirable to treat multiple independent

disks as a single logical storage device for purposes of bandwidth allocation.

One possible approach is to monitor the load on each disk and first compute the alloca-

tion on individual disks using the algorithm described in Section 3.2.4.3. The actual allo-

cation of each class is then set to the mean allocation over all disks in the array. Whereas

such an approach results in satisfactory performance for the best-effort class, it can ad-

versely affect the performance of the real-time class. This is because the load on various

disks can be different and the use of the average load to determine the allocation of the

real-time class can affect requests accessing heavily loaded disks. An alternate approach

is to set the allocation of each class to that on the most heavily loaded disk in the system.

However, a problem with the approach is that the load on the most heavily loaded disk can

significantly differ from that on the average loaded disk and using the load on the former to

govern the allocation on the latter can cause a mismatch between the allocation and the ac-

tual load (thereby defeating the purpose of bandwidth allocation). Thus, neither approach

is satisfactory for allocating bandwidth on a disk array.

In what follows, we present a hybrid approach that takes into account the load on the heavily loaded disks as well as the average load to compute the allocations of the two classes. We use the same notation as that in the single disk case, with an additional superscript to denote a particular disk (thus $R^i_{be}$ denotes the allocation of the best-effort class on disk $i$). Based on the load parameters tracked by the monitoring module, we first compute the allocations on individual disks as follows:

$R^i_{be} = \begin{cases} Median^{*}(U^i_{be}) & \text{if no overload} \\ \frac{B^i_{be}}{B^i_{rt} + B^i_{be}} & \text{if disk } i \text{ is overloaded} \end{cases}$ (3.9)

and

$R^i_{rt} = \begin{cases} Per^{*}(U^i_{rt}) & \text{if no overload} \\ \frac{B^i_{rt}}{B^i_{rt} + B^i_{be}} & \text{if disk } i \text{ is overloaded} \end{cases}$ (3.10)


The average allocation of the best-effort class across all disks is then $R^{avg}_{be} = avg(R^1_{be}, R^2_{be}, \ldots, R^D_{be})$ and the maximum allocation of the class on any disk is $R^{max}_{be} = \max(R^1_{be}, R^2_{be}, \ldots, R^D_{be})$, where $D$ denotes the number of disks in the array. The average and the maximum allocations of the real-time class across all disks can be computed similarly. The bandwidth manager then computes the allocation of each class as a linear combination of the average and the maximum load. That is,

$R_{be} = \gamma \cdot R^{max}_{be} + (1 - \gamma) \cdot R^{avg}_{be}$ (3.11)

where the parameter $\gamma$, $0 \le \gamma \le 1$, determines the contribution of the average and the maximum load to the final allocation. Similarly, the allocation of the real-time class is

$R_{rt} = \gamma \cdot R^{max}_{rt} + (1 - \gamma) \cdot R^{avg}_{rt}$ (3.12)

Finally, since the fractions $R_{be}$ and $R_{rt}$ may not sum to 1 (due to the skew between the average and maximum loads and the parameter $\gamma$), the final allocation is normalized as follows:

$R^{new}_{be} = \frac{R_{be}}{R_{be} + R_{rt}}, \quad R^{new}_{rt} = \frac{R_{rt}}{R_{be} + R_{rt}}$ (3.13)

As in the single-disk case, the new allocations are constrained (if necessary) using the user-specified upper and lower bounds. These allocations are then used on each individual disk for the next $P$ time units.

Observe that Equations 3.11 and 3.12 are key to multi-disk bandwidth allocation—the choice of an appropriate $\gamma$ helps balance the contribution of heavily loaded disks and average loaded disks to the final allocation for each class.
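A compact sketch of this multi-disk computation is given below; the per-disk inputs are assumed to have been produced by the single-disk rules (Equations 3.9 and 3.10), and the weighting value 0.75 mirrors the setting used in the experiments later in this chapter.

    # Hybrid multi-disk allocation (Equations 3.11-3.13); illustrative sketch only.
    GAMMA = 0.75

    def multidisk_allocation(per_disk_r_be, per_disk_r_rt, gamma=GAMMA):
        avg = lambda xs: sum(xs) / len(xs)
        r_be = gamma * max(per_disk_r_be) + (1 - gamma) * avg(per_disk_r_be)
        r_rt = gamma * max(per_disk_r_rt) + (1 - gamma) * avg(per_disk_r_rt)
        total = r_be + r_rt
        # Normalize so that the two fractions sum to one (Equation 3.13).
        return r_be / total, r_rt / total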


3.4 Experimental Methodology

We evaluate the efficacy of our self-managing bandwidth allocator using a simulation

study. In what follows, we describe the simulation environment and the workload charac-

teristics used in our experiments and then describe our experimental results.

3.4.1 Simulation Environment

We used an event-based disk simulator to evaluate our bandwidth allocation technique.

Our simulator can simulate both single-disk and multi-disk servers. In either case, we assume that the server supports two application classes—best-effort and soft real-time. Requests from these classes are assumed to be serviced using the Cello disk scheduling

algorithm [54]. The Cello disk scheduler supports reservations for each class and uses

class-specific policies to service requests in the two classes; the SCAN policy is used to

service best-effort requests, while SCAN-EDF is used to service real-time requests with

deadlines. Note that any other disk scheduler that supports class-specific reservations can be used in conjunction with our bandwidth allocator without significantly affecting our

results. The file server is assumed to use one or more Seagate Elite-3 disks to store files

from the two application classes.2 The block size used for storing text files is assumed to be

4KB, while that for the video files is 64KB. In case of disk-arrays (i.e., a multi-disk server),

all files are assumed to be striped across disks in the array.

The workload monitoring module employed by the simulator tracks various load pa-

rameters as described in Section 3.2.3. The moving histograms computed by the module

are then used by the bandwidth manager to compute the allocation for each class (as described in Sections 3.2.4 and 3.3). The allocation of each class is assumed to be initialized to $R^{init}_{be} = R^{init}_{rt} = 0.5$ at the beginning of each simulation experiment.

2 The Seagate Elite disk has an average seek overhead of 11 ms, an average rotational latency of 5.55 ms, and a data transfer rate of 4.6 MB/s.


Number of read/write operations           218724
Average bit rate (original)               218.64 KB/s
Average bit rate (with 64MB cache)        83.91 KB/s
Average inter-arrival (original)          9.14 ms
Average inter-arrival (with 64MB cache)   22.53 ms
Average request size                      2048.22 bytes
Peak to average bit rate (1s intervals)   12.51

Table 3.1. Characteristics of the Auspex NFS trace

3.4.2 Workload Characteristics

We use two types of workloads in our experiments: trace-driven and synthetic. Our

trace workloads have been gathered from a real file-server and enable us to determine the

efficacy of our methods for real-world scenarios. However, since a trace workload only

represents a small subset of the operating region of a file server, we use synthetic work-

loads to systematically explore the state space. Next we describe the characteristics of the

workloads used in our experiments.

3.4.2.1 Best-effort Text Clients

We used portions of a NFS trace gathered from an Auspex file server at Berkeley to

generate the trace-driven text workload [23]. The characteristics of these workloads are

shown in Table 3.1. We assumed a 64MB LRU buffer cache at the server and filtered out

requests resulting in cache hits from the original trace; the remaining requests are assumed

to result in disk accesses. Figure 3.4 illustrates the characteristics of the resulting workload.

As shown in the figure, the text workload is very bursty; the peak to average bit rate of the

trace was measured to be 12.5.

To systematically explore the state space, we also use a synthetically generated text

workload. Each text client in the synthetic workload is assumed to be sequential or random.

The simulator allows control over the fractions $f$ and $1 - f$ of sequential and random text

clients in the workload. Clients are assumed to arrive and depart at random time instants.


[Figure 3.4. Bursty nature of the NFS trace workload (bytes/s over time).]

Inter-arrival times of clients are assumed to be exponentially distributed. Upon arrival,

each client is assumed to access a random file and file sizes (and hence, client life times)

are assumed to be heavy-tailed with a Pareto distribution. These assumptions, namely

exponential interarrivals and Pareto file sizes, are consistent with studies of real-world text

clients [16, 27].
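The synthetic text-client arrival process can be sketched as follows; the mean inter-arrival time, Pareto shape, and size scale are illustrative assumptions rather than the exact simulator settings.

    import random

    # Exponential inter-arrivals and heavy-tailed (Pareto) file sizes for synthetic text clients.
    MEAN_INTERARRIVAL = 10.0       # seconds between client arrivals (assumed)
    PARETO_SHAPE = 1.2             # heavy-tailed shape parameter (assumed)
    FILE_SIZE_SCALE = 256 * 1024   # scale factor for Pareto file sizes, in bytes (assumed)

    def generate_text_clients(num_clients, seq_fraction=0.5):
        t = 0.0
        clients = []
        for _ in range(num_clients):
            t += random.expovariate(1.0 / MEAN_INTERARRIVAL)              # arrival instant
            size = FILE_SIZE_SCALE * random.paretovariate(PARETO_SHAPE)   # file size (lifetime proxy)
            access = "sequential" if random.random() < seq_fraction else "random"
            clients.append({"arrival": t, "file_size": size, "access": access})
        return clients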

3.4.2.2 Soft Real-time Video Clients

Each video client in our simulator emulates a video player and reads a randomly se-

lected video file at a constant frame rate (e.g., 30 frames/s). Depending on the compression

algorithm, the selected video file may have a constant or a variable bit rate. Table 3.2 lists

the characteristics of video files used in our simulations. As shown in the table, we use

a mix of high bit-rate MPEG-1 files and low bit-rate MPEG-4 files. Since much of the

existing online streaming media content is low bit-rate (e.g., WindowsMedia, RealMedia),

this allows us to experiment with existing workloads as well as future higher bit-rate work-

loads. All video clients are assumed to be serviced in the server-push (streaming) mode.

The server services these clients in periodic rounds by retrieving a fixed number of frames

in each round. Disk requests for all active video clients are issued at the beginning of each


File                   Type     Length (frames)   Bit rate
Frasier                MPEG-1   5960              1.49 Mb/s
Newscast               MPEG-1   9000              2.33 Mb/s
Silence of the Lambs   MPEG-4   89998             107 Kb/s

Table 3.2. Characteristics of Video traces

round and have the end of that round as their deadlines. The round duration was set to

1000ms in our simulations.

We used observations from a recent study of an actual streaming media workload [22]

to simulate the arrival process for video clients (since the traces used in that study are not

publicly available, we couldn’t use the trace itself). Video clients are assumed to arrive and

depart at random instants. Inter-arrival times are exponential, the object popularity is Zipf,

and the client life-times are heavy-tailed. We assumed no correlation between object sizes

and object popularity, consistent with observations made in recent studies [16].

3.5 Experimental Evaluation

In what follows, we present the results of our experimental evaluation using the trace

and synthetic workloads described in the previous section.

3.5.1 Ability to Adapt to Changing Workloads

In this experiment, we show how our bandwidth allocation technique can adapt to

changing workloads. We assume a single disk server and construct a workload scenario

that exercises all four cases of the allocation algorithm listed in Section 3.2.4.3. To do so,

we assume synthetic text and video clients that arrive and depart at random instants. Text

clients are assumed to be sequential and access 10KB of the file every 250ms. Each video

client is assumed to access an MPEG-1 file. The window size $W$ and the recomputation period $P$ were set to 100 seconds, the measurement interval $I$ was 1 s, and the smoothing


parameter $\alpha$ was 0.75. The percentile used for estimating the needs of the real-time class was set to 90.

Figure 3.5(a) depicts the variation in the number of text and video clients over the duration of the experiment (note that the figure only denotes the number of clients in each class, not their aggregate bandwidth requirements). We start with a small number of text and video clients at $t = 0$. At $t = 500$, there is a sudden burst of new video client arrivals (triggering case 3 in Section 3.2.4.3). The video burst subsides at $t = 1500$ and a burst of text clients occurs at $t = 2000$ (case 2). At $t = 4000$, there is a simultaneous burst of text and video requests, resulting in transient overload at the server (case 4).

Figure 3.5(b) shows the allocations of the two classes for this workload, while Figures 3.5(c) and (d) plot the utilization of each class with the corresponding allocations. As shown in Figure 3.5(b), the allocation of the real-time class increases at $t = 500$ due to the video burst, while that of the best-effort class increases at $t = 2000$ due to the text burst. At $t = 4000$, the server experiences an overload and the bandwidth manager uses the request arrival rates to determine the allocations. Moreover, Figure 3.5(c) shows that the allocation of the real-time class is always a high percentile of the load (evident from the relative values of the allocation and the utilization), whereas Figure 3.5(d) shows that the allocation of the best-effort class is the median value of the utilization. Finally, observe that in the periods $1500 \le t \le 2000$ and $3000 \le t \le 3500$, neither class utilizes its allocated share, causing the allocations to remain unchanged (case 1).

3.5.2 Bandwidth Allocation in a Single-disk Server

In this experiment, we demonstrate the efficacy of our approach for a single disk server.

Whereas we performed experiments with both trace and synthetic workloads, due to space

constraints we present our results only for trace workloads.


[Figure 3.5. Adaptive allocation of disk bandwidth: (a) workload (number of clients in each application class over time); (b) allocations; (c) utilization of the real-time class; (d) utilization of the best-effort class.]

[Figure 3.6. Bandwidth allocation in a single-disk server: (a) bandwidth allocations; (b) utilization of the best-effort class.]


Our experiment uses NFS traces (with a scale factor of 3) to generate a bursty text

workload3, while keeping the video load fixed over the duration of each simulation run.

We repeated the experiment for background video loads ranging from 1 to 10 simultaneous

MPEG clients. This enabled us to study the impact of a bursty text load with varying

background video loads. Each run simulates 2.8 hours of the workload on the file server.

Note also that while the number of video clients is fixed for each simulation run, each client

may impose a varying load due to the variable bit rate nature of video files.

Figures 3.6(a) and (b) plot the allocation of the two classes and the utilization of the best-effort class for one such combination (namely, the NFS workload with 7 background video clients). As shown in Figure 3.6(b), the allocation of the best-effort class closely

matches the disk utilization of that class, thereby demonstrating the effectiveness of the

bandwidth allocator.

3.5.3 Bandwidth Allocation in a Multi-disk Server

In this experiment, we demonstrate the efficacy of our approach for a multi-disk server. As in the single-disk case, we conducted experiments with both trace and synthetic workloads. Due to space constraints, we only present our results for synthetic workloads.

We assumed a multi-disk server with eight disks. Both text and video files are assumed to be striped across all disks in the array. The parameter that determines the contribution of the maximum load and the average load across disks was chosen to be 0.75. As in the single-disk case, we chose W = P = 100s and I = 1s.

The inter-arrival times of text clients were exponentially distributed with a mean of 10s, and the lifetimes of these clients were heavy-tailed with a mean of 4 minutes. Half of the text clients were sequential and the other half random. Inter-arrival times of video clients were also exponential with a mean of 1 minute, with heavy-tailed lifetimes with a mean of 4 minutes.

(Footnote: The scale factor scales the inter-arrival times of requests and allows control over the burstiness of the workload.)


Figure 3.7. Bandwidth allocation in a multi-disk server. (a) Allocations; (b) Maximum utilization of the real-time class across disks; (c) Mean utilization of the real-time class across disks (each shown with the corresponding class allocation).

The popularity of video files was Zipf with a parameter of 0.47 [22]. These parameters were chosen such that the text load was mostly stable, while the video load steadily increased over the duration of the experiment, eventually resulting in an overload.

Figure 3.7(a) shows the allocation of the two classes as computed by our multi-disk bandwidth allocator. Figures 3.7(b) and (c) plot the maximum utilization of the soft real-time class on any disk and the mean utilization across all disks, respectively (along with the corresponding allocations). As expected, we see that the allocation of the soft real-time class increases steadily with the load. Eventually, some of the disks in the array experience an overload and our allocator uses request arrival rates to compute the allocations. Note


also that, since we chose this parameter to be 0.75, the allocation on the average disk is slightly larger than the utilization on that disk.

3.5.4 Impact of Tunable Parameters

In this section, we show how tunable parameters such as the recomputation period P (= W) and the smoothing parameter α can be used to control the time-scale of bandwidth allocations.

The video load for this experiment was kept fixed over the duration of the simulation. The text load is initially steady for the first 2200 seconds and a burst occurs between 2200 ≤ t ≤ 2800 (the burst is characterized by a sharp increase in the number of text clients followed by a sharp decrease). Figure 3.8(a) plots this variation in the text load.

We varied α from 0.25 to 1 and computed the allocations of the best-effort class. In general, a large value of α causes the bandwidth manager to maintain less history and biases the allocations towards more recent measurements. This allows the server to react to small variations in the load. In contrast, small values of α smooth out recent variations in the load, making the server less sensitive to recent load changes. Figure 3.8(b) demonstrates this behavior for different values of α. As shown in the figure, when α = 1 the bandwidth manager quickly increases the allocation of the best-effort class to match the increase in utilization due to the burst. The increase in allocation is slower for smaller values of α. For instance, when α = 0.25 the allocation increases slowly to 60% and does not increase further since the burst subsides quickly.
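The exact statistic that the bandwidth manager smooths is defined earlier in the chapter; the behavior described above is consistent with a standard exponentially weighted average of the measured load, sketched below purely for illustration (the function name and numbers are ours, not from the thesis):

    def smooth(history, measurement, alpha):
        # alpha close to 1 tracks the latest measurement; small alpha damps short bursts
        return alpha * measurement + (1.0 - alpha) * history

    estimate = 0.3                             # previous smoothed estimate of the class load
    for sample in (0.3, 0.7, 0.7, 0.3):        # a short burst in the measured load
        estimate = smooth(estimate, sample, alpha=0.25)
    print(round(estimate, 3))                  # 0.431: the burst is only partially reflected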

Next, we varied P and studied its effect on the allocation. Figure 3.8(c) depicts the allocation of the best-effort class for different values of P. A larger recomputation period allows the bandwidth manager to focus on long-term trends and ignore short-term variations, while a smaller recomputation period enables the server to respond to short-term variations. Figure 3.8(c) demonstrates this behavior. When P = 100, the allocation of the best-effort class quickly increases to match the increase in the load. In contrast, when P = 500 the time-scale of interest becomes larger than the duration of the burst and consequently the bandwidth manager ignores the burst altogether and keeps the allocation unchanged.

Together, these experiments demonstrate how these tunable parameters can be used to control the granularity of bandwidth allocation and the sensitivity to load fluctuations.

Figure 3.8. Effect of various tunable parameters on the granularity of bandwidth allocations. (a) Utilization (text workload); (b) Effect of α; (c) Effect of P.

3.5.5 Comparison with Static Allocation

Finally, we demonstrate the advantages of our dynamic allocation technique over static bandwidth allocation. We initialize the allocation of the two classes to 50% of the total disk bandwidth. Whereas the allocation remains fixed for static partitioning, it varies with


the load for dynamic allocation. We examine a scenario where the server experiences a transient overload due to a burst in the real-time class and measure the queue length of the real-time requests. Since the allocation remains fixed in the former scenario, the server is unable to respond to an overload, causing the queue of real-time requests to grow quickly. In contrast, our bandwidth allocation technique uses request arrival rates to determine the allocation of each class and allocates a larger bandwidth to the real-time class. This enables the server to exhibit more stable behavior during an overload, resulting in a more graceful increase in the queue length (the average queue length is also 59% smaller). We repeat the experiment with a steady video load and a burst in the best-effort class. Again, the server is unable to respond to the burst in the case of static allocation, whereas our dynamic allocator allocates a larger bandwidth to the best-effort class, resulting in significantly better response times. Figures 3.9(a) and (b) demonstrate this behavior.

Dynamic bandwidth allocation can also be advantageous when the server employs admission control for the real-time class. If the server were to employ static bandwidth allocation, then the admission controller would only admit as many clients as the allocation of the real-time class permits; additional real-time clients would be rejected from the system even when the best-effort class is not using its entire allocation (i.e., the system has spare capacity). In contrast, dynamic bandwidth allocation allows the server to gradually increase the allocation of the real-time class based on its usage, thereby allowing the admission controller to admit additional clients. This results in a more judicious use of system resources. We compared static allocation to our dynamic allocation technique in the presence of admission control in the real-time class. Our experiment consisted of a fixed text load and a video arrival every 500s. The initial allocation of the two classes was 50%. As shown in Figure 3.9(c), dynamic bandwidth allocation permits additional clients to be admitted into the system so long as there is unused bandwidth in the best-effort class. Together, these experiments demonstrate the benefits of dynamic bandwidth allocation over static allocation.


Figure 3.9. Comparison with static partitioning. (a) Queue lengths of the real-time class; (b) Text response times; (c) Impact of admission control (number of admitted video clients over time, for dynamic and static allocation).


3.6 Related Work

A number of recent and ongoing research efforts have focused on the design of self-managing systems [34, 49]. The IStore project, for instance, investigated the design of workload monitoring and adaptive resource management techniques for data-intensive network services [17]. Unlike their focus on data-intensive network applications, the focus of our work is on mixed (best-effort and streaming media) workloads. The VINO project has investigated the design of self-managing techniques for various OS tasks such as paging, interrupt latency and disk waits [52]. Research on storage systems at HP Labs has also investigated various issues in self-managing systems such as self-configuration (Minerva [8]), capacity planning [14] and goal-based storage management [8]. Finally, a number of predictable disk scheduling algorithms have been proposed [13, 42, 43, 54, 60]. As indicated earlier, these efforts are complementary to our effort, since our bandwidth allocator can coexist with any such scheduler.

3.7 Concluding Remarks

In this chapter, we focused on the problem of self-managing bandwidth allocation to improve the manageability of modern file servers. We presented two techniques for dynamic bandwidth allocation: one for single-disk servers and the other for servers employing multiple disks or disk arrays. Both techniques consist of two components: a workload monitoring module that efficiently monitors the load in each application class, and a bandwidth manager that uses these workload statistics to dynamically determine the allocation of each class. We have evaluated the efficacy of our techniques via a simulation study using synthetic and trace workloads [56]. Our results show that these techniques (i) provide control over the time-scale of allocation via tunable parameters, (ii) have stable behavior during overload, and (iii) provide significant advantages over static bandwidth allocation.


CHAPTER 4

LEARNING-BASED APPROACH FOR DYNAMIC BANDWIDTH ALLOCATION

In the previous chapter we looked at the problem of dynamic bandwidth allocation in

the context of multimedia servers. We assumed that the workload is classified into two

application classes, namely soft real-time and best-effort, based on the data type. Multiple

data types is just one aspect of the problem; one could also have a large storage system

hosting multiple application classes with different performance requirements.

In this chapter, we assume that the storage system is accessed by applications that can be categorized into different classes; each class is assumed to impose a certain QoS requirement in the form of a response time requirement. The workload seen by an application class varies over time, and we address the problem of how to allocate storage bandwidth to classes in the presence of varying workloads so that their QoS needs are met. We use a learning-based approach to address the problem. In the next section, we present the system model, outline the key requirements of the bandwidth allocation technique, and then describe the problem in further detail.

4.1 Problem Definition

4.1.1 Background and System Model

An enterprise storage system consists of a large number of disks that are organized into

disk arrays. A disk array is a collection of physical disks that presents an abstraction of

a single large logical storage device to the rest of the system; we refer to this abstraction

as a logical unit (LU). An application, such as a database or a file system, is allocated


storage space by concatenating space from one or more logical units; the concatenated storage space is referred to as a logical volume (LV). Figure 4.1 illustrates the mapping from logical volumes to logical units.

We assume that the workload accessing each logical volume can be partitioned into application classes. This grouping can be determined based on the files accessed by requests in each class or the QoS requirements of these requests. Each application class is assumed to have a certain response time requirement. Application classes compete for storage bandwidth, and the bandwidth allocated to a class governs the response time of its requests.

To enable such allocations, each disk in the system is assumed to employ a QoS-aware disk scheduler (such as [13, 54, 60]). Such a scheduler allows disk bandwidth to be reserved for each class and enforces these allocations at a fine time scale. Thus, if a certain disk receives requests from n application classes, then we assume that the system dynamically determines the reservations R_1, R_2, ..., R_n for these classes such that the response time needs of each class are met and Σ_{i=1}^{n} R_i = 1 (the reservation R_i essentially denotes the fraction of the total disk bandwidth allocated to class i; 0 ≤ R_i ≤ 1).

4.1.2 Key Requirements

Assuming the above system model, consider a bandwidth allocation technique that dynamically determines the reservations R_1, R_2, ..., R_n based on the requirements of each class. Such a bandwidth allocation scheme should satisfy the following key requirements.

- Meet class response time requirements: Assuming that each class specifies a target response time d_i, the bandwidth allocation technique should allocate sufficient bandwidth to each class to meet its target response-time requirements. Whether this goal can be met depends on the load imposed by each application class and the aggregate load. In scenarios where the response time needs of a class cannot be met (possibly due to overload), the bandwidth allocation technique should attempt to minimize the difference between the observed and the target response times.


Figure 4.1. Relationship between application classes, logical volumes and logical units. Logical volumes on the extreme right and left are accessed by two application classes each, and the one in the center by a single application class. The storage system sees a total of five application classes. Disks comprising the left LU see requests from classes 1, 2 and 3; disks on the right LU see workload from all 5 application classes.

- Performance isolation: Whereas the dynamic allocation technique should react to

changing workloads, for example, by allocating additional bandwidth to classes that see an increased load, such increases in allocations should not affect the performance of less loaded classes. Thus, only spare bandwidth from underloaded classes should be reallocated to classes that are heavily loaded, thereby isolating underloaded classes from the effects of overload.

- Stable overload behavior: Overload is observed when the aggregate workload exceeds disk capacity, causing the target response times of all classes to be exceeded. The bandwidth allocation technique should exhibit stable behavior under overload. This is especially important for a learning-based approach, since such techniques systematically search through various allocations to determine the correct allocation; doing so under overload can result in oscillations and erratic behavior. A well-designed dynamic allocation scheme should prevent such unstable system behavior.


4.1.3 Problem Formulation

To precisely formulate the problem addressed, consider an individual disk from a large storage system that services requests from n application classes. Let d_1, d_2, ..., d_n denote the target response times of these classes. Let Rt_1, Rt_2, ..., Rt_n denote the response times of these classes observed over a period P. Then the dynamic allocation technique should compute reservations R_1, R_2, ..., R_n such that Rt_i ≤ d_i for any class i, subject to the constraint Σ_i R_i = 1 and 0 ≤ R_i ≤ 1. Since it may not always be possible to meet the response time needs of each class, especially under overload, we modify the above condition as follows: instead of requiring Rt_i ≤ d_i for all i, we require that the response time should be less than or as close to the target as possible. That is, (Rt_i − d_i)^+ should be equal to or as close to zero as possible (the notation x^+ equals x for positive values of x and equals 0 for negative values). Instead of attempting to meet this condition for each class, we define a new metric

    σ^+_rt = Σ_{i=1}^{n} (Rt_i − d_i)^+        (4.1)

and require that σ^+_rt be minimized. Observe that σ^+_rt represents the aggregate amount by which the response time targets of the classes are exceeded. Minimizing a single metric σ^+_rt enables the system to collectively minimize the QoS violations across application classes.
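For concreteness, Equation 4.1 can be computed directly from the per-class measurements. The short Python sketch below is illustrative only; the function name and the example numbers are ours, not the thesis's:

    def sigma_rt_plus(observed_rt, targets):
        # aggregate amount (Equation 4.1) by which observed response times exceed their targets
        return sum(max(rt - d, 0.0) for rt, d in zip(observed_rt, targets))

    # Example: three classes with 100 ms targets; only the second class violates its target.
    print(sigma_rt_plus([80.0, 130.0, 95.0], [100.0, 100.0, 100.0]))   # 30.0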

We now present a learning-based approach that tries to minimize the σ^+_rt observed at each disk subject to the key requirements specified in Section 4.1.2.

4.2 A Learning-based Approach

In this section, we first present some background on reinforcement learning and then

present a simple learning-based approach for dynamic storage bandwidth allocation. We

discuss limitations of this approach and present an enhanced learning-based approach that

overcomes these limitations.


4.2.1 Reinforcement Learning Background

Any learning-based approach essentially involves learning from past history. Reinforcement learning involves learning how to map situations to actions so as to maximize a numerical reward (the equivalent of a cost or utility function) [58]. It is assumed that the system does not know which actions to take in order to maximize the reward; instead the system must discover ("learn") the correct action by systematically trying various actions. An action is defined to be one of the possible ways to react to the current system state. The system state is defined to be a subset of what can be perceived from the environment at any given time.

In the dynamic storage allocation problem, an action is equivalent to setting the allocations (i.e., the reservations) of each class. The system state is the vector of the observed response times of the application classes. The objective of reinforcement learning is to maximize the reward despite uncertainty about the environment (in our case, the uncertainty arises due to the variations in the workload). An important aspect of reinforcement learning is that, unlike some learning approaches, no prior training of the system is necessary; all the learning occurs online, allowing the system to deal with unanticipated uncertainties (e.g., events, such as flash crowds, that could not have been anticipated in advance). It is this feature of reinforcement learning that makes it particularly attractive for our problem.

A reward function defines the goal in reinforcement learning; by mapping an action to a reward, it determines the intrinsic desirability of that state. For the storage allocation problem, we define the reward function to be −σ^+_rt: maximizing reward implies minimizing σ^+_rt and the QoS violations of classes. In reinforcement learning, we use reward values learned from past actions to estimate the expected reward of a (future) action.

With the above background, we present a reinforcement learning approach based on action values to dynamically allocate storage bandwidth to classes.


4.2.2 System State

A simple definition of system state is a vector of the response times of the n classes: (Rt_1, Rt_2, ..., Rt_n), where Rt_i denotes the mean response time of class i observed over a period P. Since the response time of a class can take any arbitrary value, the system state space is theoretically infinite. Further, the system state by itself does not reveal if a particular class has met its target response time. Both limitations can be addressed by discretizing the state space as follows: partition the range of the response time (which is [0, ∞)) into four parts

    [0, d_i − τ_i],  (d_i − τ_i, d_i],  (d_i, d_i + τ_i],  (d_i + τ_i, ∞)

and map the observed response time Rt_i into one of these sub-ranges (τ_i is a constant). The first range indicates that the class response time is substantially below its target response time (by a threshold τ_i). The second (third) range indicates that the response time is slightly below (above) the target and by no more than the threshold τ_i. The fourth range indicates a scenario where the target response time is substantially exceeded. We label these four states as lo−, lo, hi and hi+, respectively, with the labels indicating different degrees of over- and under-provisioning of bandwidth (see Figure 4.2). The state of a class is defined as S_i ∈ {lo−, lo, hi, hi+} and the modified state space is a vector of these states for each class: S = (S_1, S_2, ..., S_n). Observe that, since the state of a class can take only four values, the potentially infinite state space is reduced to a size of 4^n.
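The discretization above maps directly to a small amount of code. The sketch below is an illustration under the stated definitions; the names and example values are ours:

    def class_state(rt, d, tau):
        # map an observed mean response time to one of the four discrete class states
        if rt <= d - tau:
            return "lo-"    # substantially below target (heavily underloaded)
        elif rt <= d:
            return "lo"     # slightly below target
        elif rt <= d + tau:
            return "hi"     # slightly above target
        else:
            return "hi+"    # substantially above target (heavily overloaded)

    def system_state(observed_rt, targets, taus):
        return tuple(class_state(rt, d, tau)
                     for rt, d, tau in zip(observed_rt, targets, taus))

    # Example with d_i = 100 ms and tau_i = 20 ms for two classes:
    print(system_state([70.0, 135.0], [100.0, 100.0], [20.0, 20.0]))   # ('lo-', 'hi+')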

4.2.3 Allocation Space

Figure 4.2. Discretizing the state space: the response-time axis [0, ∞) is divided at d − τ, d, and d + τ into the regions lo− (heavy underload), lo (underload), hi (overload), and hi+ (heavy overload).

The reservation of a class R_i is a real number between 0 and 1. Hence, the allocation space (R_1, R_2, ..., R_n) is infinite due to the infinitely many allocations for each class. Since a learning approach must search through all possible allocations to determine an appropriate allocation for a particular state, this makes the problem intractable. To discretize the allocation space, we impose a restriction that requires the reservation of a class to be modified in steps of T, where T is an integer. For instance, if the step size is chosen to be 1% or 5%, the reservation of a class can only be increased or decreased by a multiple of the step size. Imposing this simple restriction results in a finite allocation space, since the reservation of a class can only take one of m possible values, where m = 100/T. With n classes, the number of possible combinations of allocations is C(m+n−1, m), resulting in a finite allocation space. Choosing an appropriate step size allows allocations to be modified at a sufficiently fine grain, while keeping the allocation space finite. In what follows, we use the terms action and allocation interchangeably.
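The size of this discretized allocation space can be checked by enumeration. The sketch below is purely illustrative; it expresses reservations in percent and assumes they sum to 100, consistent with Σ_i R_i = 1:

    from itertools import product
    from math import comb

    def allocations(n, T):
        # all reservation vectors for n classes, in steps of T percent, summing to 100
        m = 100 // T
        for parts in product(range(m + 1), repeat=n - 1):
            if sum(parts) <= m:
                yield tuple(p * T for p in parts) + ((m - sum(parts)) * T,)

    n, T = 2, 5
    m = 100 // T
    print(len(list(allocations(n, T))), comb(m + n - 1, m))   # 21 21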

4.2.4 Cost and State Action Values

For the above definition of state space, we observe that the response time needs of a class are met so long as it is in the lo− or lo states. In the event an application class is in the hi or hi+ states, the system needs to increase the reservation of the class, assuming spare bandwidth is available, to induce a transition back to lo− or lo. This is achieved by computing a new set of reservations (R_1, R_2, ..., R_n) so as to maximize the reward −σ^+_rt. Note that the maximum value of the reward is zero, which occurs when the response time needs of all classes are met (see Equation 4.1).

A simple method for determining the new allocation is to pick one based on the observed rewards of previous actions from this state. An action (allocation) that resulted in the largest reward (−σ^+_rt) is likely to do so again and is chosen over other lower-reward


actions. Making this decision requires that the system first try out all possible actions, possibly multiple times, and then choose one that yields the largest reward. Over a period of time, each action may be chosen multiple times and we store an exponential average of the observed reward from this action (to guide future decisions):

    Q^new_{(S_1, S_2, ..., S_n)}(a) = γ · Q^old_{(S_1, S_2, ..., S_n)}(a) + (1 − γ) · (−σ^+_rt(a))        (4.2)

where Q denotes the exponentially averaged value of the reward for action a taken from state (S_1, S_2, ..., S_n) and γ is the exponential smoothing parameter (also known as the forgetting factor). Learning methods of this form, where the actions selected are based on estimates of action-reward values (also referred to as action values), are referred to as action-value methods.

We choose an exponential average over a sample average because the latter is appropriate only for stationary environments. In our case, the environment is non-stationary due to the changing workloads, and the same action from a state may yield different rewards depending on the current workload. For such scenarios, recency-weighted exponential averages are more appropriate. With 4^n states and C(m+n−1, m) possible actions in each state, the system will need to store C(m+n−1, m) · 4^n such averages, one for each action.
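The update of Equation 4.2 amounts to keeping, per (state, action) pair, a recency-weighted average of the observed reward. A minimal sketch, with illustrative names and numbers of our own choosing:

    def update_action_value(q_table, state, action, sigma_plus, gamma=0.5):
        # Equation 4.2: exponentially averaged reward, where the reward of an action is -sigma^+_rt(a)
        reward = -sigma_plus
        values = q_table.setdefault(state, {})
        old = values.get(action)
        values[action] = reward if old is None else gamma * old + (1.0 - gamma) * reward

    q_table = {}
    state, action = ("lo-", "hi+"), (30, 70)
    update_action_value(q_table, state, action, sigma_plus=40.0)
    update_action_value(q_table, state, action, sigma_plus=10.0)
    print(q_table[state][action])   # -25.0 with gamma = 0.5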

4.2.5 A Simple Learning-based Approach

A simple learning approach is one that systematically tries out all possible allocations from each system state, computes the reward for each action, and stores these values to guide future allocations. Note that it is the discretization of the state space and the allocation space, as described in Sections 4.2.2 and 4.2.3, which makes this approach possible. Once the reward values are determined for the various actions, upon a subsequent transition to this state the system can use these values to pick an allocation with the maximum reward. The set of learned reward values for a state is also referred to as the history of the state. As an example, consider two application classes that are allocated 50% each of the


disk bandwidth and are in (lo−, lo−). Assume that a workload change causes a transition to (lo−, hi+). Then the system needs to choose one of several possible allocations: (0, 100), (5, 95), (10, 90), ..., (100, 0). Choosing one of these allocations allows the system to learn the reward −σ^+_rt that accrues as a result of that action. After trying all possible allocations, the system can use these learned values to directly determine an allocation that maximizes reward (by minimizing the aggregate QoS violations). Learning thus facilitates a quicker and more suitable reassignment of class allocations. Figure 4.3 shows the steps involved in a learning-based approach.

Although such a reinforcement learning scheme is simple to design and implement, it has numerous drawbacks.

- Actions are oblivious of system state: A key drawback of this simple learning approach is that the actions are oblivious of the system state: the approach tries all possible actions, even ones that are clearly unsuitable for a particular state. In the above example, for instance, any allocation that decreases the share of the overloaded hi+ class and increases that of the underloaded lo− class is incorrect. Such an action can worsen the overall system performance. Nevertheless, such actions are explored to determine their reward. The drawback arises primarily because the semantics of the problem are not incorporated into the learning technique.

- No performance isolation: Since the system state is not taken into account while making allocation decisions, the approach cannot provide performance isolation to classes. In the above example, an arbitrary allocation of (0, 100) can severely affect the lo− class while favoring the overloaded class.

- Large search space and memory requirements: Since there are C(m+n−1, m) possible allocations in each of the 4^n states, a systematic search of all possible allocations is impractical. This overhead is manageable when n = 2 classes and m = 20 (which corresponds to a step size of 5%; m = 100/5), since there are only C(21, 20) = 21


allocations for each of the 4^2 = 16 states. However, for n = 5 classes, the number of possible actions increases to 10626 for each of the 4^5 states. Since the number of possible actions increases exponentially with an increase in the number of classes, so does the memory requirement (since the reward for each allocation needs to be stored in memory to guide future allocations). For n = 5 classes and m = 20, 83 MB of memory is needed per disk to store these reward values (a quick check of this figure appears in the sketch below). This overhead is impractical for storage systems with a large number of disks.

Figure 4.3. Steps involved in learning: class-specific queues feed requests to the QoS-aware disk scheduler at the storage device; every recomputation period P, the system averages the class response times, determines the system state, computes the reward, updates the action values, and computes a new allocation.
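The memory estimate in the last drawback can be verified with a quick calculation; the sketch below assumes 8-byte floating-point reward values, which is our assumption rather than a figure stated in the text:

    from math import comb

    n, T, bytes_per_value = 5, 5, 8          # 5 classes, 5% step size, 8-byte values (assumed)
    m = 100 // T                             # 20
    actions_per_state = comb(m + n - 1, m)   # 10626
    states = 4 ** n                          # 1024
    total = actions_per_state * states * bytes_per_value
    print(actions_per_state, states, round(total / 2**20))   # 10626 1024 83 (MB per disk)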

4.2.6 An Enhanced Learning-based Approach

We design an enhanced learning approach that uses the semantics of the problem to overcome the drawbacks of the naive learning approach outlined in the previous section. The key insight used in the enhanced approach is to use the state of a class to determine whether to increase or decrease its allocation (instead of naively exploring all possible allocations). In the example listed in the previous section, for instance, only those allocations that increase the reservation of the overloaded class and decrease the allocation of the underloaded class are considered. The technique also includes provisions to provide


performance isolation, achieve stable overload behavior, and reduce memory and search space overheads.

Figure 4.4. Algorithm flowchart: determine the system state from the class response times; if all classes are in the same state, leave the allocation unchanged, except that if some hi+ class is below its default allocation, all allocations are reset to their defaults; if some classes are overloaded and some underloaded, reassign T from an underloaded to an overloaded class, or take the action with the best reward if history exists.

Initially, we assume that the allocations of all classes are set to a default value (a simple default allocation is to assign equal shares to the classes; any other default may be specified). We assume that the allocations of classes are recomputed every P time units. To do so, the technique first determines the system state and then computes the new allocation for this state as follows:

- Case I: All classes are underloaded (are in lo− or lo). Since all classes are in lo or lo−, by definition, their response time needs are satisfied and no action is necessary. Hence, the allocation is left unchanged. An optimization is possible when some classes are in lo− and some are in lo. Since the goal is to drive all classes to as low a state as possible, one can reallocate bandwidth from the classes in lo− to the classes in lo. How bandwidth is reallocated and history maintained to achieve this is similar to the approach described in Case III below.

- Case II: All classes are overloaded (are in hi or hi+). Since all classes are in hi or hi+, the target response times of all classes are exceeded, indicating an overload


situation. While every class can use extra bandwidth, none exists in the system. Since no spare bandwidth is available, we leave the allocations unchanged.

An additional optimization is possible in this state. If some class is heavily overloaded (i.e., is in hi+) and is currently allocated less than its initial default allocation, then the allocation of all classes is set to their default values (the allocation is left unchanged otherwise). The insight behind this action is that no class should be in hi+ due to starvation resulting from an allocation less than its default. Resetting the allocations to their default values during such heavy overloads ensures that the system performance is no worse than a static approach that allocates the default allocation to each class.

- Case III: Some classes are overloaded, others are underloaded (some in hi+ or hi and some in lo or lo−). This is the scenario where learning is employed. Since some

classes are underloaded while others are overloaded, the system should reallocate spare bandwidth from underloaded classes to overloaded classes. Initially, there is no history in the system and the system must learn how much bandwidth to reassign from underloaded to overloaded classes. Once some history is available, the reward values from past actions can be used to guide the reallocation.

The learning occurs as follows. The application classes are partitioned into two sets: lenders and borrowers. A class is assigned to the lenders set if it is in lo or lo−; classes in hi and hi+ are deemed borrowers. The basic idea is to reduce the allocation of a lender by T and reassign this bandwidth to a borrower. Note that the bandwidth of only one lender and one borrower is modified at any given time, and only by the step size T; doing so systematically reassigns spare bandwidth from lenders to borrowers, while learning the rewards from these actions.

Different strategies can be used to pick a lender and a borrower. One approach is to pick the most needy borrower and the most over-provisioned lender (these classes can be identified by how far the class is from its target response time; the greater


this difference, the greater the need or the available spare bandwidth). Another approach is to cycle through the list of lenders and borrowers and reallocate bandwidth to classes in a round-robin fashion. The latter strategy ensures that the needs of all borrowers are met in a cyclic fashion, while the former strategy focuses on the most needy borrower before addressing the needs of the remaining borrowers. Regardless of the strategy, the system state is recomputed P time units after each reallocation. If some classes continue to be overloaded while others are underloaded, we repeat the above process. If the system transitions to a state defined by Case I or II, we handle it as discussed above.

The reward obtained after each allocation is stored as an exponentially smoothed average (as shown in Equation 4.2). However, instead of storing the rewards of all possible actions, we only store the rewards of the actions that yield the k highest rewards. The insight here is that the remaining actions do not yield a good reward and, since the system will not consider them subsequently, we do not need to store the corresponding reward values. These actions and their corresponding reward estimates are stored as a linked list, with neighboring elements in the list differing in the allocations of two classes, one lender and one borrower, by the step size T. This facilitates a systematic search for the suitable allocation for a state, and also pruning of the list to maintain a size of no more than k. By storing a fixed number of actions and rewards for any given state, the memory requirements can be reduced substantially. Further, while the allocation of a borrower and a lender is changed only by T in each step during the initial learning process, these can be changed by a larger amount subsequently, once some history is available (this is done by directly picking the allocation that yields the maximum reward).

Figure 4.4 summarizes our technique. As a final optimization, we use a small non-zero probability to bias the system to occasionally choose a neighboring allocation instead of the allocation with the highest reward (a neighboring allocation is one that differs from

the allocation with the highest reward (a neighboring allocation is one that differs from

87

Page 105: SELF-MANAGING TECHNIQUES FOR STORAGE RESOURCE …lass.cs.umass.edu/theses/vijay.pdf · Tyler Trafford has been most helpful with configuring the Lin ux cluster and the storage testbed

the best allocation by the step sizeT for the borrowing and lending classes, e.g.,(30; 70)instead of(35; 65) whenT = 5%). The reason we do this is that it is possible the value of

an allocation is underestimated as a result of a sudden workload reversal, and the system

may thus select the best allocation based on the current history. An occasional choice of a

neighboring allocation ensures that the system explores the state space sufficiently well to

discover a suitable allocation.
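The overall decision logic of the enhanced approach (Cases I through III above, together with the occasional exploration just described) can be summarized in a short sketch. The Python below is a simplified illustration, not the prototype implementation: it expresses reservations in percent, uses a class's current allocation as a crude stand-in for its distance from the target response time when picking a lender and a borrower, and folds the exploration of a neighboring allocation into a single probability; these simplifications are ours.

    import random

    def recompute_allocations(alloc, state, defaults, q_table, T=5, explore=0.1):
        # alloc, defaults: class -> reservation in percent; state: class -> "lo-"/"lo"/"hi"/"hi+"
        classes = sorted(alloc)
        lenders = [c for c in classes if state[c] in ("lo-", "lo")]
        borrowers = [c for c in classes if state[c] in ("hi", "hi+")]

        if not borrowers:                                    # Case I: all classes underloaded
            return dict(alloc)
        if not lenders:                                      # Case II: all classes overloaded
            if any(state[c] == "hi+" and alloc[c] < defaults[c] for c in classes):
                return dict(defaults)                        # reset starved hi+ classes to defaults
            return dict(alloc)

        # Case III: reuse history for this system state if it exists (with occasional exploration)
        key = tuple(state[c] for c in classes)
        history = q_table.get(key, {})
        if history and random.random() > explore:
            best = max(history, key=history.get)             # stored allocation with highest reward
            return dict(zip(classes, best))

        # otherwise reassign one step of size T from a lender to a borrower
        lender = max(lenders, key=lambda c: alloc[c])
        borrower = min(borrowers, key=lambda c: alloc[c])
        new_alloc = dict(alloc)
        if new_alloc[lender] >= T:
            new_alloc[lender] -= T
            new_alloc[borrower] += T
        return new_alloc

    # e.g. recompute_allocations({"A": 50, "B": 50}, {"A": "lo-", "B": "hi+"},
    #                            {"A": 50, "B": 50}, q_table={}, T=5)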

Observe that our enhanced learning approach reclaims bandwidth only from those classes that have bandwidth to spare (lo and lo− classes) and reassigns this bandwidth to classes that need it. Since a borrower takes up bandwidth in increments of T from a lender, the lender could in the worst case end up in state hi (see the footnote below on the choice of T). At this stage there would be a state change, and the action would be dictated by this new state. Thus, this strategy ensures that any new allocation chosen by the approach can only improve (and not worsen) the system performance; doing so also provides a degree of performance isolation to classes.

The technique also takes the current system state into account while making allocation decisions and thereby avoids allocations that are clearly inappropriate for a particular state; in other words, the optimized learning technique intelligently guides and restricts the allocation space explored. Further, since only the k highest-reward actions are stored, the worst-case search overhead is reduced to O(k). This results in a substantial reduction from the search overheads of the simple learning approach. Finally, the memory needs of the technique reduce from C(m+n−1, m) reward values per state to 4^n · k values in total, where k is the number of high-reward actions for which history is maintained. This design decision also results in a substantial reduction in the memory requirements of the approach. In the case of 5 application classes, T = 5% (recall m = 100/T) and k = 5, for example, the technique yields more than a 99% reduction in memory needs over the simple learning approach.

(Footnote: The choice of the step size T is of importance here. If the step size is too big, the overloaded class could end up in underload and vice versa, and this could result in oscillations.)


4.3 Implementation in Linux

We have implemented our techniques in the Linux kernel version 2.4.9. Our prototype consists of three components: (i) a QoS-aware disk scheduler that supports per-class reservations, (ii) a module that monitors the response times of each class, and (iii) a learning-based bandwidth allocator that periodically recomputes the reservations of the classes on each disk. Our prototype was implemented on a Dell PowerEdge server (model 2650) with two 1 GHz Pentium III processors and 1 GB memory that runs RedHat Linux 7.2. The server was connected to a Dell PowerVault storage pack (model 210) with eight SCSI disks. Each disk is an 18GB 10,000 RPM Fujitsu MAJ3182MC disk; the characteristics of the disk are shown in Table 2.1 in Chapter 2. We use the software RAID driver in Linux to configure the system as a single RAID-0 array.

We implement the Cello QoS-aware disk scheduler in the Linux kernel [54]. The disk scheduler supports a configurable number of application classes and allows a fraction of the disk bandwidth to be reserved for each class (these reservations can be set using the scheduler system call interface). These reservations are then enforced on a fine time scale, while taking disk seek overheads into account. We extend the open system call to allow applications to associate file I/O with an application class; all subsequent read and write operations on the file are then associated with the specified class. The use of our enhanced open system call interface requires application source code to be modified. To enable legacy applications to benefit from our techniques, we also provide a command-line utility that allows a process (or a thread) to be associated with an application class; all subsequent I/O from the process is then associated with that class. Any child processes that are forked by this process inherit these attributes and their I/O requests are treated accordingly.

We also add functionality into the Linux kernel to monitor the response times of requests in each class (at each disk); the response time is defined to be the sum of the queuing

delay and the disk service times. We compute the mean response time in each class over a moving window of duration P.

(Footnote: The Fujitsu MAJ3182MC disk has an average seek overhead of 4.7 ms, an average latency of 2.99 ms, and a data transfer rate of 39.16 MB/s.)

The bandwidth allocator runs as a privileged daemon in user space. It periodically queries the monitoring module for the response time of each class; this can be done using a special-purpose system call or via the /proc interface in Linux. The response time values are then used to compute the system state. The new allocation is then determined and conveyed to the disk scheduler using the scheduler interface.

4.4 Experimental Evaluation

In this section, we demonstrate the efficacy of our techniques using a combination of

prototype experimentation and simulations. In what follows, we first present our simulation

methodology and simulation results, followed by results from our prototype implementa-

tion.

4.4.1 Simulation Methodology and Workload

We use an event-based storage system simulator to evaluate our bandwidth allocation technique. The simulator simulates a disk array that is accessed by multiple application classes. Each disk in the array is modeled as an 18GB 10,000 RPM Fujitsu MAJ3182MC disk. The disk array is assumed to be configured as a RAID-0 array with multiple volumes; unless specified otherwise we assume an array of 8 disks. Each disk in the system is assumed to employ a QoS-aware disk scheduler that supports class-specific reservations; we use the Cello disk scheduler [54] for this purpose. Observe that the hardware configuration assumed in our simulations is identical to that in our prototype implementation. We assume that the system monitors the response times of each class over a period P and recomputes the allocations after each such period. We choose P = 5s in our experiments. Unless specified otherwise, we choose a target response time of d_i = 100 ms for each class and the


threshold τ_i for discretizing the class states into the lo−, lo, hi and hi+ categories is set to 20 ms.

We use two types of workloads in our simulations: trace-driven and synthetic. We use NFS traces to determine the effectiveness of our methods for real-world scenarios. However, since a trace workload only represents a small subset of the operating region, we use a synthetic workload to systematically explore the state space.

We use portions of an NFS trace gathered from the Auspex file server at Berkeley

[23] to generate the trace-driven workload. To account for caching effects, we assume a

large LRU buffer cache at the server and filter out requests resulting in cache hits from the

original trace; the remaining requests are assumed to result in disk accesses. The resulting

NFS trace is very bursty and has a peak-to-average bit rate of 12.5.

Our synthetic workload consists of Poisson-arriving clients that read a randomly selected

file. File sizes are assumed to be heavy-tailed; we assume fixed-size requests that sequen-

tially read the selected file. By carefully controlling the arrival rates of such clients, we can

construct transient overload scenarios (where a burst of clients arrive in quick succession).

Next, we present our experimental results.

4.4.2 Effectiveness of Dynamic Bandwidth Allocation

We begin with a simple simulation experiment to demonstrate the behavior of our dynamic bandwidth allocation approach in the presence of varying workloads. We configure the system with two application classes. We choose an exponential smoothing parameter γ = 0.5, a learning step size T = 5%, and k = 5 stored values per state. The target response time is set to 75 ms for each class and the recomputation period was 5s. Each class is initially assigned 50% of the disk bandwidth.

We use a synthetic workload for this experiment. Initially both classes are assumed to have 5 concurrent clients each; each client reads a randomly selected file by issuing 4 KB requests. At time t = 100s, the workload in class 1 is gradually increased to 8 concurrent


clients. At t = 600s, the workload in class 2 is gradually increased to 8 clients. The system experiences a heavy overload from t = 700s to t = 900s. At t = 900s, several clients depart and the load reverts to the initial load. We measure the response times of the two classes and then repeat the experiment with a static allocation of (50%, 50%) for the two classes.

Figure 4.5. Behavior of the learning-based dynamic bandwidth allocation technique. (a) Workload (number of clients per class over time); (b) Average response time of class 1; (c) Average response time of class 2 (each response-time panel compares static and learning-based allocation against the target response time).

Figure 4.5 depicts the class response times. As shown, the dynamic allocation technique adapts to the changing workload and yields response times that are close to the target. Further, due to the adaptive nature of the technique, the observed response times are, for the most part, better than those in the static allocation. Observe that, immediately after a workload change, the learning technique requires a short period of time to learn and adjust


the allocations, and this temporarily yields a response time that is higher than that in the static case (e.g., at t = 600s in Figure 4.5(b)). Also, observe that between t = 700s and t = 900s the system experiences a heavy overload and, as discussed in Case II of our approach, the dynamic technique resets the allocation of both hi+ classes to their default values, yielding performance that is identical to the static case.

Figure 4.6. Comparison with alternative approaches (cumulative QoS violations over time). (a) Trace workload: static allocation vs. dynamic allocation with and without learning; (b) Comparison with simple learning: static, naive learning, and enhanced learning.

4.4.3 Comparison with Alternative Approaches

In this section, we compare our learning-based approach with three alternative approaches: (i) static, where the allocation of classes is chosen statically; (ii) dynamic allocation with no learning, where the allocation technique is identical to our technique but no learning is employed (i.e., allocations are left unchanged when all classes are underloaded or overloaded, as in Cases I and II in Section 4.2.6, and in Case III bandwidth is reassigned from the least underloaded class to the most overloaded class in steps of T, but no learning is employed); and (iii) the simple learning approach outlined in Section 4.2.5.

We use the NFS traces to compare our enhanced learning approach with the static and the dynamic allocation techniques with no learning. We configure the system with three


classes with different scale factors and set the target response time of each class to 100 ms. The recomputation period is chosen to be 5s. We use different portions of our NFS trace to generate the workload for the three classes. The stripe unit size for the RAID-0 array is chosen to be 8 KB. We use about 2.8 hours of the trace for this experiment.

We run the experiment for our learning-based allocation technique and repeat it for static allocation and dynamic allocation without learning. In Figure 4.6(a) we plot the cumulative sum of σ^+_rt (i.e., the cumulative QoS violations observed over the duration of the experiment) for the three approaches; this metric helps us quantify the performance of an approach in the long run. Not surprisingly, the static allocation technique yields the worst performance and incurs the largest number of QoS violations. The dynamic allocation technique without learning yields a substantial improvement over the static approach, while dynamic allocation with learning yields a further improvement. Observe that the gap between static and dynamic allocation without learning depicts the benefits of dynamic allocation over static allocation, while the gap between the technique without learning and our technique depicts the additional benefits of employing learning. Overall, we see a factor of 3.8 reduction in QoS violations when compared to a pure static scheme and a factor of 2.1 when compared to a dynamic technique with no learning.

Our second experiment compares our enhanced learning approach with the simple learning approach described in Section 4.2.5. Most parameters are identical to the previous scenario, except that we only assume two application classes instead of three for this experiment. Figure 4.6(b) plots the cumulative QoS violations observed for the two approaches (we also plot the performance of static allocation for comparison). As can be seen, the naive learning approach incurs a larger search/learning overhead since it systematically searches through all possible actions. In doing so, incorrect actions that degrade system performance are explored and actually worsen performance. Consequently, we

see a substantially larger number of QoS violations in the initial period; the slope of the violation curve reduces sharply once some history is available to make more informed decisions. Consequently, during this initial learning process, the naive learning approach under-performs even the static scheme; the enhanced learning technique does not suffer from these drawbacks and, as before, yields the best performance.

(Footnote: The scale factor scales the inter-arrival times of requests and allows control over the burstiness of the workload.)

Figure 4.7. Impact of tunable parameters (normalized cumulative QoS violations): (a) effect of the smoothing parameter (forgetting factor γ); (b) effect of the step size T; (c) effect of k, the number of values stored per state.


4.4.4 Effect of Tunable Parameters

We conduct several experiments to study how the choice of three tunable parameters affects the system behavior: the exponential smoothing parameter γ, the step size T, and the history size k that defines the number of high-reward actions stored by the system.

First, we study the impact of the smoothing parameter γ. Recall from Equation 4.2 that γ = 0 implies that only the most recent reward value is considered, while γ = 1 completely

ignores reward values. We chooseT = 5% andk = 5. We vary systematically from 0.0

to 0.9, in steps of 0.1 and study its impact on the observed QoSviolations. We normalize

the cumulative QoS violations observed for each value of with the minimum number of

violations observed for the experiment. Figure 4.7(a) plots our results. As shown in the

figure, the observed QoS violations are comparable for values in the range (0,0.6). The

number of QoS violations increases for larger values of gamma—larger values of provide

less importance to more recent reward values and consequently, result in larger QoS vio-

lations. This demonstrates that, in the presence of dynamically varying workloads, recent

reward values should be given sufficient importance. We suggest choosing a between 0.3

and 0.6 to strike a balance between the recent reward values and those learned from past

history.
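To make the role of γ concrete, the following is a minimal sketch (not the dissertation's code), assuming Equation 4.1 takes the standard exponential-smoothing form in which γ weights the previously accumulated reward:

    # Illustrative sketch, assuming Equation 4.1 has the standard form
    #   reward = gamma * old_reward + (1 - gamma) * new_reward,
    # so that gamma = 0 keeps only the newest reward and gamma = 1 ignores it.
    def smooth_reward(old_reward, new_reward, gamma=0.5):
        """Blend the accumulated reward estimate with the newest observation."""
        return gamma * old_reward + (1.0 - gamma) * new_reward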

Next, we study the impact of the step size T. We choose γ = 0.5, k = 4, and vary T from 1% to 10%, observing its impact on system performance. Note that a small value of T allows fine-grain reassignment of bandwidth but can increase the time to search for the
correct allocation (since the allocation is varied only in steps of T). In contrast, a larger
value of T permits a faster search but only coarse-grain reallocation. Figure 4.7(b)
plots the normalized QoS violations for different values of T. As shown, very small values
of T result in a substantially higher search overhead and increase the time to converge to the
correct allocation, resulting in more QoS violations. Moderate step sizes ranging from 3%
to as large as 10% seem to provide comparable performance. To strike a balance between
fine-grain allocation and low learning (search) overheads, we suggest step sizes ranging


from 3-7%. Essentially, the step size should be sufficiently large to result in a noticeable
improvement in the response times of borrowers but not large enough to adversely affect a
lender class (by reclaiming too much bandwidth).
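As a small illustration of how the step size is used, the sketch below shifts bandwidth between a hypothetical lender class and borrower class in steps of T percent; the class names and the allocation structure are illustrative, not taken from the prototype:

    # Illustrative sketch: reassign bandwidth from a lender class to a
    # borrower class in steps of T percent of the array bandwidth.
    # The dictionary of per-class shares and the class names are hypothetical.
    def reallocate(shares, lender, borrower, step_pct=5.0):
        """Move up to step_pct of bandwidth from lender to borrower."""
        moved = min(step_pct, shares[lender])
        shares[lender] -= moved
        shares[borrower] += moved
        return shares

    shares = {"database": 40.0, "streaming": 30.0, "web": 30.0}
    shares = reallocate(shares, lender="web", borrower="database", step_pct=5.0)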

Finally, we study the impact of varying the history size k on performance. We
choose γ = 0.5, T = 5% and vary k from 1 to 10. Figure 4.7(c) plots the cumulative QoS violations
normalized by the minimum number of violations observed across history sizes. Initially, increasing the history size results in a small decrease
in the number of QoS violations, indicating that additional history allows the system to
make better decisions. However, increasing the history size beyond 5 does not yield any
additional improvement. This indicates that storing a small number of high-reward actions
is sufficient, and that it is not necessary to store the reward for every possible action, as in
the naive learning technique, to make informed decisions. Using a small value of k also
yields a substantial reduction in the memory requirements of the learning approach.
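To illustrate why a small k keeps memory low, the bounded per-state history can be sketched as follows (a hypothetical structure, not the actual implementation): each state retains only the k highest-reward actions observed so far.

    # Illustrative sketch: keep only the k highest-reward (reward, action)
    # pairs per state instead of a reward for every possible action.
    # All names here are hypothetical.
    import heapq
    import itertools
    from collections import defaultdict

    class TopKRewardStore:
        """Keep only the k highest-reward actions observed for each state."""

        def __init__(self, k=5):
            self.k = k
            self._counter = itertools.count()      # tie-breaker for the heap
            self.table = defaultdict(list)         # state -> min-heap of (reward, seq, action)

        def record(self, state, action, reward):
            heapq.heappush(self.table[state], (reward, next(self._counter), action))
            if len(self.table[state]) > self.k:
                heapq.heappop(self.table[state])   # evict the lowest-reward entry

        def best_action(self, state):
            heap = self.table[state]
            return max(heap)[2] if heap else None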

4.4.5 Implementation Experiments

We now demonstrate the effectiveness of our approach by conducting experiments on

our Linux prototype. As discussed in Section 4.3, our prototype consists of an 8-disk sys-

tem, configured as RAID-0 using the software RAID driver in Linux. We construct three

volumes on this array, each corresponding to an application class. We use a mix of three
different applications in our study, each of which belongs to a different class: (1) PostgreSQL
database server: We use the publicly available PostgreSQL database server version
7.2.3 and the pgbench 1.2 benchmark. This benchmark emulates the TPC-B transactional

benchmark and provides control over the number of concurrent clients as well as the num-

ber of transactions performed by each client. The benchmark generates a write-intensive
workload with small writes. (2) MPEG Streaming Media Server: We use a home-grown
MPEG-1 streaming media server to stream 90-minute videos to multiple clients over UDP.
Each video has a constant bit rate of 2.34 Mb/s and represents a sequential workload with


large reads. (3) Apache Web Server: We use the Apache web server and the publicly available
SURGE web workload generator to generate web workloads. We configure SURGE
to generate a workload that emulates 300 time-sharing users accessing a 2.3 GB data-set
with 100,000 files. We use the default settings in SURGE for the file size distribution,
request size distributions, file popularity, temporal locality and idle periods of users. The
resulting workload is largely read-only and consists of small to medium size reads. Each of
the above applications is assumed to belong to a separate application class. To ensure that our
results are not skewed by a largely empty disk array, we populated the array with a variety
of other large and small files so that 50% of the 144 GB storage space was utilized. We
choose γ = 0.5, T = 5%, k = 5 and a recomputation period P = 5s. The target response
times of the three classes are set to 40ms, 50ms and 30ms, respectively.

We conduct a 10-minute experiment where the workload on the streaming server is fixed
to 2 concurrent clients (total I/O rate of 4.6 Mb/s). The database server is lightly loaded in
the first half of the experiment and we gradually increase the load on the Apache web server
(by starting a new instance of the SURGE client every minute; each new client represents
300 additional concurrent users). At t = 5 minutes, the load on the web server reverts to the

initial load (a single SURGE client). For the second half of the experiment, we introduce

a heavy database workload by configuring pgbench to emulate 20 concurrent users each

performing 500 transactions (thereby introducing a write-intensive workload).

Figure 4.8(a) plots the cumulative QoS violations observedover the duration of the

experiment for our learning technique and the static allocation technique. As shown, for the

first half of the experiment, there are no QoS violations, since there is sufficient bandwidth

capacity to meet the needs of all classes. The arrival of a heavy database workload triggers

a reallocation in the learning approach and allows the system to adapt to this change. The

static scheme is unable to adapt and incurs a significantly larger number of violations.

Figure 4.8(b) plots the time-series of the response times for the database server. As shown,

the adaptive nature of the learning approach enables it to provide better response times to


Figure 4.8. Results from our prototype implementation: (a) cumulative QoS violations over time (secs.) for the static and learning approaches; (b) average response time (ms) of the database server over time for both approaches, shown against the target response time.

the database server. While the learning technique provides comparable or better response

time than static allocation for the web server, we see that both approaches are able to

meet the target response time requirements (due to the light web workload in the second

half, the observed response times are also very small). We observe a similar behavior for

the web server and the streaming server. As mentioned before, learning could perform

worse at some instants, either if it is exploring the allocation space or due to a sudden

workload change, and it requires a short period to readjust the allocations. In Figure 4.8(b)
this happens around t = 400 s when learning performs worse than static, but the approach

quickly takes corrective action and gives better performance.

Overall, the behavior of our prototype implementation is consistent with our simulation

results.

4.4.6 Implementation Overheads

Our final experiment measures the implementation overheads of our learning-based
bandwidth allocator. To do so, we vary the number of disks in the system from 50 to 500,
in steps of 50, and measure the memory and CPU requirements of our bandwidth allocator.
Observe that since we are constrained by an 8-disk system, we emulate a large storage


system by simply replicating the response times observed at a single disk and reporting
these values for all emulated disks. From the perspective of the bandwidth allocator, the
setup is no different from one where these disks actually exist in the system. Further, since
the allocation on each disk is computed independently, such a strategy accurately measures
the memory and CPU overheads of our technique. We assume that new allocations are
computed once every 5s.

We find the CPU requirement of our approach to be less than 0.1% even for sys-
tems with 500 disks, indicating that the CPU overhead of the learning approach is negli-
gible. The memory overheads of the allocation daemon are also small, with the percentage
of memory used on a server with 1 GB RAM varying (almost linearly) from 1 MB (0.1%)
for a 50-disk system to 7 MB (0.7%) for a 500-disk system. We note that this memory
usage is for an untuned version of our allocator where we maintain numerous additional
statistics for conducting our experiments; the actual memory requirements will be smaller
than those reported here, indicating that the technique can be used in practical systems.

Finally, note that the system call overheads of querying response times and conveying
the new allocations to the disk scheduler can be substantial in a 500-disk system (this
involves 1000 system calls every 5 seconds, two for each disk). However, observe that the
bandwidth allocator was implemented in user-space for ease of debugging; the functionality
can be easily migrated into kernel-space, thereby eliminating this system call overhead.
Overall, our results demonstrate the feasibility of using a reinforcement learning approach

for dynamic storage bandwidth allocation in large storage systems.

4.5 Related Work

The design of self-managing storage systems involves several sub-tasks and issues
such as self-configuration [7, 10], capacity planning [14], automatic RAID-level selection
[11], initial storage system configuration [9], SAN fabric design [59] and on-line data mi-


Figure 4.9. Memory overheads of the bandwidth allocator: memory usage as a percentage of 1 GB RAM versus the number of disks (50 to 500).

gration [39]. These efforts are complementary to our work, which focuses on automatic
storage bandwidth allocation to applications with varying workloads.

Several other approaches ranging from control theory to online measurements and op-

timizations can also be employed to address the problem of dynamic bandwidth allocation

in storage systems. Subsequent to our work [57], control-theoretic and measurement-based
techniques [33, 40] have been proposed for managing storage bandwidth. Control-theoretic
techniques [3] as well as online measurements and optimizations [12, 47] have also
been employed for dynamically allocating resources in web servers. Utility-based opti-
mization models for dynamic resource allocation in server clusters have been employed
in [19]. Feedback-based dynamic proportional-share allocation to meet real-rate disk I/O
requirements has been studied in [48]. While many feedback-based methods involve ap-
proximations such as the assumption of a linear relationship between resource share and
response time, no such limitation exists for reinforcement learning: owing to its search-
based approach, it can easily handle non-linearity in system behavior. Al-
ternative techniques based on linear programming also make the linearity assumption, and
need a linear objective function which is minimized; such a linear formulation may not be


possible or might turn out to be inaccurate in practice. On the other hand, a hill-climbing

based approach can handle non-linearity, but can get stuck in local maxima.

Finally, reinforcement learning has also been used to address other systems issues such

as dynamic channel allocation in cellular telephone systems [55] and adaptive link alloca-

tion in ATM networks [44].

4.6 Concluding Remarks

In this chapter, we addressed the problem of dynamic allocation of storage bandwidth

to application classes so as to meet their response time requirements. We presented an

approach based on reinforcement learning to address this problem. We argued that a sim-

ple learning-based approach is not practical since it incurs significant memory and search

space overheads. To address this issue, we used application-specific knowledge to design
an efficient, practical learning-based technique for dynamic storage bandwidth allocation.
Our approach can

react to dynamically changing workloads, provide isolation to application classes and is

stable under overload. Further, our technique learns online and does not require any a
priori training. Unlike other feedback-based models, an additional advantage of our tech-
nique is that it can easily handle complex non-linearity in the system behavior. We have
implemented our technique in the Linux kernel and evaluated it using prototype experi-
mentation and trace-driven simulations. Our results show that (i) the use of learning enables
the storage system to reduce the number of QoS violations by a factor of 2.1 and (ii) the im-
plementation overheads of employing such techniques in operating system kernels are small.

Overall, our work demonstrated the feasibility of using reinforcement learning techniques

for dynamic resource allocation in storage systems.


CHAPTER 5

AUTOMATED OBJECT REMAPPING FOR LOAD BALANCING LARGE SCALE STORAGE SYSTEMS

5.1 Introduction

In the last three chapters we looked at problems in the context of initial configuration

and short-term reconfiguration of storage systems. Some reconfiguration tasks need to be

executed infrequently and are necessitated by the aging of the storage system, long-term

workload changes, the need for growth, and so on. In this chapter, we focus on one such long-term

reconfiguration task.

Suitable initial placement obviates the need for frequent reconfiguration. Auto-
mated bandwidth allocation, which uses controlled request throttling, helps extract good
performance from the system in the face of transient workload changes. Persistent work-
load changes, however, which stress the storage system and result in hotspots, make it neces-
sary that the mapping of storage objects to arrays be tuned to ensure acceptable performance.

5.1.1 Motivation

As mentioned in Chapter 2, in storage systems, object placement—the mapping of stor-

age objects to storage devices—is crucial as it dictates the performance of the storage sys-

tem. Consequently, extreme care is taken during capacity planning and initial configuration

of such systems [9, 14].

Although the initial configuration may be load-balanced, over time, growth in storage

space usage and changes in workload can cause load imbalances and workload hotspots.

This in turn may necessitate a reconfiguration.


Hotspots in storage systems can occur for one of two reasons. Incorrect or insufficient
workload information during storage system configuration may result in heavily accessed
objects being mapped to the same set of storage devices, thus resulting in hotspots. Long-
term workload changes or the addition of a new object to a balanced system may also induce

hotspots1. Hotspots result in increased response times and a loss in throughput for ap-

plications accessing the storage system. When hotspots do occur, the mapping of objects

to storage devices needs to be revisited to ensure that the bandwidth utilization of all de-

vices is below a certain threshold so that applications see acceptable performance. Such a

reconfiguration is undesirable because it entails either downtime or a potential per-

formance impact on the applications accessing the storage system while the reconfiguration

is in progress.

Sophisticated enterprise storage sub-systems come with tools to facilitate the process

of load-balancing to address hotspots [2]. These allow for load-balancing to be either car-

ried out manually or in an automated fashion. For manual reconfiguration, administrators

use information from a workload analyzer component which collects performance data and

summarizes the load on the component storage devices. The tool also provides the potential

performance impact of moving an object so that the user can make an informed decision.

The automated load balancing component, on the other hand, is self-driven, runs continu-

ously, and uses the information from the workload analyzer to swap hot and cold objects

when necessary.

A drawback of the manual process is that it requires human oversight. Moreover, the
procedure can be error-prone and human errors during the reconfiguration process may
worsen performance. While an automated process addresses these drawbacks, a simple
approach which swaps hot and cold objects will only work in all cases if objects are of
similar size. If objects are of different sizes, then more sophisticated strategies are required.

1 Note that this can happen irrespective of whether the system is narrow-striped or wide-striped.


This motivates the need for more sophisticated approaches that search for a configuration

with no hotspots.

Moving the system to a new configuration involves executing a migration plan, which
is a sequence of object moves. The reconfiguration itself could be carried out either online
or offline. In both cases, the scale of the reconfiguration, i.e., the amount of data that needs

to be displaced, is of consequence. While for an offline reconfiguration the scale of the

reconfiguration determines the duration of the reconfiguration and hence the downtime, for

an online reconfiguration it determines the duration of performance impact on foreground

applications.

Existing approaches do not optimize for the scale of the reconfiguration, possibly mov-

ing much more data than required to remove the hotspot. This motivates the need for a

load-balancing approach that takes the sizes of objects and their current mapping to storage

devices into account. This is the subject matter of this chapter.

5.1.2 Research Contributions

In this chapter, we develop algorithms to minimize the amount of data displaced during

a reconfiguration to remove hotspots in large-scale storage systems.

Rather than identifying a new configuration from scratch, which may entail significant

data movement, our novel approach uses the current object configuration as a hint; the goal
being to retain most of the objects in place and thus limit the scale of the reconfiguration.
The key idea in our approach is to greedily displace excess bandwidth from overloaded
to underloaded storage devices. This is achieved in one of two ways: (i) displace, which
involves reassigning objects from overloaded devices to underloaded ones, and (ii) swap,

which involves swapping objects between overloaded and underloaded devices. The swap

step is useful when the spare storage space on the underloaded devices is insufficient to

accommodate any additional objects, and an object reconfiguration, short of a reconfigura-

105

Page 123: SELF-MANAGING TECHNIQUES FOR STORAGE RESOURCE …lass.cs.umass.edu/theses/vijay.pdf · Tyler Trafford has been most helpful with configuring the Lin ux cluster and the storage testbed

tion from scratch, would have to entail a swapping of objects, or groups of objects, between

storage devices.

To minimize the amount of data that needs to be moved we use the bandwidth-to-space
ratio (BSR) as a guiding metric. For example, by selecting high-BSR objects for reassign-

ment in the displace step, we are able to displace more bandwidth per unit of data moved.

Here, bandwidth (space) refers to the bandwidth (storage space) requirement of the storage

object. We propose various optimizations, including searching for multiple solutions, to

counter the pitfalls of a greedy approach.

We also describe a simple measurement-based technique for identifying hotspots and

for approximating per-object bandwidth requirements.

Finally, we evaluate our techniques using a combination of simulation studies and an

evaluation of an implementation in the Linux kernel. Results from the simulation study

suggest that for a variety of system configurations our novel approach reduces the amount

of data moved to remove the hotspot by a factor of two as compared to other approaches.

The gains increase with larger system sizes and overload magnitudes. Experimental results
from the prototype evaluation suggest that our measurement techniques correctly identify
workload hotspots. For some simple overload configurations considered in the prototype,
our approach identifies a load-balanced configuration which minimizes the amount of data

moved. Moreover, the kernel enhancements do not result in any noticeable degradation in

application performance.

The rest of the chapter is structured as follows. In Section 5.2, we describe the prob-

lem addressed in this chapter. Section 5.3 presents object remapping techniques for load-

balancing large scale storage systems. Section 5.4 presents the methodology used for mea-

suring object bandwidth requirements and for identifying hotspots. Section 5.5 presents the

details of our prototype implementation and Section 5.6 presents the experimental results.
Section 5.7 discusses related work, and finally, Section 5.8 presents our conclusions.


5.2 Problem Definition

5.2.1 System Model

Large-scale storage systems consist of a large number of disk arrays. We assume,

as is typically the case, that each disk array consists of disks of the same type. Different

disk arrays, however, could have disks of different types. The disks in a disk array are

grouped into some number of logical units (LUs); an LU is a set of disks combined using

RAID techniques [45].

An object configuration indicates the mapping of storage objects to storage devices.

Here, an object is an equivalent of a logical volume (LV), such as a database or a file
system, and is allocated storage space by concatenating space from one or more LUs. From
here on, we use the terms LV and object interchangeably.
In our model, we assume that all the LUs an LV is striped over are similar, i.e., they
have the same RAID level and comprise disks of the same type. This is generally true in
practice, since it ensures the same level of redundancy and similar access latency for all the
stripe units of an LV. We further make the simplifying assumption that if any two LVs have
an LU in common, they have all their component LUs in common. This assumption is also

generally true in well-planned storage system configurations, as in such a configuration

each object is subject to uniform inter-object workload interference on all of its component

LUs. With this assumption, the set of LUs an LV is striped over can be thought of as a single
logical device for load balancing purposes. From here on, we refer to such a logical device
as a logical array, or array for short. Figure 5.1 illustrates the system model.

5.2.2 Problem Formulation

Assuming the above system model, let us now formulate the problem addressed in this

chapter. Consider a storage system which consists of n arrays, A_1, A_2, ..., A_n. There are m
LVs, L_1, L_2, ..., L_m, which populate the storage system. Each LV is mapped to a single array.
Each array A_j has storage capacity S_j and bandwidth capacity B_j. Similarly, each LV


Figure 5.1. System model. The figure shows two disk arrays, each comprising four LUs; each LU consists of five disks. The disk array on the top comprises one logical array over which three LVs have been striped. The second disk array comprises two logical arrays, each comprising two LUs, with three LVs striped over each logical array. (Figure labels: Disk Arrays, Logical Units, Logical Arrays, Logical Volumes.)


L_i has storage requirement s_i and bandwidth requirement b_i. For a balanced configuration,
which is defined to be a configuration without any hotspots, it is required that the percentage
bandwidth utilization of each array A_j not exceed some threshold ρ_j (0 < ρ_j < 1). The
space and bandwidth constraints on an array A_j are given by the following equations:

    Σ_i s_i φ_ij ≤ S_j        (5.1)

    Σ_i b_i φ_ij ≤ ρ_j · B_j        (5.2)

Here, φ_ij is a mapping parameter that denotes whether object i is mapped to array j: φ_ij
equals 1 if array j holds object i, and is 0 otherwise. Although the space constraint is a
hard constraint and cannot be violated, an array may observe a violation of the bandwidth
constraint if the bandwidth requirements of the objects mapped to the array increase. If
the bandwidth utilization on an array exceeds the corresponding bandwidth threshold, it is
considered overloaded, otherwise it is underloaded.

Moving the system to a new configuration results in a change of mapping parameters.
Let φ_ij^old and φ_ij^new denote the mapping parameter for LV i on array j in the old and new
configurations, respectively. If |x| denotes the absolute value of x, then Σ_j |φ_ij^new − φ_ij^old| = 2
if the mapping of object i has changed, and is equal to 0 otherwise. The cost of the
reconfiguration, defined as the amount of data moved to realize the new configuration, is
then given by:

    Cost = Σ_i Σ_j |φ_ij^new − φ_ij^old| · s_i / 2        (5.3)

Let O be the set of overloaded arrays in a configuration. The bandwidth violation for an
overloaded array j is (Σ_i b_i φ_ij / B_j − ρ_j). The cumulative bandwidth violation, defined as
the sum of the bandwidth violations over all overloaded arrays, is then given by:

    Overload = Σ_{j ∈ O} (Σ_i b_i φ_ij / B_j − ρ_j)        (5.4)


Given an object configuration with some overloaded arrays and some underloaded arrays,

the goal is to identify a balanced configuration which can be realized at the least cost. Given

two new configurations, both of which satisfy the space and bandwidth constraints on all

arrays, the one that can be realized at a lower cost is preferable.
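For concreteness, a sketch of the quantities defined above: the reconfiguration cost of Equation 5.3 and the cumulative overload of Equation 5.4, computed from a hypothetical 0/1 mapping matrix phi[i][j] (object i on array j). This is illustrative and not the dissertation's code.

    # Illustrative sketch of Equations 5.3 and 5.4; phi_old/phi_new are lists of
    # lists of 0/1 mapping parameters, sizes/bw give per-object s_i and b_i, and
    # capacity/threshold give per-array B_j and rho_j.
    def reconfiguration_cost(phi_old, phi_new, sizes):
        """Amount of data moved: each remapped object contributes its size s_i."""
        return sum(
            sizes[i] * sum(abs(phi_new[i][j] - phi_old[i][j])
                           for j in range(len(phi_old[i]))) / 2
            for i in range(len(sizes))
        )

    def cumulative_overload(phi, bw, capacity, threshold):
        """Sum of (utilization - threshold) over arrays that exceed their threshold."""
        overload = 0.0
        for j in range(len(capacity)):
            util = sum(bw[i] * phi[i][j] for i in range(len(bw))) / capacity[j]
            if util > threshold[j]:
                overload += util - threshold[j]
        return overload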

For cases where a balanced configuration cannot be found, the goal of load balancing is
a policy decision. One may require that Overload (equation 5.4) be minimized, but that, when
displacing excess bandwidth from overloaded arrays, the bandwidth constraint on the un-
derloaded arrays not be violated and the utilization on an already overloaded
array not increase further. In some cases, absolute load balancing may be desir-
able, thus requiring that the maximum percentage bandwidth violation across all arrays be
minimized. We refer to the approaches that adhere to the former policy as fair, and
the ones that conform with the latter as absolute. Another dimension in this context is the

cost. Absolute load balancing may incur a significantly higher cost. A complete evaluation

of the tradeoffs of gains (balance achieved) versus cost (amount of data moved) of these

policies is beyond the scope of this thesis.

In this chapter, the goal is to design a reconfiguration algorithm for identifying a bal-

anced configuration which has the least cost.

5.3 Object Remapping Techniques

There are two kinds of approaches to load balancing: (i) those that reconfigure from

scratch, and (ii) those that start with the current configuration and aim to minimize the cost

of reconfiguration. We refer to the former class of approaches as cost oblivious and the
latter as cost aware.
In the following, an assignment of an object to an array is said to be valid if the new

object could be accommodated on the array without any constraint violations (equations

5.1 and 5.2).
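A minimal sketch of this validity test, checking the space and bandwidth constraints of equations 5.1 and 5.2 for a candidate assignment (the field names are hypothetical, not taken from the prototype):

    # Illustrative sketch: an assignment is "valid" if adding the object violates
    # neither the space constraint (Eq. 5.1) nor the bandwidth constraint (Eq. 5.2).
    def is_valid_assignment(obj, arr):
        """obj and arr are dicts with hypothetical space/bandwidth fields."""
        fits_space = arr["used_space"] + obj["size"] <= arr["space_cap"]
        fits_bw = arr["used_bw"] + obj["bw"] <= arr["threshold"] * arr["bw_cap"]
        return fits_space and fits_bw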


5.3.1 Cost Oblivious Object Remapping

In this section, we present two cost-oblivious object remapping algorithms to remove

hotspots in large scale storage systems. We first present a randomized algorithm, and then

another, which is deterministic in nature.

5.3.1.1 Randomized Packing

Heuristics based on best-fit bin-packing have been used in [9] for initial storage system

configuration. There the goal was to identify a configuration which uses the least number

of devices to meet the space and bandwidth requirements of a given set of objects. In our

problem, the number of devices is a given, and the goal is to identify a valid packing which

can be realized at the least cost. We first present a randomized algorithm and then present

two variations of the same.

Initially, all the objects are unassigned. A random permutation of the objects is created,

and the objects are assigned to arrays picked at random from the set of all arrays. All

arrays may need to be tried for an object in the worst case. If all the objects could be

validly assigned to some array, we have a balanced configuration. The procedure could be

repeated multiple times, with different permutations of objects, and, of the multiple trials that
result in a balanced configuration, the one with the least cost is chosen. Note that this makes
the approach semi cost aware.

As opposed to the completely randomized approach, where both the objects and the

arrays are chosen randomly, two partly randomized variants of interest are described next.

In a best-fit version, of all possible valid assignments for an object, the object is assigned to

the array such that the new bandwidth utilization across all arrays as a result of this assignment
is a maximum. A complementary approach is also possible, worst fit, where, of all possible
valid assignments, the array picked is such that the new bandwidth utilization across all ar-
rays as a result of this assignment is a minimum. Whereas best-fit may fare better in finding
a balanced configuration in bandwidth-constrained scenarios, worst fit may yield a config-


uration with similar bandwidth utilization on all arrays. Consequently, in less bandwidth-
constrained scenarios, when the arrays have utilization values well below their correspond-
ing threshold, worst fit may be advantageous, since with more headroom, arrays can absorb

workload variations better.
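The randomized packing procedure and its best-fit/worst-fit variants might be sketched as follows (illustrative only; objects and arrays are plain dicts with hypothetical fields, and a real implementation would repeat each trial on copies of the state):

    import random

    # Illustrative sketch of randomized packing with best-fit / worst-fit variants.
    def valid(obj, arr):
        return (arr["used_space"] + obj["size"] <= arr["space_cap"] and
                arr["used_bw"] + obj["bw"] <= arr["threshold"] * arr["bw_cap"])

    def util_after(arr, obj):
        return (arr["used_bw"] + obj["bw"]) / arr["bw_cap"]

    def randomized_packing(objects, arrays, variant="random"):
        """Return {object id: array id}, or None if this trial finds no balanced config."""
        assignment = {}
        for obj in random.sample(objects, len(objects)):        # random permutation
            candidates = [a for a in arrays if valid(obj, a)]
            if not candidates:
                return None
            if variant == "best_fit":                           # tightest resulting packing
                target = max(candidates, key=lambda a: util_after(a, obj))
            elif variant == "worst_fit":                        # most headroom left
                target = min(candidates, key=lambda a: util_after(a, obj))
            else:
                target = random.choice(candidates)
            target["used_space"] += obj["size"]
            target["used_bw"] += obj["bw"]
            assignment[obj["id"]] = target["id"]
        return assignment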

5.3.1.2 BSR-based Approach

Bandwidth-to-Space Ratio (BSR) has been used as a metric for video placement [24].

These derive from the heuristics based on value per unit weight used for knapsack prob-

lems. The knapsack heuristic involves greedily selecting items ordered by their value per

unit weight in order to maximize the value of the items in the knapsack. The approach

described next uses BSR as a guiding metric, but as explained later, for slightly different

reasons.

The BSR of an object is defined to be the ratio of its bandwidth requirement to its
space requirement. We define the spare BSR of an array as the ratio of its spare bandwidth
capacity to its spare space capacity. So, the spare BSR of an array is a dynamic quantity
which depends on the objects currently assigned to it.
Initially, all the objects are unassigned. Objects are picked in order of their BSR from the
set of all objects and assigned to arrays picked in order of their spare BSR from the set of all
arrays. If a valid assignment is found for all the objects, we have a balanced configuration.
Note that the spare BSR of the array an object is assigned to is updated appropriately after
each valid assignment.
The intuition behind using BSR as a metric is that assigning high-BSR objects to arrays
with a high spare BSR possibly results in a better utilization of bandwidth per unit space
in the system, and hence a tighter packing. A tighter packing increases the likelihood of

finding a balanced configuration.
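A compact sketch of this BSR-guided packing (illustrative; it assumes objects are taken in descending BSR order and arrays in descending spare-BSR order, which matches the intuition above but is an assumption about the exact ordering):

    # Illustrative sketch of BSR-guided packing with hypothetical dict fields.
    def bsr(obj):
        return obj["bw"] / obj["size"]

    def spare_bsr(arr):
        spare_space = arr["space_cap"] - arr["used_space"]
        spare_bw = arr["threshold"] * arr["bw_cap"] - arr["used_bw"]
        return spare_bw / spare_space if spare_space > 0 else float("-inf")

    def bsr_packing(objects, arrays):
        """Assign every object to some array, guided by BSR; None if packing fails."""
        assignment = {}
        for obj in sorted(objects, key=bsr, reverse=True):
            for arr in sorted(arrays, key=spare_bsr, reverse=True):
                if (arr["used_space"] + obj["size"] <= arr["space_cap"] and
                        arr["used_bw"] + obj["bw"] <= arr["threshold"] * arr["bw_cap"]):
                    arr["used_space"] += obj["size"]
                    arr["used_bw"] += obj["bw"]
                    assignment[obj["id"]] = arr["id"]
                    break
            else:
                return None          # no valid array found for this object
        return assignment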


5.3.2 Cost-aware Object Remapping

In this section, we present two cost-aware algorithms for searching for a balanced configu-

ration. The first of these is a randomized algorithm and the second is a deterministic greedy

algorithm. Both approaches start with the current configuration and change the mapping

of the objects incrementally until a balanced configuration is achieved. Thus, these ap-
proaches use the current configuration as a hint, and aim to retain most of the objects in

place, possibly resulting in a lower cost of reconfiguration.

5.3.2.1 Randomized Object Reassignment

This approach is similar in principle to the randomized approach described in Section

5.3.1.1, except it starts with the current configuration. Given the current configuration,

a random permutation of the objects on all the overloaded arrays is created. These objects
are then assigned, in order, to underloaded arrays picked at random from the set of all
underloaded arrays. It is possible that all the underloaded arrays need to be tried before a
valid assignment is found for an object. This is done until a fraction frac of the objects have

been considered, or the system has reached a balanced configuration.

Once an overloaded array becomes underloaded, the objects on the now underloaded

array are not considered for reassignment. The formerly overloaded array is now treated as
underloaded for load balancing purposes. This procedure could be repeated multiple times,
with different permutations, and, of the multiple trials that result in a balanced configuration,

the one with the least cost is chosen.

Again, as opposed to a completely randomized approach, there is a best-fit and a worst

fit variant of the algorithm. The variants are similar to those described for the approach in

Section 5.3.1.1.

Drawbacks: In Section 5.3.1 we presented two approaches which did not take the current

configuration into account while searching for a balanced configuration and are typically

associated with a large cost of reconfiguration. However, they are useful for initial storage


system configuration. The randomized object reassignment approach described above starts

with the current configuration. This approach, however, also has two drawbacks:

• Possibly high reconfiguration cost: In this approach, the object which is to be reassigned
is picked at random. Since the search is not exhaustive, it can still result in a

large amount of data being moved or may fail to find a balanced configuration.

Even though an exhaustive search is not feasible, choosing an object as well as the

array to which it is to be assigned carefully, taking into account their respective space

and bandwidth attributes, could be beneficial.

• Simple reassignment: If the storage system does not have the right combination of
spare space and spare bandwidth on the constituent arrays, a simple reassignment

of objects may not yield a balanced configuration. Barring a reconfiguration from

scratch, which may entail a high cost, a low cost reconfiguration in such scenarios

would have to involve swapping objects, or groups of objects, between arrays. The
diverse space and bandwidth requirements of the objects, coupled with the diverse space
and bandwidth constraints on the arrays that comprise the storage system, make this

non-trivial.

In the following section, we present an approach to address these drawbacks.

5.3.2.2 Displace and Swap

The key idea in this approach is to greedily displace excess bandwidth from overloaded
arrays to underloaded arrays. The goal is to identify a set of objects, while taking into
account their sizes, that need to be moved from their original location in order to attain a
balanced configuration. BSR is used as a guiding metric in order to minimize the amount

of data that needs to be displaced. Here, by object size we refer to the storage space

requirement of an object.

There are two basic steps which comprise this approach. The first is referred to as

displace and involves reassigning objects from overloaded arrays to underloaded arrays.


The second step, referred to as swap, involves swapping objects between overloaded and

underloaded arrays. The second step is invoked only if the first step alone does not yield

a balanced configuration. The goal is to first offload as much excess bandwidth as possible from an
overloaded array using one-way object moves (displace), and, if this does not suffice, search
for two-way object moves (swap). The intuition is that one-way object moves, on the
average, would require less data movement than a solution involving two-way object moves.
One-way object moves are also preferable to two-way object moves as they do not require

any scratch space2 to achieve the reconfiguration.

Displace: In this step, the goal is to use any spare space on the underloaded arrays

to accommodate objects from overloaded arrays and thus offload excess bandwidth. Only

underloaded arrays with spare space are considered as potential destinations during object

reassignment.

Since the goal is to remove excess bandwidth from each overloaded array while moving

the least amount of data, we consider objects from each overloaded array one by one. This

allows us to optimize for the amount of data displaced from each overloaded array.

The overloaded arrays themselves could be considered in any order. To achieve a bal-
anced configuration, the bandwidth utilization on all the overloaded arrays needs to be
reduced below the corresponding threshold. So, we consider the overloaded arrays in de-
scending order of the magnitude of bandwidth violation (Σ_i b_i φ_ij − ρ_j · B_j). This has the

advantage that if the displace step is unable to identify a balanced configuration, there is

less bandwidth that needs to be moved off each overloaded array, on the average, in the

swap step.

Finally, for a given overloaded array, objects on the array are considered for reassign-

ment in descending order of their BSR. This is in order to minimize the amount of data
displaced, as, of all objects, the object with the maximum BSR displaces the most band-

2 Swapping objects between arrays with little spare storage space may require using scratch storage space.


width per unit of data moved. The destination underloaded array for reassigning an object

is chosen to be the one with the maximum spare BSR. The reason is similar to that for the

approach in Section 5.3.1.2. This completes the essence of the displace step.
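A condensed sketch of the displace step described so far (illustrative Python; it omits the soloSoln/grpSoln selection and the optimizations discussed next, and the dict fields are hypothetical):

    # Illustrative sketch of the displace step: overloaded arrays are visited in
    # descending order of bandwidth violation; on each, objects are tried in
    # descending BSR order and reassigned to the underloaded array with the
    # highest spare BSR that can validly hold them.
    def displace(arrays, objects_on):
        """objects_on maps an array id to the list of objects currently on it."""

        def violation(arr):                        # bandwidth above the allowed threshold
            return arr["used_bw"] - arr["threshold"] * arr["bw_cap"]

        def spare_bsr(arr):                        # spare bandwidth per unit spare space
            spare_space = arr["space_cap"] - arr["used_space"]
            spare_bw = arr["threshold"] * arr["bw_cap"] - arr["used_bw"]
            return spare_bw / spare_space if spare_space > 0 else float("-inf")

        def valid(obj, arr):                       # Eqs. 5.1 and 5.2
            return (arr["used_space"] + obj["size"] <= arr["space_cap"] and
                    arr["used_bw"] + obj["bw"] <= arr["threshold"] * arr["bw_cap"])

        moves = []
        for over in sorted([a for a in arrays if violation(a) > 0],
                           key=violation, reverse=True):
            for obj in sorted(objects_on[over["id"]],
                              key=lambda o: o["bw"] / o["size"], reverse=True):
                if violation(over) <= 0:
                    break                          # hotspot on this array removed
                for under in sorted([a for a in arrays if violation(a) <= 0],
                                    key=spare_bsr, reverse=True):
                    if valid(obj, under):
                        over["used_space"] -= obj["size"]
                        over["used_bw"] -= obj["bw"]
                        under["used_space"] += obj["size"]
                        under["used_bw"] += obj["bw"]
                        moves.append((obj["id"], over["id"], under["id"]))
                        break
        return moves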

Object reassignments that remove the hotspot on an overloaded array could be single-

object or multi-object. Any valid single object reassignment that can remove the hotspot

on the overloaded array is referred to as a soloSoln. Any reassignment comprising multiple
objects that removes the hotspot is referred to as a grpSoln. Any reassignment comprising
one or more objects that is not able to remove the hotspot is referred to as a semiSoln. We
refer to both grpSoln and semiSoln as soln for short.

It is possible that choosing objects for reassignment strictly in order of BSR, as described
above, results in a soloSoln appearing as a part of a grpSoln. So, we identify all soloSolns
before searching for grpSolns.

• Identifying a soloSoln: Any object on the overloaded array that can be validly

assigned to some underloaded array, and also removes the hotspot, classifies as a

soloSoln. Any object that can be validly assigned, but does not remove the hotspot,
is put in a set R. The set R, which is devoid of soloSolns, is used to identify grpSolns

in the next step.

A minor optimization is possible here. If the set R consists of objects all of which
are larger than the smallest-size soloSoln, there is no need to execute the following.
This is because any grpSoln would only have a higher cost.

• Identifying a soln: In this step we search for a grpSoln using BSR as the guiding
metric. Given a set R of objects, objects picked in descending order of BSR are
assigned to underloaded arrays chosen in descending order of spare BSR. This is done

until either all the objects on the overloaded array have been considered, or the set of

reassignments so far is able to remove the hotspot. If the hotspot could be removed,

we have a grpSoln, else we have a semiSoln.


In the above step, for identifying a soln, the objects were selected greedily based on their
BSR. However, such a greedy approach could make some wrong choices. These could result
in (i) a higher-cost solution, or (ii) an inability to remove the hotspot on the overloaded array.

While an exhaustive search is infeasible, the following optimizations try to address at least

some of the wrong choices.

These optimizations essentially involve questioning the choice of each object that com-

prises the soln. Any soln can be thought of as comprising two parts: one, the
highest-BSR object, referred to as the root; all the remaining objects in the soln, if any,
comprise the second part. Whereas the first optimization questions the choice of the root,
the second questions the choice of each of the remaining objects that comprise the soln.

Also, while improving a grpSoln requires finding another with a lower cost, improving
a semiSoln means finding a grpSoln or another semiSoln which displaces more bandwidth.

• Optimization 1: Identifying multiple solns. In this optimization, we identify solns
with different elements of the set R as the root. Note that for a given root only objects
with a lower BSR than the root are considered for reassignment. This optimization
gives us multiple solns. The number of such solns equals the number of objects in
the set R.

• Optimization 2: Backtracking. This optimization involves backtracking on the
remaining objects that comprise a soln. We employ this optimization to improve each
of the solns identified in the optimization above. Each backtracking step involves
searching for a new soln while not considering an object that is part of the soln. This
is done for all the objects that comprise the soln excepting the root (the root has been
optimized for in the previous step).

If backtracking results in a better soln, backtracking on the previous soln is discon-
tinued and restarted for this new soln. It is possible that this procedure continues to


yield successively better solns. To limit the computational costs we explore only a

constant number of these.

Note that the above optimizations result in a strategy which lies somewhere between a
purely greedy approach and one that exhaustively considers every combination.
If the above results in multiple grpSolns or soloSolns, the one with the least cost is
chosen3. If, however, the above only results in object reassignments which reduce the
bandwidth violation on the overloaded array (i.e., only semiSolns), the one which displaces
the most bandwidth is chosen4. The mapping parameters (the φ_ij values) for the objects to be re-

assigned are adjusted appropriately. Note that this modified configuration serves as the

starting configuration for the next overloaded array considered.

If, after all the overloaded arrays have been considered, the system is still not bal-
anced, the swap step, which is described next, is invoked.

Swap: Displace works only when there is sufficient storage space on the underloaded

arrays to accommodate objects from the overloaded arrays. In the absence of sufficient

spare space, a low-cost reconfiguration technique would require swapping objects between
arrays. Such swaps could be two-way, i.e., involve two arrays, or they could be multi-way.

In this chapter, we describe a strategy for identifying two-way swaps.

In this step, the goal is to identify valid swaps of objects, or groups of objects, between

overloaded and underloaded arrays, such that the bandwidth utilization on the overloaded

array is reduced. By successively identifying such swaps, we can remove the hotspot on an

overloaded array.

BSR is again used as the guiding metric. By swapping high-BSR objects on an overloaded
array with low-BSR objects on an underloaded array, maximum bandwidth is displaced per

unit of data moved.

3 Ties are broken by choosing the one which displaces the least bandwidth, as this leaves more spare bandwidth on the underloaded array to accommodate future object moves.

4 In this case, ties are broken by choosing the semiSoln with the least cost.


Swaps are searched for between pairs consisting of an overloaded array and an underloaded array.
Since all arrays need to be underloaded for a balanced configuration, overloaded arrays are

considered in descending order of the magnitude of bandwidth violation. For each over-

loaded array, underloaded arrays are considered in descending order of spare bandwidth.

This is done so that as much bandwidth as possible is displaced for each pair considered.
The diverse space and bandwidth attributes of the objects and arrays make identifying
valid swaps non-trivial, so we use a simple greedy approach guided by the BSR of the

objects. Before we describe how a swap is identified, let us define what classifies as a valid

swap.

• Valid swap: A swap is valid if it does not violate the constraints on the un-
derloaded array and decreases the bandwidth utilization on the overloaded array. It
is not useful, however, if this decrease is not significant. So, we define a parameter ufrac
which quantifies the utility of a swap. Let bw_O and bw_U be the cumulative band-
width requirements of the sets of objects from the overloaded and underloaded array,
respectively, which are to be swapped. Then for a swap to be valid we require that:

    bw_O − bw_U ≥ ufrac · bw_U        (5.5)

In other words, the decrease in bandwidth on the overloaded array as a fraction of the

bandwidth moved off the underloaded array should exceed a certain minimum.

We classify the constraints that need to be satisfied for a swap to be valid as follows:

• Constraint C1: The swap should satisfy the bandwidth and space constraints on the
underloaded array.

• Constraint C2: The swap should have a certain minimum utility (equation 5.5).

• Constraint C3: The swap should satisfy the space constraint on the overloaded array.
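A minimal sketch of the validity test for a candidate swap, combining constraints C1-C3 with the utility check of equation 5.5 (illustrative; the field names are hypothetical and follow the earlier sketches):

    # Illustrative sketch: check whether swapping a group of objects from an
    # overloaded array (objs_o) with a group from an underloaded array (objs_u)
    # is a valid swap under constraints C1-C3 and the utility test of Eq. 5.5.
    def valid_swap(over, under, objs_o, objs_u, ufrac=0.1):
        bw_o = sum(o["bw"] for o in objs_o)        # bandwidth leaving the overloaded array
        bw_u = sum(o["bw"] for o in objs_u)        # bandwidth leaving the underloaded array
        sp_o = sum(o["size"] for o in objs_o)
        sp_u = sum(o["size"] for o in objs_u)

        # C1: space and bandwidth constraints on the underloaded array after the swap
        c1 = (under["used_space"] - sp_u + sp_o <= under["space_cap"] and
              under["used_bw"] - bw_u + bw_o <= under["threshold"] * under["bw_cap"])
        # C2: minimum utility (Eq. 5.5): bw_o - bw_u >= ufrac * bw_u
        c2 = bw_o - bw_u >= ufrac * bw_u
        # C3: space constraint on the overloaded array after the swap
        c3 = over["used_space"] - sp_o + sp_u <= over["space_cap"]
        return c1 and c2 and c3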


We now describe the approach for identifying a valid swap.

• Identifying a valid swap: While simply considering all pairs of objects, one each

from an overloaded array and an underloaded array at a time may not result in any

valid swap, considering every combination of objects from the two arrays is infea-

sible. We present a simple greedy approach to swap the equivalent of a high-BSR
object from the overloaded array with the equivalent of a low-BSR object from the

underloaded array. Identifying such a swap also displaces more bandwidth per unit

of data moved.

To identify such a swap, objects on the overloaded and underloaded arrays are sorted

in descending and ascending order of BSR, respectively, to give the ordered sets LO_lv and
LU_lv. First, pairs of objects from these two ordered sets are considered: each
object in LU_lv is considered, in order, for each object in LO_lv, in order. If a pair meets
the constraints for a valid swap, the objects are swapped.

If after considering all pairs the array is still overloaded, we seek to identify contigu-

ous sets of objects from these two ordered sets which constitute a valid swap. Note
that these contiguous sets of objects are the equivalent of a high-BSR and a low-BSR
object, respectively.
Ideally, it is desirable that contiguous sets of objects from these two sets be identi-

fied. However, it is possible that no such sets can be identified that satisfy all the

constraints for a valid swap. So, in the procedure described below we first (step 1)
identify contiguous sets that satisfy two of the constraints; if these contiguous sets do
not satisfy the third constraint, possibly non-contiguous objects are picked in order

to meet the constraint (step 2).

Let LO_sw and LU_sw denote the sets of objects from the overloaded and underloaded

arrays, respectively, that are to be swapped.


– Satisfy C1 and C2: Contiguous elements from the ordered sets LO_lv and LU_lv,
respectively, are incrementally added to the sets LO_sw and LU_sw, respectively, until
C1 and C2 have been satisfied. This gives a valid swap if C3 is also satisfied.

– Satisfy C3: If C3 has not been satisfied, additional objects from the set LO_lv,
picked in order, are added to the set LO_sw; an object is added only if it does not
result in a violation of C1 or C2. Objects are added until C3 has been satisfied.
This may result in LO_sw being comprised of non-contiguous elements from the
ordered set LO_lv.

– Given a valid swap, the ordered sets LO_sw and LU_sw are updated to reflect the
swap.

– If a valid swap was not found, the above steps are repeated but now with the
second element in the ordered set LO_lv as the first element added to the set LO_sw, and
so on.

– Swaps are searched for until the hotspot on the overloaded array has been removed.

If after executing this step, there are no overloaded arrays, we have a balanced con-

figuration. Note that this simple greedy approach for swapping contiguous sets of objects

between two arrays may be sub-optimal; however, the parameter ufrac allows some control

over the utility of a swap. Figure 5.2 and the following example together illustrate displace

and swap.

Example: The figure illustrates how displace and swap work. Figure (a) shows two ar-

rays with bandwidth utilizations of 100% and 40%, respectively. Each box with a number

indicates an object and an empty box indicates unallocated space. The number in a box

indicates the bandwidth requirement of the object. For simplicity, all objects are assumed

to be of unit size; so the bandwidth requirement of an object is also its BSR. The bandwidth

overload threshold ρ is assumed to be 75% for both the arrays. As Array 1 is overloaded, the
displace and swap algorithm proceeds as follows. The displace step is invoked first as the


Figure 5.2. Illustration of Displace and Swap. (a) Initial configuration: Array 1 at 100% utilization, Array 2 at 40%. (b)-(c) Displace: an object with BSR 20 moves from Array 1 to Array 2, leaving the arrays at 80% and 60% utilization. (d)-(e) Swap: an object with BSR 10 on Array 1 is swapped with an object with BSR 1 on Array 2, after which both arrays are at roughly 70% utilization.

underloaded array has one unit spare space. Figures (b) and (c) illustrate an object being

moved from Array 1 to Array 2. The object selected is the one with a BSR of 20. The object

with BSR 70 could not be accommodated on the underloaded array due to bandwidth con-

straints. After the displace step, since Array 1 is still overloaded, the swap step is invoked.

Figures (d) and (e) illustrate an object with BSR 10 being swapped with an object with

BSR 1. Note that first, pairs of objects are considered; the object with BSR 70 on Array 1

could not be swapped with any object on Array 2 without any constraint violations. Since
both the arrays are now underloaded, the algorithm terminates.

5.4 Measuring Bandwidth Requirements and Detecting Hotspots

In the previous section, we presented techniques for identifying a balanced configura-

tion. The techniques assume that the bandwidth requirement of the objects is known and so

a hotspot can be identified. In this section, we describe techniques for measuring bandwidth

requirements of objects and detecting hotspots in a real storage system.


Measuring Bandwidth Requirements: Whereas the space requirement of an object

is fixed at object creation time5, the bandwidth requirement of an object depends on the

current workload. Unless the workload access pattern to the object is well characterized and
known a priori throughout the lifetime of the object, its bandwidth requirement needs to

be inferred based on online measurements. We use a simple measurement-based technique

to approximate the bandwidth requirement of each object.

Recall that each object is assumed to be striped across some number of LUs in a logical

array. Given the request size (in sectors) and the first logical sector requested for each

request, one can infer the number of disks accessed. Note that the number of disks accessed

is upper bounded by the number of disks which comprise the logical array. This technique

requires that the number of disks each object is striped over and the RAID level of the
component LUs be known. Given the average latency and transfer rate for the underlying
disk, if a request req results in IOCount_req independent disk accesses and SectorCount_req
is the number of sectors requested, the percentage bandwidth utilization of a logical array
over a time window I due to accesses to object L_i is given by:

    Σ_{req ∈ (I, L_i)} (IOCount_req · (t_seek + t_rot) + SectorCount_req / r_xfr) / (I · numDisks)        (5.6)

Here, the summation is over req ∈ (I, L_i), i.e., requests that accessed object L_i and com-
pleted in the time window I. numDisks is the number of disks in the underlying logical array.
t_seek, t_rot and r_xfr are the average seek time, average rotational latency and average transfer
rate, respectively, for the underlying storage device. The above expression computes the
disk-head busy time per unit time per disk due to requests accessing object L_i in a time
duration I, thus giving the array utilization due to accesses to the object.

⁵ Note that the space requirement refers to the size of the corresponding logical volume and not the actual storage space in use. Moreover, extending a logical volume is an infrequent operation and a consequent change in space requirement is easily accommodated.
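For illustration, the following Python sketch computes this estimate from a list of completed requests. The record layout, the function name, and the default disk parameters are assumptions made for the example; only the structure of the computation follows equation 5.6.

    # Sketch of the per-object utilization estimate of equation 5.6 (illustrative only).

    def object_utilization(requests, interval_secs, num_disks,
                           t_seek=0.005, t_rot=0.003, r_xfr=2.0e5):
        """Estimate the fraction of array bandwidth consumed by one object.

        requests      -- completed requests for the object in the interval,
                         each an (io_count, sector_count) pair
        interval_secs -- length I of the measurement interval (seconds)
        num_disks     -- number of disks in the underlying logical array
        t_seek, t_rot -- average seek time and rotational latency (assumed values, seconds)
        r_xfr         -- average transfer rate in sectors per second (assumed value)
        """
        busy_time = sum(io_count * (t_seek + t_rot) + sector_count / r_xfr
                        for io_count, sector_count in requests)
        return busy_time / (interval_secs * num_disks)

Summing this figure over all objects mapped to an array yields the array utilization used for hotspot detection below.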


We use this utilization figure as a measure of the bandwidth requirement of an object. Note that this is the perceived bandwidth requirement of the object and assumes that the workload accessing the object is able to express itself in the presence of inter-object interference, i.e., accesses to other objects on the same logical array. Moving the object to a similar array with less load may result in a different bandwidth utilization.

A limitation of our approach is that it works only for similar logical arrays. An approach used in practice is the IOPS measure for characterizing object bandwidth requirements and array bandwidth capacity. A limitation, however, of such a characterization is that it implicitly assumes a basic transfer size or amount of data accessed per IO. For objects with different stripe unit sizes mapped to the same array, such a technique may not be accurate. Our approach does not have this drawback.

Identifying Hotspots: The above technique gives the bandwidth utilization on an array due to an object mapped to it. The bandwidth utilization of the logical array can now be approximated as the summation of the bandwidth utilizations of all the objects mapped to the array. An array is overloaded if its bandwidth utilization exceeds a certain threshold (equation 5.2).

An approach which offers flexibility in defining a hotspot is one using percentiles. The bandwidth utilization is averaged over an interval I, and an overload is signaled if a percentile (perc) utilization over the ⌊W/I⌋ samples in a time window W exceeds the threshold. Since the utilization for each logical volume is computed separately (equation 5.6), one can compute this percentile for each logical volume and use their summation as the measure of the bandwidth utilization of the array.

5.5 Implementation Considerations

We have implemented our techniques in the Linux kernel version 2.6.11. Our prototype

consists of two components: (i) kernel hooks to monitor IO completions for each logical

volume, and (ii) a user space reconfiguration module which uses statistics collected in the


kernel to estimate bandwidth requirements, computes a new configuration if a hotspot is

detected, and migrates the requisite LVs appropriately.

Our prototype was implemented on a Dell PowerEdge server with two 933 MHz Pentium III processors and 1 GB of memory, running Fedora Core 2.0. The server contains an Adaptec 3410S U160 SCSI RAID controller card that is connected to two Dell PowerVault disk packs comprising 20 disks altogether; each disk is a 10,025 rpm Ultra-160 SCSI Fujitsu MAN3184MC drive with 18 GB of storage.

The kernel portion of the code involved adding appropriate code and data structures to enable collecting statistics for each LV. The 2.6 kernel uses bio as the basic descriptor for IOs to a block device. On IO completion, the routine bio_endio is invoked by the device interrupt handler. It is here that we do the bookkeeping for each LV separately. This is facilitated as each LV created using the Linux logical volume manager (LVM) has a separate device identifier; the device identifier for which the IO was performed is available in the bio descriptor.

The user-space reconfiguration module makes a system call periodically to query the statistics from the kernel. The statistics are namely the sectorCount and IOCount (see Section 5.4), which are used to approximate the bandwidth requirement of an LV. The system call also automatically resets the kernel statistics. We also provide two additional system calls which allow selective enabling and disabling of statistics collection for an LV. Statistics collection is enabled by default for an LV when it is activated (in LVM terminology), and is thus registered with the kernel. Deactivating an LV automatically disables the statistics collection for that LV. Finally, note that the implementation involved using appropriate kernel synchronization primitives since the same data structure is accessed by the user-space reconfiguration module (via system calls) when querying statistics and by the device interrupt handler on an IO completion. A separate synchronization primitive was employed for each logical volume to improve concurrency.
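The sketch below is a rough user-space analogue (in Python, for illustration only) of this per-LV bookkeeping and locking scheme; it is not the kernel code, and all names are hypothetical.

    # User-space analogue of the per-LV statistics bookkeeping (illustrative only).
    import threading

    class LVStats:
        def __init__(self):
            self._stats = {}   # device id -> [lock, io_count, sector_count]

        def register(self, dev_id):
            # Collection is enabled when an LV is activated and registered.
            self._stats[dev_id] = [threading.Lock(), 0, 0]

        def on_completion(self, dev_id, sectors):
            # Invoked on each IO completion (in the kernel, from the bio completion path).
            entry = self._stats[dev_id]
            with entry[0]:
                entry[1] += 1
                entry[2] += sectors

        def query_and_reset(self, dev_id):
            # Invoked periodically by the reconfiguration module; resets the counters.
            entry = self._stats[dev_id]
            with entry[0]:
                io_count, sector_count = entry[1], entry[2]
                entry[1] = entry[2] = 0
                return io_count, sector_count

Using one lock per volume means completions and queries on different LVs do not contend with each other, which is the concurrency benefit noted above.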


If the reconfiguration module detects a hotspot, it invokes appropriate routines to identify a balanced configuration. If a balanced configuration is found, the logical volumes are migrated appropriately. We use tools provided by the Linux Logical Volume Manager (LVM), namely pvmove, to achieve data migration while the LVs are online and being actively accessed. The user application continues to work uninterrupted throughout the migration, except for possibly some performance impact while the reconfiguration is in progress.

Finally, since we collect statistics only for IOs actually issued to the block device, any

hits in the buffer cache are transparently handled. Our current implementation does not

account for hits in other caches (disk cache and controller cache).

It is possible that disk accesses for separate bio requests get merged at the disk level. This would mean that the value of IOCount would be overestimated. To account for this, in our implementation, separate bio requests which correspond to contiguous logical sectors and complete within a short time window are treated as one large request. This ensures that the IOCount estimate is closer to the actual value.
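A minimal sketch of this coalescing heuristic is shown below; the record layout and the one-millisecond merge window are assumptions for the example, not values taken from the prototype.

    # Coalesce completions for contiguous logical sectors that finish close together,
    # so IOCount is not inflated by requests the disk serviced as a single access.

    MERGE_WINDOW_SECS = 0.001   # assumed value for illustration

    def coalesce(completions):
        """completions: list of (finish_time, start_sector, sector_count), sorted by finish_time."""
        merged = []
        for t, start, count in completions:
            if merged:
                prev_t, prev_start, prev_count = merged[-1]
                if (start == prev_start + prev_count and
                        t - prev_t <= MERGE_WINDOW_SECS):
                    merged[-1] = (t, prev_start, prev_count + count)
                    continue
            merged.append((t, start, count))
        return merged   # len(merged) is the corrected IOCount; sector totals are unchanged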

5.6 Experimental Evaluation

In this section, we first compare different object remapping techniques using algorithmic simulations. We then present experimental results from the evaluation of our prototype implementation. Since our prototype is limited by the hardware configuration, algorithmic simulations help us exhaustively evaluate the performance of different approaches for a variety of system configurations.

5.6.1 Simulation Results

We used an algorithmic simulator to compare the different algorithms for object remapping described in Section 5.3. The simulator implements all the algorithms and, when invoked for an imbalanced configuration, reports the cost of reconfiguration for each.


We seek to study the performance of the different algorithms as different system parameters are varied. The parameters varied were the system size, the initial system bandwidth and space utilization, and the magnitude of the bandwidth overload. We also study the impact of the optimizations developed for the displace algorithm.

The default storage system configuration in our simulations comprised four logical arrays, each with 20 disks of 18 GB. Each logical array in the system was configured to have an initial storage space and bandwidth utilization of 60% and 50%, respectively.

To achieve a specified storage space utilization on an array, objects were assigned to the array until the desired space utilization had been reached. The object sizes were assumed to be uniformly distributed in the range [1 GB, 16 GB]; the object size was assumed to be a multiple of 0.25 GB. To achieve a specified bandwidth utilization, bandwidth requirement values were first generated, one for each object, in proportion to the object size. A random permutation of these values was then generated and a value assigned to each object in the array. This procedure resulted in a configuration with the desired values of storage space and bandwidth utilization for each array and no correlation between object size and object bandwidth requirements. Note that the default system parameters resulted in an average of 25 objects assigned to each logical array, and thus an average of 100 objects in the storage system (comprised of four arrays).
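The following Python sketch generates one such array configuration; the parameter names and structure are illustrative, chosen only to mirror the procedure just described (capacity of 20 disks of 18 GB each, object sizes uniform in [1 GB, 16 GB] in 0.25 GB steps, bandwidth proportional to size and then randomly permuted).

    # Illustrative sketch of the simulated configuration generator (names assumed).
    import random

    def make_array(capacity_gb=20 * 18, space_util=0.60, bw_util=0.50):
        # Fill the array up to the target space utilization.
        objects, used_gb = [], 0.0
        while used_gb < space_util * capacity_gb:
            size = round(random.uniform(1.0, 16.0) * 4) / 4   # multiple of 0.25 GB
            objects.append({"size": size, "bw": 0.0})
            used_gb += size
        # Bandwidth values proportional to object size, then randomly permuted so
        # object size and bandwidth requirement are uncorrelated.
        total_size = sum(o["size"] for o in objects)
        bw_values = [bw_util * o["size"] / total_size for o in objects]
        random.shuffle(bw_values)
        for obj, bw in zip(objects, bw_values):
            obj["bw"] = bw
        return objects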

To generate an imbalanced configuration, we increased the bandwidth utilization on half the arrays in the system until a desired magnitude of overload had been reached. This resulted in a storage system with half the arrays overloaded and half with spare storage bandwidth. Here, magnitude of overload refers to the average of the bandwidth violation across all arrays in the system. To create an overload, we picked an object at random from one of the arrays that is to be overloaded, and increased its bandwidth requirement by an amount Δbw; for a given system configuration, Δbw was chosen to be the bandwidth requirement of the least loaded object in the system. This procedure was repeated until the desired magnitude of overload had been attained. For our experiments, the default


bandwidth violation threshold was chosen to be 80%, and the default magnitude of overload

was fixed at 5%.
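Continuing the sketch above, the overload injection step might look as follows; again the names and structure are illustrative only.

    # Illustrative sketch of overload injection on one array (names assumed).
    import random

    def inject_overload(array_objs, all_objs, threshold=0.80, overload=0.05):
        # delta_bw is the bandwidth requirement of the least loaded object in the system.
        delta_bw = min(o["bw"] for o in all_objs)
        # Repeatedly pick a random object on the array and raise its requirement
        # until the array exceeds the threshold by the desired magnitude.
        while sum(o["bw"] for o in array_objs) < threshold + overload:
            random.choice(array_objs)["bw"] += delta_bw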

For each experiment, the performance figures reported correspond to an average over 100 runs, i.e., they correspond to the average cost of reconfiguration for 100 imbalanced configurations for the same choice of system parameters. The normalized data displaced figure reported in the following experiments is the total amount of data displaced (equation 5.3) as a percentage of the total data in the system, i.e., the storage space allocated to all the objects put together.
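Expressed as a formula (a restatement of the metric just described, not a reproduction of equation 5.3):

\[
\text{normalized data displaced} \;=\; 100 \times \frac{\sum_{\text{objects moved}} \text{size}}{\sum_{\text{all objects}} \text{size}} \;\;\text{(percent)}
\]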

5.6.1.1 Impact of System Size

In this experiment, we study the impact of the system size on the cost of reconfiguration. We vary the system size, i.e., the number of logical arrays, from 2 to 10. This resulted in systems with the number of disks ranging from 40 to 200. Figures 5.3(a) and 5.3(b) show the performance of the cost-aware and cost-oblivious approaches, respectively, with varying system size. The graphs Random Packing and BSR correspond to the cost-oblivious approaches presented in Sections 5.3.1.1 and 5.3.1.2, respectively. The graphs Random Reassign and DSwap correspond to the cost-aware approaches presented in Sections 5.3.2.1 and 5.3.2.2, respectively.

Figure 5.3(a) shows that DSwap outperforms Random Reassign. Moreover, while the normalized reconfiguration cost remains constant with an increase in system size for the former, it increases for the latter. The higher cost of reconfiguration for the Random Reassign algorithm is because of its randomized nature: with increasing system size the number of possible objects to choose from goes up. The normalized cost remains constant for the DSwap algorithm as objects are chosen from an overloaded array carefully based on their BSR values, and so increasing the system size does not increase the cost. Note, however, that the absolute amount of data displaced does increase with an increase in the system size.


[Plots: normalized data displaced vs. number of arrays (2 to 10) for the cost-aware approaches Random Reassign and DSwap (left) and the cost-oblivious approaches BSR and Random Packing (right).]

(a) Cost-aware (b) Cost-oblivious

Figure 5.3. Impact of system size.

Figure 5.3(b) shows the cost of the reconfiguration for the cost-oblivious approaches. Since both approaches reconfigure the system from scratch, the cost of reconfiguration is significantly higher than that of the cost-aware approaches. In both cases, the cost of reconfiguration increases with an increase in the system size because the probability that an object gets remapped to its original array decreases. Random Packing gives a cost lower than BSR because it is semi cost-aware (see Section 5.3.1.1). For the rest of the experiments described in this section, the cost-oblivious approaches resulted in a similarly high cost of reconfiguration compared to the cost-aware approaches, and so we do not present those results.

In our experiments, for the Random Reassign approach we set the fraction of objects frac (see Section 5.3.2.1) considered for reassignment from the set of objects on all the overloaded arrays to be 1.0, i.e., all the objects were considered. Also, for both the randomized algorithms, Random Reassign and Random Packing, the balanced configuration chosen was the one with the least cost from among 100 runs with different seed values.


[Plots: normalized data displaced for Random Reassign and DSwap as the initial array bandwidth utilization is varied from 50% to 65% (left) and as the mean bandwidth overload is varied from 2% to 10% (right).]

(a) Initial bandwidth utilization (b) Magnitude of overload

Figure 5.4. Impact of bandwidth utilization.

5.6.1.2 Impact of System Bandwidth Utilization

In this experiment, we studied the impact of the bandwidth utilization on the cost of reconfiguration. Figures 5.4(a) and 5.4(b) show the impact of the initial bandwidth utilization and the magnitude of bandwidth overload, respectively.

Figure 5.4(a) shows that as the initial bandwidth utilization is increased from 50% to 65%, the normalized cost remains unchanged for both approaches. This is because increasing the initial system bandwidth utilization merely increases the initial bandwidth requirement of all the objects in the system proportionately. This reduces the fraction of objects that can be reassigned to the underloaded array without any constraint violations. The normalized cost of reconfiguration, however, does not change significantly, as each object reassignment, on average, now displaces more bandwidth. Note that DSwap results in a factor of two lower cost compared to the Random Reassign approach, for reasons similar to those described in the previous experiment.

Figure 5.4(b) shows that with an increase in the magnitude of overload from 2% to 10%, the cost of reconfiguration increases for both approaches. This is because, on average, more data needs to be displaced for a higher magnitude of overload. The rate of increase in the normalized cost is greater for Random Reassign, compared to DSwap, as the objects


are chosen for reassignment at random in the former approach, while in the latter approach objects are considered for reassignment based on their BSR values.

5.6.1.3 Impact of System Space Utilization

In this experiment, we studied the impact of varying system space utilization on the cost of reconfiguration. Figure 5.5(a) shows that the normalized cost of reconfiguration remains almost unchanged for both approaches as the system space utilization is varied from 60% to 90%. This can be attributed to the fact that increasing the system space utilization increases the number of objects on each array. Consequently, for a fixed value of the initial bandwidth utilization, the bandwidth requirement of the objects on an array decreases with an increase in the system space utilization. While this may require that more objects be reassigned to remove the same bandwidth overload, the normalized cost remains largely unchanged. Note that DSwap results in a cost which is a factor of two less than Random Reassign.

We see a slight increase followed by a slight decrease in the cost for the Random Reassign approach. This is because an increase in the space utilization increases the number of objects to choose from for reassignment. The slight decrease that follows is because, at higher space utilizations, the number of objects that can be reassigned decreases as the space constraints on the underloaded arrays become a significant factor. The slight decrease in the cost for the DSwap approach is because, with an increase in the system space utilization, the fraction of objects on the overloaded arrays that can be accommodated on the underloaded arrays decreases.

Figure 5.5(b) shows the percentage of times a balanced configuration was identified for different imbalanced configurations generated for the same choice of parameters. The BSR approach, which reconfigures from scratch, fails to find a balanced configuration all the time when the system space utilization is 90%, because of its deterministic nature. Random Reassign fails at 95% system space utilization, as there is little spare storage space on the


[Plots: normalized data displaced vs. system space utilization (60% to 90%) for Random Reassign and DSwap (left), and percentage of runs in which a balanced configuration was found vs. system space utilization (60% to 100%) for DSwap, Best-fit Random Packing, Worst-fit Random Packing, Random Packing, Random Reassign and BSR (right).]

(a) Cost (b) Percentage Solution Found

Figure 5.5. Impact of space utilization.

underloaded arrays; recall that this approach only reassigns objects from overloaded arrays to underloaded arrays. The best-fit and worst-fit variants of this algorithm performed similarly, and so have not been shown in the figure.

At 100% system space utilization, Random Packing and its variant Worst-fit Random Packing fail to find a balanced configuration all the time. Only the DSwap and Best-fit Random Packing approaches were able to find a balanced configuration for all overloaded configurations. As expected, the best-fit variant of Random Packing is able to identify a balanced configuration in constrained scenarios (see Section 5.3.1.1). DSwap is able to identify a balanced configuration as it swaps objects between arrays. Note that in these experiments ufrac (equation 5.5) was chosen to be 0.50.

5.6.1.4 Impact of Optimizations

In this experiment, we study the impact of the optimizations developed for the displace step of our approach (see Section 5.3.2.2). Here, we present the results in the context of varying the system size. We conducted experiments to study the impact of the optimizations when other system parameters were varied; the results were similar to those for the system size experiment, and so we omit them to avoid repetition.


[Plot: normalized data displaced vs. number of arrays (2 to 10) for DSwap with no optimizations, with optimization 1, and with optimizations 1 and 2.]

Figure 5.6. Impact of optimizations.

Recall that the first optimization involved choosing from among multiple possible groups of objects to move onto an underloaded array to remove the overload. The second optimization used backtracking to improve the soln for each overloaded array.

Figure 5.6 shows the impact of the various optimizations as the system size is varied. As can be seen in the figure, DSwap without any optimizations (NoOpt.) has the maximum normalized cost. Introducing the first optimization (Opt. 1) results in a marginal improvement in the cost. The improvement is more pronounced when both optimizations (Opt. 1 + Opt. 2) are employed. This is because while the first optimization questions only the choice of the root of the soln, the second optimization uses backtracking to question the choice of each of the subsequent objects that comprise the soln, thus resulting in more significant gains.

Note, however, that even this marginal improvement in the normalized cost can matter, since the absolute amount of data that needs to be displaced can differ substantially in the three cases as the system size is increased. Finally, the optimizations can be particularly beneficial, compared to a purely greedy approach, for some specific imbalanced configurations.


5.6.2 Prototype Evaluation

In this section, we demonstrate the effectiveness of our approach by conducting experiments on our Linux prototype. The goal in these experiments was two-fold: (i) to show that the kernel measurement techniques are able to identify a hotspot, and (ii) to demonstrate that the reconfiguration module makes the correct choice when selecting objects and underloaded arrays to remove the hotspot.

In our experiments, we use a simple synthetic workload which provides a great degree of control in imposing a desired amount of IO load on the storage system. The workload for each logical volume was defined using two parameters: (i) the concurrency N, and (ii) the mean think time IA. A workload with concurrency N consists of N concurrent clients; on request completion, each client sleeps for a time interval exponentially distributed with a mean of IA before issuing a new request. The request sizes were fixed and successive requests access random locations on the logical volume. The request size for client requests was fixed at 16 KB in our experiments. Note that while the think time provided control over the load imposed by each client, the client concurrency allowed us to independently control the load being imposed on an array due to accesses to an LV.
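A closed-loop client of this kind can be sketched in a few lines of Python; the function names, the use of reads, and the buffered pread call are assumptions made for the example, while the fixed 16 KB request size and the exponentially distributed think time follow the description above.

    # Illustrative sketch of one closed-loop workload client (not the prototype's generator).
    import os, random, threading, time

    REQUEST_SIZE = 16 * 1024   # 16 KB, as in the prototype experiments

    def client(device_path, volume_bytes, mean_think_secs, stop):
        # Issue a random 16 KB request, then think for an exponentially distributed interval.
        fd = os.open(device_path, os.O_RDONLY)
        try:
            while not stop.is_set():
                offset = random.randrange(0, volume_bytes - REQUEST_SIZE)
                os.pread(fd, REQUEST_SIZE, offset)
                time.sleep(random.expovariate(1.0 / mean_think_secs))
        finally:
            os.close(fd)

    def run_workload(device_path, volume_bytes, concurrency, mean_think_secs, duration):
        # A workload with concurrency N is simply N such clients run in parallel.
        stop = threading.Event()
        threads = [threading.Thread(target=client,
                                    args=(device_path, volume_bytes, mean_think_secs, stop))
                   for _ in range(concurrency)]
        for t in threads:
            t.start()
        time.sleep(duration)
        stop.set()
        for t in threads:
            t.join()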

The characteristics of the host and the storage system in our prototype were as described in Section 5.5. We partitioned the 20 disks in the system to give five striped logical arrays, each comprising four disks and with a stripe unit size of 16 KB. Each array was partitioned into 14 partitions of 4 GB each⁶. These partitions served as building blocks (in the form of LVM physical volumes) for the LVs created on each array; so the LV size was a multiple of 4 GB. Of the five logical arrays, one array was configured without any logical volumes and was used as scratch space when swapping logical volumes between arrays.

⁶ Note that each striped array is visible as a SCSI drive on the host. Since we wanted to utilize the maximum possible storage space on each logical array, and Linux allows only 14 usable partitions for a SCSI drive, we created 14 four-GB partitions for a total of 56 GB of allocatable storage space on each array.


We set the bandwidth overload threshold for all the arrays to 50% in our experiments. The interval I over which the bandwidth utilization was approximated was chosen to be 10s, and the window size W over which reconfiguration decisions were made was chosen to be 100s. Note that these values are small, and were chosen to speed experimentation. Load balancing involving object remapping is a long-term operation and is typically done only over periods of months or more. Finally, we used the 70th percentile of the samples accumulated in the time window W as a measure of the bandwidth utilization of an LV.

As mentioned before, the data migration could be achieved either online or offline.

While our implementation supports online data migration, to speed up experimentation,

we simply reconfigure the arrays for the new mapping of logical volumes. Techniques to

control the rate of online data migration to mitigate the performance impact on foreground

applications have been presented in [39].

We conducted two sets of experiments, one where the LV size (object size) of all the objects on the system was the same, and another where the object sizes differed.

5.6.2.1 Uniform Object Size

For the case of uniform object sizes, we present results from two experiments, one where the system had spare storage space in the form of unallocated partitions, and another where the system had no spare space. While the former invokes the displace step, the latter invokes the swap step. In both experiments, the size of any LV on an array was 4 GB.

In the first experiment, three arrays were configured with 14 LVs each, and the fourth array was configured to have seven LVs, thus leaving half the array empty with seven allocatable partitions. For the first 100 seconds, all the LVs in the system were accessed by a workload with a concurrency of two; the mean think time was fixed at 400ms, 1000ms, 1000ms and 500ms for the workloads accessing LVs on the four arrays, respectively. Figure 5.7(a) shows the estimated average bandwidth utilization as well as the cumulative average


[Plots, panels (a) spare storage space and (b) no spare storage space: percentage bandwidth utilization and cumulative array IOPS vs. time (50-300s) for Arrays 0-3, with the overload threshold marked.]

Figure 5.7. Uniform object size

IOPS across all the LVs for each array as a function of time. The average values reported are over 10-second intervals.

As can be seen, the array utilizations are dictated by the mean think time values for the workloads accessing the component LVs; lower mean think times result in higher utilization values. Also note that the average bandwidth utilization estimated using kernel measurements and the average IOPS on each array based on measurements at the application level follow the same trend. This indicates that our kernel measurements correctly track the application behavior.


At t = 100s we increased the workload on Array 0 by increasing the concurrency of half the clients to nine and that of the other half to four. This results in an increase in the bandwidth utilization on the array. Note that the bandwidth utilization on the remaining arrays remains unchanged. At t = 200s the reconfiguration module detects that Array 0 is overloaded, identifies a new balanced configuration, and triggers the appropriate reconfiguration. The reconfiguration involved moving two LVs from Array 0 to Array 3.

The reconfiguration module correctly identified Array 3 as the destination for the LVs, even though Arrays 1 and 2 had a lower bandwidth utilization, as it was the only array with spare storage space. Moreover, of the LVs, it correctly chose two of the seven logical volumes being accessed by a workload with a concurrency of nine. This choice minimizes the amount of data displaced.

The graph from t = 200s to t = 300s shows the utilization of the arrays after the reconfiguration. As can be seen, the utilization of Array 0 has decreased to a value close to the overload threshold. The utilization of Array 3, which now consists of two additional logical volumes, has increased appropriately. In our experiments, we allow for a soft threshold of 2% around the bandwidth violation threshold, and consequently, no more reconfigurations are triggered. This is done in order to avoid a reconfiguration for minor bandwidth violations.

For the second experiment, we configured the storage system with no spare space and each array comprising 14 LVs. The workload for the LVs on Array 0 was the same as in the previous experiment. The mean think times for workloads on Arrays 1 through 3, however, were chosen to be 500ms, 500ms and 1000ms, respectively. The client concurrency of the workload was fixed at two. Figure 5.7(b) plots the bandwidth utilization and cumulative average IOPS for each array.

In this case, a reconfiguration is again triggered at t = 200s, as in the case above. The reconfiguration involved swapping three heavily accessed logical volumes with three logical volumes on Array 3, the array with the least load in the system. Note that despite the


[Plots: percentage bandwidth utilization and cumulative array IOPS vs. time (50-300s) for Arrays 0-3, with the overload threshold marked.]

Figure 5.8. Variable object size; no spare storage space

workload on Array 0 being similar to that in the first experiment, a slightly different observed utilization, due to peculiarities of a real system, results in three logical volumes being swapped, as opposed to two in the first experiment. Consequently, the drop in bandwidth utilization is greater in this run. The reconfiguration results in a reduction in the bandwidth utilization on the array to a value below the overload threshold.

5.6.2.2 Variable Object Size

For the case of the storage system configured with LVs of variable size, we ran experiments both for the case when the system had spare space and when there was no spare space. The results for the case where the system had spare space were similar to those in the previous experiment; when a hotspot occurred, the heavily accessed volumes were chosen and moved to an array with spare space in order to minimize the amount of data displaced. To avoid repetition, we do not present the results from that experiment.

For the case where the storage system had no spare space, the system configuration was as follows. Arrays 0 and 1 were configured with six LVs each: two LVs each of size 4 GB, 8 GB and 16 GB. Arrays 2 and 3 were configured with 14 LVs each, each of size 4 GB. The mean think times for the workloads accessing the LVs on the four arrays were 300ms, 500ms, 500ms and 1000ms, respectively. For the first 100s of the experiment, the


concurrency for the workload for all the LVs on the system was fixed at two. Figure 5.8 shows the bandwidth utilization and cumulative IOPS as a function of time.

At t = 100s we increased the client concurrency for the workloads accessing the LVs on Arrays 0 and 1 to seven and four, respectively. As can be seen in the figure, this results in an increase in the bandwidth utilization on both arrays. However, only Array 0 observes a violation of the bandwidth threshold. At t = 200s the reconfiguration module detects a hotspot and triggers a reconfiguration. The reconfiguration involved swapping two 4 GB LVs with two LVs of the same size from Array 3. So, the reconfiguration module correctly identifies Array 3 as the array with the least load. Also, since all six LVs are configured with the same workload, the two LVs of size 4 GB are the ones with the maximum BSR.

The graph from t = 200s to t = 300s shows that after the reconfiguration the utilization on Array 0 has decreased to a value below the threshold. Array 3, with two new LVs carrying a heavier load, observes an increase in utilization.

5.6.2.3 Implementation Overheads

Our final experiment aimed to study the overheads introduced, if any, by the kernel enhancements on application performance. While the computation involved in maintaining statistics was minimal, the synchronization primitives employed (Section 5.5) may introduce some overhead. So, in this experiment, we vary the number of logical volumes actively being accessed on each array and compare the application performance for the case where statistics collection is enabled to that when statistics collection is disabled in the kernel. The storage system was configured with 14 LVs, each of size 4 GB, per array. The workload accessing each LV had a client concurrency of two, and the mean think time was set to zero. Note that with no think time, each client would issue a new IO request immediately after the previously issued request completes. Consequently, the storage system is saturated.


[Plot: cumulative array IOPS vs. number of active logical volumes (2 to 14), with statistics collection enabled and disabled.]

Figure 5.9. Impact on application performance.

Figure 5.9 shows the cumulative average IOPS for one of the arrays as a function of the number of LVs being actively accessed; each point corresponds to an average value for a two-minute run. The graph for the other arrays was the same. As can be seen, the average IOPS figure in both cases is similar. So, there was no noticeable overhead on application performance. Since the storage system is saturated, the average IOPS value does not change significantly as the number of active logical volumes is varied. The slight drop in the value in both cases, with an increase in the number of active LVs, is because with a larger number of LVs being accessed the average seek latency increases. This is because, for each additional contiguous partition being accessed on an array, the disk heads on the component disks have to seek over a larger disk surface when servicing requests.

5.6.3 Summary of Experimental Results

Our experiments show that for a variety of system configurations our novel approach reduces the amount of data moved to remove the hotspot by a factor of two as compared to other approaches. Moreover, the larger the system size or the magnitude of overload, the greater the performance gap. Results from our prototype implementation suggested that our kernel measurement techniques correctly track application behavior and identify hotspots. For the simple overload configurations considered, our techniques correctly remove the hotspot


while minimizing the amount of data displaced. Finally, the kernel enhancements do not result in any noticeable degradation in application performance.

5.7 Related Work

Algorithms for moving data objects from one configuration to another in as few time steps as possible have been presented in [25, 28, 35]. It is assumed that the new final configuration is known. In our work, we seek to identify the new final configuration which requires minimal data movement.

Techniques for initial storage system configuration have been presented in [7, 9]. Our work assumes that the storage system is online, and presents techniques to reconfigure the system when workload hotspots occur with minimum data movement.

Load balancing at the granularity of files has been considered in [51]. The work as-

sumes contiguous storage space is available on lightly loaded disks to migrate file extents

from heavily loaded disks. Our work seeks to achieve load balancing at the granularity

of logical volumes and makes no assumptions about the distribution of spare space in the

storage system.

Techniques for moving data chunks between mirrored and RAID5 configurations within an array, based on their load, for improving storage system performance have been proposed in [61]. Our work seeks to achieve improved performance across the storage system by moving logical volumes between arrays.

Disk load balancing schemes for video objects have been presented in [62]. Video objects are assumed to be replicated and load balancing is achieved by changing the mapping of video clients to replicas. In our work, logical volumes are assumed to have no replicas across arrays, and load balancing requires identifying a new mapping of data objects to arrays.

Request throttling techniques to isolate the performance of applications accessing vol-

umes on a shared storage infrastructure have been explored in [33, 40, 57]. We present


algorithms to improve storage system performance by migrating entire logical volumes

between arrays.

Finally, while [39] presents techniques for controlling the rate of data migration to

mitigate the instantaneous performance impact on foreground applications during online

reconfiguration, our work seeks to optimize for the scale of reconfiguration which dictates

the duration of performance impact.

5.8 Concluding Remarks

In this chapter, we argued that techniques employed to load-balance large-scale storage systems do not optimize for the scale of the reconfiguration, i.e., the amount of data displaced to realize the new configuration.

Reconfiguring the system from scratch can incur significant data movement overhead. Our novel approach uses the current object configuration as a hint; the goal being to retain most of the objects in place and thus limit the scale of the reconfiguration. We also described a simple measurement-based technique for identifying hotspots and for approximating per-object bandwidth requirements.

Finally, we evaluated our techniques using a combination of simulation studies and an evaluation of an implementation in the Linux kernel. Results from the simulation study showed that for a variety of system configurations our novel approach reduces the amount of data moved to remove the hotspot by a factor of two as compared to other approaches. The gains increase for a larger system size and magnitude of overload. Experimental results from a prototype evaluation suggested that the measurement techniques correctly identify workload hotspots. For some simple overload configurations considered in the prototype, our approach identified a load-balanced configuration which minimizes the amount of data moved.


CHAPTER 6

SUMMARY AND FUTURE WORK

In this thesis, we argued that improved manageability of a storage system is key to ensuring its availability. The sheer size of these systems, coupled with the complexity and variability of the application workloads that access them and the slew of storage management tasks, however, makes storage management non-trivial. Traditionally, storage management tasks have been performed manually by administrators using a combination of experience, rules of thumb and trial and error. However, such an approach increases the chances of a misconfigured or sub-optimally configured system. This motivates the need for an automated, seamless and intelligent way to manage the storage resource.

Although high-level planning decisions do require human involvement, tasks such as storage resource allocation are amenable to software automation, akin to a self-managing system which executes important operations without the need for human intervention. Moreover, storage management tasks may need to be executed at multiple time scales. Based on the time period at which management tasks need to be instantiated, they can be classified into three categories: initial configuration, short-term reconfiguration and long-term reconfiguration.

In this thesis, we considered problems in each category with a focus on techniques for automating storage resource management. In particular, we considered two storage allocation tasks: storage bandwidth allocation and storage space allocation. We now present a summary of the contributions of this dissertation.


6.1 Thesis Contributions

In this dissertation we made the following contributions.

6.1.1 Initial Storage System Configuration: Placement Techniques in a Self-managing Storage System

The first step in storage management is deciding on a mapping of storage objects to disk arrays. Object placement decisions are integral in determining application performance and thus are crucial to the success of a storage system. For a self-managing storage system, a suitable placement technique is one that has low management overhead and delivers agreeable performance.

Object placement techniques are based on striping, a technique that interleaves the placement of objects onto disks, and can be classified into two different categories: narrow and wide striping. From the perspective of management complexity, these two techniques have fundamentally different implications. Whereas wide striping stripes each object across all the disks in the system and needs very little workload information for making placement decisions, narrow striping techniques stripe an object across a subset of the disks and employ detailed workload information to optimize the placement.

In this work, we performed a systematic study of the tradeoffs of narrow and wide striping to determine their suitability for large-scale storage systems. The work involved (i) simulations driven by OLTP traces and synthetic workloads, and (ii) experiments on a 40-disk storage system testbed.

The results showed that an idealized narrow-striped system can outperform a comparable wide-striped system for small requests. However, wide striping outperforms narrow-striped systems in the presence of workload skews that occur in real I/O workloads; the two systems perform comparably for a variety of other real-world scenarios. The experiments indicated that the additional workload information needed by narrow placement techniques


may not necessarily translate to significantly better performance, and more specifically does not outweigh the benefits of management simplicity innate to a wide-striped system.

6.1.2 Short-term Storage System Reconfiguration: Bandwidth Allocation

In the context of dynamic bandwidth allocation we developed two techniques: one a measurement-based inference technique and another based on learning.

Self-managing Bandwidth Allocation in a Multimedia File Server:

Large-scale storage systems host data objects of multiple types which are accessed by applications with diverse service requirements. For instance, a multimedia file server services a heterogeneous mix of soft-real-time streaming media and traditional best-effort requests. To provide QoS to both application types, employing a reservation-based approach, where the storage space is shared but a certain fraction of the bandwidth is reserved for each class, has certain advantages. By sharing storage resources, the file server can extract statistical multiplexing gains; by reserving bandwidth, it can prevent interference among classes and meet the performance guarantees of the soft-real-time class. Thus, a reservation-based approach has inherent advantages and flexibility which make it suitable for a large-scale storage system.

Dynamic workload variations, as seen in modern file servers, may mean that one set of reservations is not suitable all the time. To address this limitation, in this thesis we developed techniques for self-managing bandwidth allocation in a multimedia file server. In our scheme, we used online measurements to infer bandwidth requirements and guide allocation decisions. A workload monitoring module tracked several parameters representative of the load within each class using a moving histogram. It tracked various aspects of resource usage from the time a request arrives to the time it is serviced by the disk. Monitored parameters include request arrival rates, request waiting times and disk utilizations within each class.


Requests within the best-effort class desire low average response times, while those

within the real-time class have associated deadlines that must be met. We instrumented an

existing disk scheduling algorithm which takes into account these disparate performance

requirement specifications while enforcing allocations and making scheduling decisions.

A simulation study using NFS file-server traces as well as synthetic workloads demonstrated that our techniques (i) provide control over the time scale of allocation via tunable parameters, (ii) have stable behavior during overload, and (iii) provide significant advantages over static bandwidth allocation.

Learning-based Approach for Dynamic Bandwidth Allocation:

An alternative to a measurement-based inference technique for bandwidth allocation is reinforcement learning. An advantage of using reinforcement learning is that no prior training of the system is required; the technique allows the system to learn online. Moreover, a learning-based approach can also handle complex non-linearity in system behavior. In this problem, we assume multiple application classes, each of which specifies its QoS requirement in the form of an average response time goal.

A simple learning approach is one that systematically tries out all possible allocations for each system state, computes a cost function and stores these values to guide future allocations. Although such a scheme is simple to design and implement, it has prohibitive memory and search space requirements; this is because the number of possible allocations increases exponentially with an increase in the number of classes.

A key contribution of our work was the design of an enhanced learning-based approach that uses the semantics of the problem to overcome the drawbacks of the naive learning approach. The technique takes the current system state into account while making allocation decisions and thereby avoids allocations that are clearly inappropriate for a particular state; in other words, the optimized technique intelligently guides and restricts the allocation space explored. These design decisions result in a substantial reduction in memory and search space requirements, making a practical implementation feasible.


We implemented these techniques in the Linux kernel and used the software RAID driver in Linux to configure the disk array. The results showed that (i) the use of learning enables the storage system to reduce the impact of QoS violations by over a factor of two, and (ii) the implementation overhead of employing such techniques in operating system kernels is small.

6.1.3 Long-term Storage System Reconfiguration: Automated Object Remapping

Suitable initial placement obviates the need for frequent reconfiguration, and automated bandwidth allocation, which uses controlled request throttling, helps extract good performance from the system in the face of transient workload changes. Persistent workload changes, which stress the storage system and result in hotspots, make it necessary that the mapping of storage objects to arrays be tuned to ensure agreeable performance.

Moving the system to a new configuration involves executing a migration plan, which is a sequence of object moves. The reconfiguration itself could be carried out either online or offline. In both cases, the scale of the reconfiguration, i.e., the amount of data that needs to be displaced, is of consequence. While for an offline reconfiguration the scale of the reconfiguration determines the duration of the reconfiguration and hence the downtime, for an online reconfiguration it determines the duration of the performance impact on foreground applications. Existing approaches do not optimize for the scale of the reconfiguration, possibly moving much more data than required to remove the hotspot.

To address this limitation, we developed algorithms to minimize the amount of data displaced during a reconfiguration to remove hotspots in large-scale storage systems. Rather than identifying a new configuration from scratch, which may entail significant data movement, our novel approach uses the current object configuration as a hint; the goal being to retain most of the objects in place and thus limit the scale of the reconfiguration. To minimize the amount of data that needs to be moved, we used a greedy approach that uses the bandwidth-to-space ratio (BSR) as a guiding metric. For example, by greedily


selecting high-BSR objects for reassignment, one can displace more bandwidth per unit of data moved. Finally, we used various optimizations, including searching for multiple solutions, to counter some of the pitfalls of a greedy approach.

We evaluated our techniques using a combination of simulation studies and an evalua-

tion of an implementation in the Linux kernel. Results from the simulation study suggest

that for a variety of system configurations our novel approach reduces the amount of data

moved to remove the hotspot by a factor of two as compared to other approaches. The

gains increased for a larger system size and magnitude of overload. Experimental results

from the prototype evaluation suggested that our measurement techniques correctly identify

workload hotspots. For some simple overload configurations considered in the prototype, our approach identifies a load-balanced configuration which minimizes the amount of data

moved. Moreover, the kernel enhancements do not result in any noticeable degradation in

application performance.

6.2 Future Work

In this section, we discuss some future research directions.

• Dynamic Bandwidth Allocation: In this thesis, we addressed the problem of dynamic bandwidth allocation for the case when application classes specify their QoS requirement as an average response time goal. A useful extension to this work would be to allow application classes to specify their QoS requirements in dissimilar ways. An additional enhancement would involve understanding and developing a way of identifying a QoS specification which is suitable and realistic for each class and for the given storage system.

• Automated Object Remapping (extensions): In the prototype for our automated object remapping work we made the simplifying assumption that all arrays in the storage system are similar. As future work, we would like to develop techniques for


quantifying the bandwidth requirements of objects, as well as the bandwidth capacities of arrays, for a storage system comprising heterogeneous arrays.

• Distributed Resource Management: Traditionally, storage systems have been either NAS-based (Network-Attached Storage) or SAN-based (Storage Area Network). While NAS offers ease of management, a SAN offers high throughput. An object-based storage architecture [41] offers a middle ground between the NAS and SAN architectures, suitably blending the advantages of both. Active disks [4] tout the benefits of moving computation closer to the data. Recent work [32] explores the confluence of these two paradigms and presents techniques for leveraging the computational capability at the storage device for interactive search of indexed data. In this thesis, we focused on management of the storage resource. In the context of active disks there are interesting problems in distributed resource management for multiple resources. In particular, finding the right balance between computing at the device and computing at the host, with knowledge of the network interconnect and the bandwidth capacity of the storage device, would be key. Moreover, such an approach should be self-managing, identifying the right tradeoff for diverse applications.


APPENDIX

COMPARISON USING HOMOGENEOUS WORKLOADS

In this appendix we present the detailed results of our homogeneous workload simulation experiments. We experiment with large requests that have a mean request size of 1 MB and a stripe unit size of 512 KB. We repeat each experiment with small requests that have a mean size of 4 KB and a stripe unit size of 4 KB. Unless specified otherwise, we choose request rates that yield a utilization of around 60-65%; this corresponds to a mean inter-arrival time of 17 ms for large requests and 4 ms for small requests, respectively.

Effect of System Size: We vary the number of arrays in the system from 1 to 10 and measure the response times of requests in the narrow and wide striped systems. Each array in the system is accessed by a single stream in narrow striping, and all streams access all arrays in wide striping. Figure A.1 plots the results.

The figure shows that the performance of the two systems is similar over a range of system sizes for both large and small requests. Increasing the system size results in interference between streams in wide striping, since all stores span all arrays. However, because all stores span all arrays, this also leads to better load balancing across arrays. As we increase the system size, the benefits of load balancing offset the impact of interference, and the response times remain almost unchanged.
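The interference/load-balancing tradeoff follows from how the two layouts map a request onto arrays. The sketch below is a simplified illustration of that mapping; the names and the round-robin chunk placement are assumptions, not the thesis prototype.

def arrays_touched(store_id, offset_kb, size_kb, stripe_unit_kb, num_arrays, wide):
    """Set of array indices touched by a request.

    Narrow striping: the whole store lives on one array, so a request touches
    exactly one array and sequential requests stay sequential on that array.
    Wide striping: the store is striped over all arrays in stripe_unit_kb chunks
    (round-robin here), so a single request may touch several arrays.
    """
    if not wide:
        return {store_id % num_arrays}
    first_chunk = offset_kb // stripe_unit_kb
    last_chunk = (offset_kb + size_kb - 1) // stripe_unit_kb
    return {chunk % num_arrays for chunk in range(first_chunk, last_chunk + 1)}

# A 1 MB request with a 512 KB stripe unit touches two arrays under wide striping...
print(arrays_touched(store_id=3, offset_kb=0, size_kb=1024,
                     stripe_unit_kb=512, num_arrays=10, wide=True))   # {0, 1}
# ...but only the store's home array under narrow striping.
print(arrays_touched(store_id=3, offset_kb=0, size_kb=1024,
                     stripe_unit_kb=512, num_arrays=10, wide=False))  # {3}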

Effect of Stripe Unit Size: In this experiment we study the impact of changing the

stripe unit size. Varying the stripe unit size of small requests did not have much impact, so

we omit the results. The stripe unit size of large requests was varied from 128 KB to 2 MB.

The average request size of the large requests was kept fixed at 1 MB. Figure A.2 plots the

results.


[Figure: mean response time (ms) versus system size (1-10 arrays) under the homogeneous workload; panel (a) Large Requests, panel (b) Small Requests.]

Figure A.1. Homogeneous Workload: Effect of System Size

For large requests, when the stripe unit size is small compared to the average request size, wide striping gives higher response times than narrow striping. This is because, although a smaller stripe unit size results in increased parallelism, it also increases the sequentiality breakdown and the probability of interference with requests from streams accessing other stores. To wit, an average request size of 1 MB results in 8 disk accesses for a stripe unit size of 128 KB, as compared to 2 disk accesses for a stripe unit size of 512 KB. The sequentiality of access is maintained in narrow striping since all requests for a store access the same array. An increase in the stripe unit size reduces the extent of sequentiality breakdown, and narrow and wide striping give comparable performance.
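The disk-access counts quoted above are just the number of stripe units a request spans; a minimal check (assuming, for illustration, stripe-aligned requests) is:

import math

def disk_accesses(request_size_kb, stripe_unit_kb):
    """Number of stripe units, and hence disk accesses, a stripe-aligned request spans."""
    return math.ceil(request_size_kb / stripe_unit_kb)

print(disk_accesses(1024, 128))  # 8 accesses: 1 MB request, 128 KB stripe unit
print(disk_accesses(1024, 512))  # 2 accesses: 1 MB request, 512 KB stripe unit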

[Figure: mean response time (ms) versus stripe-unit size (128-2048 KB) for large requests, with one curve per system size (1, 2, 3, 5, 10, and 15 arrays).]

Figure A.2. Homogeneous Workload: Effect of Stripe-unit Size

Effect of Utilization Level: In this experiment, we study the impact of the utilization level by varying the mean inter-arrival times (IA) of requests. The IA time for large (small) requests is varied from 14 ms to 20 ms (3 ms to 7 ms) in steps of 1 ms. Figure A.3 shows the results for the large and the small case, respectively.

Figure A.3 (a) shows that for large requests, as one decreases the IA times, the relative performance of narrow striping improves slightly. This is because, at low IA times, the request rate is higher and streams see increased interference from other streams in wide striping. For larger IA times, narrow and wide striping give comparable performance. Varying the IA times for smaller requests results in similar behavior (see Figure A.3 (b)); the differences in response times between narrow and wide striping in this case, however, are smaller than those observed for large requests because of the smaller transfer time of small requests.

[Figure: mean response time (ms) versus mean inter-arrival time (ms), with one curve per system size (1, 2, 3, 5, 10, and 15 arrays); panel (a) Large Requests (11-20 ms), panel (b) Small Requests (3-7 ms).]

Figure A.3. Homogeneous Workload: Effect of Utilization Level

Effect of Request Size: Next we study the effect of changing the request size. Varying the request size of small requests did not have much impact, so we omit the results. The request size of large requests is varied from 64 KB to 2 MB. The stripe unit size was chosen to be half the average request size. Figure A.4 plots the results. Narrow and wide striping give similar performance for most request sizes. For very large request sizes (2 MB), the interference between request streams results in wide striping giving slightly larger response times.

[Figure: mean response time (ms) versus mean request size (64-2048 KB) for large requests, with one curve per system size (1, 2, 3, 5, 10, and 15 arrays).]

Figure A.4. Homogeneous Workload: Effect of Request Size

Effect of Percentage of Writes: In this experiment we study the effect of varying the percentage of writes, which was varied from 0% to 90%. We chose inter-arrival times of 20 ms and 6 ms for large and small requests, respectively. Figure A.5 plots the results.

For large requests (see Figure A.5 (a)) we observe that as we increase the percentage of writes, the performance difference between narrow and wide striping increases, with wide striping giving higher response times. This is because increasing the percentage of writes increases the background load due to dirty cache flushes, which in turn increases the interference seen by request streams in wide striping. Small requests (see Figure A.5 (b)) exhibit similar behavior; the impact of interference from the background load due to dirty cache flushes, however, is less pronounced because of the smaller size of the requests.


[Figure: mean response time (ms) versus percentage of write requests (0-90%), with one curve per system size (1, 2, 3, 5, 10, and 15 arrays); panel (a) Large Requests, panel (b) Small Requests.]

Figure A.5. Homogeneous Workload: Effect of Percentage of Writes


BIBLIOGRAPHY

[1] Configuring the Oracle database with Veritas software and EMC storage. Tech. rep., Oracle Corporation. Available from http://otn.oracle.com/deploy/availability/pdf/oracbook1.pdf.

[2] EMC Symmetrix Optimizer. Available from http://www.emc.com/products/storagemanagement/symmoptimizer.jsp.

[3] Abdelzaher, T., Shin, K. G., and Bhatti, N. Performance guarantees for web server end-systems: A control-theoretical approach. IEEE Transactions on Parallel and Distributed Systems 13, 1 (Jan. 2002).

[4] Acharya, Anurag, Uysal, Mustafa, and Saltz, Joel H. Active disks: Programming model, algorithms and evaluation. In Architectural Support for Programming Languages and Operating Systems (1998), pp. 81–91.

[5] Agrawal, S., Chaudhuri, S., Das, A., and Narasayya, V. Automating layout of relational databases. In Proceedings of the 19th International Conference on Data Engineering, Bangalore, India (2003).

[6] Allen, N. Don't waste your storage dollars. Research Report, Gartner Group, March 2001.

[7] Alvarez, G., Borowsky, E., Go, S., Romer, T., Becker-Szendy, R., Golding, R., Merchant, A., Spasojevic, M., Veitch, A., and Wilkes, J. Minerva: An automated resource provisioning tool for large-scale storage systems. ACM Transactions on Computer Systems (to appear) (2002).

[8] Alvarez, G., Keeton, K., Merchant, A., Riedel, E., and Wilkes, J. Storage systems management. Tutorial presented at ACM SIGMETRICS 2000, Santa Clara, CA, June 2000.

[9] Anderson, E., Hobbs, M., Keeton, K., Spence, S., Uysal, M., and Veitch, A. Hippodrome: Running circles around storage administration. In Proceedings of the Usenix Conference on File and Storage Technology (FAST'02), Monterey, CA (January 2002), pp. 175–188.

[10] Anderson, E., Kallahalla, M., Spence, S., Swaminathan, R., and Wang, Q. Ergastulum: An approach to solving the workload and device configuration problem. Tech. Rep. HPL-SSP-2001-05, HP Laboratories SSP, May 2001.


[11] Anderson, E., Swaminathan, R., Veitch, A., Alvarez, G., and Wilkes, J. Selecting RAID levels for disk arrays. In Proceedings of the Conference on File and Storage Technology (FAST'02), Monterey, CA (January 2002), pp. 189–201.

[12] Aron, M., Sanders, D., Druschel, P., and Zwaenepoel, W. Scalable content-aware request distribution in cluster-based network servers. In Proceedings of the USENIX 2000 Annual Technical Conference, San Diego, CA (June 2000).

[13] Barham, P. A fresh approach to file system quality of service. In Proceedings of NOSSDAV'97, St. Louis, Missouri (May 1997), pp. 119–128.

[14] Borowsky, E., Golding, R., Jacobson, P., Merchant, A., Schreier, L., Spasojevic, M., and Wilkes, J. Capacity planning with phased workloads. In Proceedings of WOSP'98, Santa Fe, NM (October 1998).

[15] Borowsky, E., Golding, R., Merchant, A., Shriver, E., Spasojevic, M., and Wilkes, J. Eliminating storage headaches through self-management. In Proc. of the First Symposium on Operating System Design and Implementation (OSDI), Seattle, WA (October 1996).

[16] Breslau, L., Cao, P., Fan, L., Phillips, G., and Shenker, S. Web caching and Zipf-like distributions: Evidence and implications. In Proceedings of Infocom'99, New York, NY (March 1999).

[17] Brown, A., Oppenheimer, D., Keeton, K., Thomas, R., Kubiatowicz, J., and Patterson, D. A. ISTORE: Introspective storage for data-intensive network services. In Proceedings of the 7th Workshop on Hot Topics in Operating Systems (HotOS-VII), Rio Rico, AZ (March 1999).

[18] Brown, A., and Patterson, D. A. Towards maintainability, availability, and growth benchmarks: A case study of software RAID systems. In Proceedings of the USENIX Annual Technical Conference, San Diego, CA (June 2000).

[19] Chase, J., Anderson, D., Thakar, P., Vahdat, A., and Doyle, R. Managing energy and server resources in hosting centers. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (SOSP) (October 2001), pp. 103–116.

[20] Chen, P., and Patterson, D. Maximizing performance in a striped disk array. In Proceedings of ACM SIGARCH Conference on Computer Architecture, Seattle, WA (May 1990), pp. 322–331.

[21] Chen, P. M., and Lee, E. K. Striping in a RAID level 5 disk array. In Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (May 1995).

[22] Chesire, M., Wolman, A., Voelker, G., and Levy, H. Measurement and analysis of a streaming workload. In Proceedings of the USENIX Symposium on Internet Technology and Systems (USEITS), San Francisco, CA (March 2001).


[23] Dahlin, M., Mather, C., Wang, R., Anderson, T., and Patterson, D. A quantitative analysis of cache policies for scalable network file systems. In Proceedings of ACM SIGMETRICS'94 (May 1994).

[24] Dan, A., and Sitaram, D. An online video placement policy based on bandwidth and space ratio. In Proceedings of SIGMOD (May 1995), pp. 376–385.

[25] Anderson, Eric, et al. An experimental study of data migration algorithms. In WAE: International Workshop on Algorithm Engineering (2001), LNCS.

[26] Flynn, R., and Tetzlaff, W. H. Disk striping and block replication algorithms for video file servers. In Proceedings of IEEE International Conference on Multimedia Computing Systems (ICMCS) (1996), pp. 590–597.

[27] Gribble, S. D., Manku, G., Roselli, D., Brewer, E., Gibson, T., and Miller, E. Self-similarity in file systems. In Proceedings of ACM SIGMETRICS '98, Madison, WI (June 1998).

[28] Hall, Joseph, Hartline, Jason D., Karlin, Anna R., Saia, Jared, and Wilkes, John. On algorithms for efficient data migration. In Symposium on Discrete Algorithms (2001), pp. 620–629.

[29] Haskin, R. Tiger Shark – a scalable file system for multimedia. IBM Journal of Research and Development 42, 2 (March 1998), 185–197.

[30] Hennessy, J. The future of systems research. IEEE Computer (August 1999), 27–33.

[31] Holton, M., and Das, R. XFS: A next generation journalled 64-bit file system with guaranteed rate I/O. Tech. rep., Silicon Graphics, Inc. Available online as http://www.sgi.com/Technology/xfs-whitepaper.html, 1996.

[32] Huston, Larry, Sukthankar, Rahul, Wickremesinghe, Rajiv, Satyanarayanan, Mahadev, Ganger, Gregory R., Riedel, Erik, and Ailamaki, Anastassia. Diamond: A storage architecture for early discard in interactive search. In FAST (2004), pp. 73–86.

[33] Karlsson, Magnus, Karamanolis, Christos, and Zhu, Xiaoyun. Triage: Performance isolation and differentiation for storage systems. In Proceedings of the International Workshop on Quality of Service (IWQoS 2004), Montreal, Canada (June 2004), pp. 67–74.

[34] Keeton, K., Patterson, D. A., and Hellerstein, J. The case for intelligent disks (IDISKs). In Proceedings of the 24th Conference on Very Large Databases (VLDB) (August 1998).

[35] Khuller, S., Kim, Y., and Wan, Y. Algorithms for data migration with cloning. In ACM Symp. on Principles of Database Systems (2003).

[36] Lamb, E. Hardware spending matters. Red Herring (June 2001), 32–22.


[37] Lee, E. K., and Katz, R. H. An analytic performance model for disk arrays. In Proceedings of the 1993 ACM SIGMETRICS (May 1993), pp. 98–109.

[38] Loaiza, J. Optimal storage configuration made easy. Tech. rep., Oracle Corporation. Available from http://otn.oracle.com/deploy/performance/pdf/optstorageconf.pdf.

[39] Lu, C., Alvarez, G., and Wilkes, J. Aqueduct: Online data migration with performance guarantees. In Proceedings of the Usenix Conference on File and Storage Technology (FAST'02), Monterey, CA (January 2002), pp. 219–230.

[40] Lumb, C., Merchant, A., and Alvarez, G. Facade: Virtual storage devices with performance guarantees. In FAST'03 (2003).

[41] Mesnier, M., Ganger, G., and Riedel, E. Object-based storage. IEEE Communications Magazine 41, 8 (August 2003), 84–90.

[42] Molano, A., Juvva, K., and Rajkumar, R. Real-time file systems: Guaranteeing timing constraints for disk accesses in RT-Mach. In Proceedings of IEEE Real-time Systems Symposium (December 1997).

[43] Nerjes, G., Muth, P., Paterakis, M., Romboyannakis, Y., Triantafillou, P., and Weikum, G. Scheduling strategies for mixed workloads in multimedia information servers. In Proceedings of the 8th International Workshop on Research Issues in Data Engineering (RIDE'98), Orlando, Florida (February 1998).

[44] Nordstrom, E., and Carlstrom, J. A reinforcement learning scheme for adaptive link allocation in ATM networks. In Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications 2, IWANNT'95 (1995).

[45] Patterson, D., Gibson, G., and Katz, R. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of ACM SIGMOD'88 (June 1988), pp. 109–116.

[46] Patterson, D. A., Brown, A., Broadwell, P., Candea, G., Chen, M., Cutler, J., Enriquez, P., Fox, A., Kiciman, E., Merzbacher, M., Oppenheimer, D., Sastry, N., Tetzlaff, W., Traupman, J., and Treuhaft, N. Recovery-oriented computing (ROC): Motivation, definition, techniques, and case studies. UC Berkeley Computer Science Technical Report UCB//CSD-02-1175 (March 2002).

[47] Pradhan, P., Tewari, R., Sahu, S., Chandra, A., and Shenoy, P. An observation-based approach towards self-managing web servers. In Proceedings of ACM/IEEE Intl Workshop on Quality of Service (IWQoS), Miami Beach, FL (May 2002).

[48] Revel, D., McNamee, D., Pu, C., Steere, D., and Walpole, J. Feedback based dynamic proportion allocation for disk I/O. Tech. Rep. CSE-99-001, OGI CSE, January 1999.

[49] Riedel, E., Gibson, G. A., and Faloutsos, C. Active storage for large-scale data mining and multimedia. In Proceedings of the 24th International Conference on Very Large Databases (VLDB '98), New York, NY (August 1998).


[50] Scheuermann, P., Weikum, G., and Zabback, P. Data partitioning and load balancing in parallel disk systems. VLDB Journal 7, 1 (1998), 48–66.

[51] Scheuermann, Peter, Weikum, Gerhard, and Zabback, Peter. Data partitioning and load balancing in parallel disk systems. VLDB Journal: Very Large Data Bases 7, 1 (1998), 48–66.

[52] Seltzer, M., and Small, C. Self-monitoring and self-adapting systems. In Proceedings of the 1997 Workshop on Hot Topics on Operating Systems, Chatham, MA (May 1997).

[53] Shenoy, P., Goyal, P., and Vin, H. M. Architectural considerations for next generation file systems. In Proceedings of the Seventh ACM Multimedia Conference, Orlando, FL (November 1999).

[54] Shenoy, P., and Vin, H. M. Cello: A disk scheduling framework for next generation operating systems. In Proceedings of ACM SIGMETRICS Conference, Madison, WI (June 1998), pp. 44–55.

[55] Singh, S., and Bertsekas, D. Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems 9 (NIPS) (1997), pp. 974–980.

[56] Sundaram, V., and Shenoy, P. Bandwidth allocation in a self-managing multimedia file server. In Proceedings of the Ninth ACM Conference on Multimedia, Ottawa, Canada (October 2001).

[57] Sundaram, V., and Shenoy, P. A practical learning-based approach for dynamic storage bandwidth allocation. In Proceedings of ACM/IEEE Intl Workshop on Quality of Service (IWQoS), Monterey, CA (June 2003), pp. 479–497.

[58] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[59] Ward, J., O'Sullivan, M., Shahoumian, T., and Wilkes, J. Appia: Automatic storage area network fabric design. In Proceedings of the Usenix Conference on File and Storage Technology (FAST'02), Monterey, CA (January 2002), pp. 203–217.

[60] Wijayaratne, R., and Reddy, A. L. N. Providing QoS guarantees for disk I/O. Tech. Rep. TAMU-ECE97-02, Department of Electrical Engineering, Texas A&M University, 1997.

[61] Wilkes, J., Golding, R., Staelin, C., and Sullivan, T. The HP AutoRAID hierarchical storage system. In Proceedings of the Fifteenth ACM Symposium on Operating System Principles, Copper Mountain Resort, Colorado (December 1995), pp. 96–108.

[62] Wolf, J., Yu, P. S., and Shachnai, H. DASD dancing – a disk load balancing optimization scheme for video-on-demand computer systems. In Proceedings of ACM SIGMETRICS'95 (1995), pp. 157–166.
