TAG MANAGEMENT ARCHITECTURE AND POLICIES FOR HARDWARE-MANAGED
TRANSLATION LOOKASIDE BUFFERS IN VIRTUALIZED PLATFORMS
By
GIRISH VENKATASUBRAMANIAN
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2011
ACKNOWLEDGMENTS
My heartfelt gratitude and thanks are due to my advisor Dr. Renato J. Figueiredo
for supporting, encouraging and guiding me in my academic journey culminating in the
PhD degree. His patience and guidance, especially during the initial years, gave me the
confidence to persevere. Learning from him about computer architecture and systems,
virtualization, the art of research, techniques for good writing and strategies for creating
good presentations has been a wonderful experience. I am privileged to have him as my
advisor and mentor.
I thank Dr. P. Oscar Boykin for teaching me techniques of analytical modeling and
for the invigorating discussions on applying engineering principles to solve real-world
problems. I am grateful to Dr. Jose Fortes for giving me an opportunity to be a part of
the ACIS Lab at the University of Florida and for sharing his insight and perspective
on research and the PhD process. I also thank Dr. Tao Li and Dr. Prabhat Mishra for
serving on my committee and for their insightful questions and suggestions which have
enhanced this dissertation.
A good portion of my computer architecture knowledge and simulation skills were
learned and honed during my internships at Intel Corporation. I thank Ramesh Illikkal,
Greg Regnier, Donald Newell and Dr. Ravi Iyer for giving me these opportunities and
Nilesh Jain, Jaideep Moses, Dr. Omesh Tickoo and Paul M. Stillwell Jr. for helping me
complete these internships successfully. I also thank the members of the SoC Platform
and Architecture group at Intel Labs for their ideas and perspectives on my research.
I am especially thankful to Dr. Omesh Tickoo for being a wonderful mentor during and
after my internship.
I would also like to thank my past and present colleagues at ACIS Labs and at
University of Florida including Priya Bhat, Dr. Vineet Chadha, Dr. Arijit Ganguly, Dr.
Clay Hughes, Selvi Kadirvel, Dr. Andrea Matsunaga, Dr. James M. Poe II, Prapaporn
Rattanatamrong, Pierre St. Juste, Dr. Mauricio Tsugawa and David Wolinsky for their
help and feedback on my work and for the many intellectual discussions on computer
architecture, computer networks, modeling and simulation. This work was funded in
part by the National Science Foundation under CRI collaborative awards 0751112,
0750847, 0750851, 0750852, 0750860, 0750868, 0750884, and 0751091 and by a
grant from Intel Corporation. I would also like to acknowledge the University of Florida
High-Performance Computing Center for computation resources. I also thank Virtutech
for their support in using Simics and Naveen Neelakantam from the University of Illinois
at Urbana-Champaign for his help with using FeS2.
My motivation to obtain a PhD was inspired by my parents, Dr. N. K. Venkatasubramanian
and Prabhavathy Venkatasubramanian, and my uncle Vaidyanathan. They, along with
my sister Dr. Chitra Venkatasubramanian and my brother-in-law Murthy S. Krishna, have
been a source of encouragement and support without which this dissertation would not
have been completed. I thank them and dedicate this dissertation to them.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1 Hardware-Managed TLBs in Virtualized Environments . . . . . . . . . . . 14
1.2 Contributions of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.1 Simulation-Based Analysis of the TLB Performance on Virtualized Platforms . . 16
1.2.2 Tag Manager Table for Process-Specific Tagging of the TLB . . . . 17
1.2.3 Mechanisms and Policies for TLB Usage Control . . . . . . . . . . 18
1.3 Outline of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 BACKGROUND: VIRTUAL MEMORY AND PLATFORM VIRTUALIZATION . . 21
2.1 Virtual Memory in Non-Virtualized Systems . . . . . . . . . . . . . . . . . 22
2.1.1 Implementing Virtual Memory Using Paging . . . . . . . . . . . . . 23
2.1.2 Address Translation in x86 with Page Address Extension Enabled . 24
2.2 Translation Lookaside Buffer . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Virtual Memory in Virtualized Systems . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Full-System Virtualization and Shadow Page Tables . . . . . . . . 29
2.3.2 Paravirtualization and Page Tables . . . . . . . . . . . . . . . . . . 30
2.3.3 Hardware Virtualization and Two-Level Page Tables . . . . . . . . . 31
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3 A SIMULATION FRAMEWORK FOR THE ANALYSIS OF TLB PERFORMANCE 34
3.1 Survey of Simulation Frameworks Used in TLB-Related Research . . . . 35
3.2 Developing the Simulation Framework . . . . . . . . . . . . . . . . . . . . 36
3.2.1 Using Simics and FeS2 as Foundation . . . . . . . . . . . . . . . . 37
3.2.2 TLB Functional Model . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.3 Validation of the TLB Functional Model . . . . . . . . . . . . . . . . 39
3.2.4 TLB Timing Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.5 Validating the TLB Timing Model . . . . . . . . . . . . . . . . . . . 42
3.3 Selection and Preparation of Workloads . . . . . . . . . . . . . . . . . . . 45
3.3.1 Workload Applications . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.2 Consolidated Workloads . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.3 Multiprocessor Workloads . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.4 Checkpointing Workloads . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Evaluation of the Simulation Framework . . . . . . . . . . . . . . . . . . . 48
3.5 Using the Framework to Investigate TLB Behavior in Virtualized Platforms 51
3.5.1 Increase in TLB Flushes on Virtualization . . . . . . . . . . . . . . 53
3.5.2 Increase in TLB Miss Rate on Virtualization . . . . . . . . . . . . . 54
3.5.3 Decrease in Workload Performance on Virtualization . . . . . . . . 56
3.5.3.1 I/O-intensive workloads . . . . . . . . . . . . . . . . . . . 57
3.5.3.2 Memory-intensive workloads . . . . . . . . . . . . . . . . 60
3.5.3.3 Consolidated workloads . . . . . . . . . . . . . . . . . . . 61
3.5.4 Impact of Architectural Parameters on TLB Performance . . . . . . 63
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4 A TLB TAG MANAGEMENT FRAMEWORK FOR VIRTUALIZED PLATFORMS 66
4.1 Current State of the Art in Improving TLB Performance . . . . . . . . . . . 66
4.2 Architecture of the Tag Manager Table . . . . . . . . . . . . . . . . . . . . 68
4.2.1 Avoiding Flushes Using the Tag Manager Table . . . . . . . . . . . 70
4.2.2 TLB Lookup and Miss Handling Using the Tag Manager Table . . . 72
4.3 Modeling the Tag Manager Table . . . . . . . . . . . . . . . . . . . . . . . 74
4.4 Impact of the Tag Manager Table . . . . . . . . . . . . . . . . . . . . . . . 74
4.4.1 Reduction in TLB Flushes Due to the TMT . . . . . . . . . . . . . . 74
4.4.2 Reduction in TLB Miss Rate Due to the TMT . . . . . . . . . . . . . 79
4.4.3 Increase in Workload Performance Due to the TMT . . . . . . . . . 82
4.5 Architectural and Workload Parameters Affecting the Impact of the TMT . 88
4.5.1 Architectural Parameters . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5.2 Workload Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5.2.1 Effect of larger memory footprint . . . . . . . . . . . . . . 89
4.5.2.2 Effect of the number of processes in the workload . . . . 91
4.5.3 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6 Comparison of Process-Specific and Domain-Specific Tags . . . . . . . . 96
4.7 Using the Tag Manager Table on Non-Virtualized Platforms . . . . . . . . 97
4.8 Enabling Shared Last Level TLBs Using the Tag Manager Table . . . . . . 99
4.8.1 Using the TMT as the Tagging Framework . . . . . . . . . . . . . . 100
4.8.2 Architecture of the Shared LLTLB . . . . . . . . . . . . . . . . . . . 101
4.8.3 Miss Rate Improvement Due to Shared Last Level TLBs . . . . . . 104
4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5 CONTROLLED SHARING OF HARDWARE-MANAGED TLB . . . . . . . . . . 107
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2 Architecture of the CShare TLB . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3 Experimental Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.4 Performance Isolation Using CShare Architecture . . . . . . . . . . . . . . 115
5.5 Performance Enhancement Using CShare Architecture . . . . . . . . . . 119
5.5.1 Classification of TLB Usage Patterns . . . . . . . . . . . . . . . . . 119
5.5.2 Performance Improvement With Static TLB Usage Control . . . . . 122
5.5.3 Selective Performance Improvement With Static TLB Usage Control 127
5.5.4 Performance Improvement With Dynamic TLB Usage Control . . . 131
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . 136
APPENDIX
A FULL FACTORIAL EXPERIMENT . . . . . . . . . . . . . . . . . . . . . . . . . 139
B FULL FACTORIAL EXPERIMENTS USING THE SIMULATION FRAMEWORK 141
C USING THE TAG MANAGER TABLE FOR TAGGING I/O TLB . . . . . . . . . . 142
C.1 Architecture of VMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
C.2 Prototyping and Simulating the VMA Architecture . . . . . . . . . . . . . . 144
C.3 Using the Tag Manager Table in VMA Architecture . . . . . . . . . . . . . 151
C.4 Functional Verification of the Use of TMT in VMA . . . . . . . . . . . . . . 152
C.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
LIST OF TABLES
Table page
3-1 Pseudocode of the micro benchmark for TLB timing model validation . . . . . . 43
3-2 Throughput of the simulation framework for multiprocessor x86 simulations . . 52
3-3 Simulation parameters for investigating TLB behavior on virtualized platforms . 53
3-4 Impact of Page Walk Latency on TLB-induced performance reduction R_IPC . . 63
4-1 Flush profile for SPECjbb-based workloads with varying heap sizes . . . . . . 90
4-2 Flush profile for TPCC-UVa based workloads with varying number of processes and varying TMT sizes . . . . . . . . . . . . . . . . . . . . . . . . 93
4-3 Factors and their levels for the sensitivity analysis . . . . . . . . . . . . . . . . . 95
4-4 Factors with significant influence on the reduction in TLB miss rates due to CR3 tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5-1 Algorithms for selection of victim SID . . . . . . . . . . . . . . . . . . . . . . . . 114
LIST OF FIGURES
Figure page
2-1 Page walk for a 4KB page with PAE enabled . . . . . . . . . . . . . . . . . . . . 26
2-2 Translation Lookaside Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2-3 Memory virtualization in a virtualized platform . . . . . . . . . . . . . . . . . . . 29
3-1 Simulation framework for analyzing TLB performance . . . . . . . . . . . . . . 38
3-2 Timing flow in the simulation framework . . . . . . . . . . . . . . . . . . . . . . 40
3-3 Validation of the TLB timing model . . . . . . . . . . . . . . . . . . . . . . . . . 44
3-4 Screenshot of the simulation framework in use . . . . . . . . . . . . . . . . . . 49
3-5 Throughput of the simulation framework for uniprocessor x86 simulations . . . 50
3-6 Increase in TLB flushes on virtualization . . . . . . . . . . . . . . . . . . . . . . 54
3-7 Increase in TLB miss rate on virtualization . . . . . . . . . . . . . . . . . . . . . 55
3-8 Decrease in single-domain workload performance on virtualization . . . . . . . 58
3-9 Decrease in consolidated workload performance on virtualization . . . . . . . . 62
3-10 Impact of the pipeline fetch width (FW) on TLB-induced performance reduction 64
4-1 TLB flush behavior with the Tag Manager Table . . . . . . . . . . . . . . . . . . 70
4-2 TLB lookup behavior with the Tag Manager Table . . . . . . . . . . . . . . . . . 72
4-3 Reduction in TLB flushes using an 8-entry TMT . . . . . . . . . . . . . . . . . . 75
4-4 Effect of Tag Manager Table size on the reduction in number of flushes . . . . . 78
4-5 Reduction in TLB miss rate using an 8-entry TMT . . . . . . . . . . . . . . . . . 80
4-6 Effect of TLB associativity on the reduction in miss rate . . . . . . . . . . . . . 82
4-7 Increase in workload performance using an 8-entry TMT . . . . . . . . . . . . . 85
4-8 Effect of the Page Walk Latency on the improvement in performance . . . . . . 87
4-9 Effect of workload memory footprint on the reduction in TLB miss rate . . . . . 91
4-10 Effect of the number of workload processes on the reduction in ITLB miss rate 94
4-11 Comparison of the performance improvement due to process-specific and VM-specific tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4-12 Performance impact of TMT on non-virtualized platforms . . . . . . . . . . . . . 98
4-13 Using the TMT for Shared Last Level TLBs . . . . . . . . . . . . . . . . . . . . 102
4-14 Reduction in DTLB miss rate due to Shared Last Level TLB . . . . . . . . . . . 105
5-1 Performance improvement for consolidated workloads with uncontrolled TLB sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5-2 Controlled TLB usage using CShare architecture . . . . . . . . . . . . . . . . . 112
5-3 Effect of varying TLB reservation on miss rate . . . . . . . . . . . . . . . . . . . 117
5-4 Miss rate isolation using the TMT architecture . . . . . . . . . . . . . . . . . . . 118
5-5 Classification of TLB usage patterns . . . . . . . . . . . . . . . . . . . . . . . . 121
5-6 Overall miss rate improvement for consolidated workload with static TLB usage control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5-7 Overall performance improvement for consolidated workload with static TLB usage control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5-8 Selective performance improvement for consolidated workload with static TLB usage control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5-9 Dynamic TLB Usage Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5-10 Selective performance improvement for consolidated workload with dynamic TLB usage control . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
C-1 Architecture and simulation-based prototype of VMA . . . . . . . . . . . . . . . 145
C-2 IPMMU and I/O TLB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
C-3 Functional validation of the use of TMT in VMA . . . . . . . . . . . . . . . . . . 152
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
TAG MANAGEMENT ARCHITECTURE AND POLICIES FOR HARDWARE-MANAGED
TRANSLATION LOOKASIDE BUFFERS IN VIRTUALIZED PLATFORMS
By
Girish Venkatasubramanian
August 2011
Chair: Renato J. Figueiredo
Major: Electrical and Computer Engineering
The use of virtualization to effectively harness the power of multi-core processors
has emerged as a viable solution to meet the growing demand for computing resources,
especially in the server segment of the computing industry. However, two significant
issues in using virtualization for performance-critical workloads are: 1. the overhead of
virtualization, which adversely impacts the performance of such virtualized workloads,
and 2. the "noise" or variation in the performance of these virtualized workloads due to
the platform resources being shared amongst multiple virtual machines (VMs). Thus,
improving the performance of virtualized workloads and reducing the performance
variations introduced by the sharing of platform resources are two challenges in
the field of virtualization. Meeting these challenges, specifically in the context of
hardware-managed Translation Lookaside Buffers (TLBs), forms the theme of this
dissertation.
To understand the performance impact of the TLB and to investigate the performance
improvement due to various architectural modifications, a suitable simulation framework
is imperative. Hence, the first contribution of this dissertation is developing a full-system
execution-driven simulation framework supporting the x86 ISA and detailed TLB
functional and timing models. Using this framework, it is observed that the performance
of typical server workloads is reduced by as much as 8% to 35% due to the TLB
misses on virtualized platforms, compared to the 1% to 5% reduction on non-virtualized
single-O/S platforms. This clearly motivates the need for improving the TLB performance
for virtualized workloads.
The second part of this dissertation proposes the Tag Manager Table (TMT)
for generating and managing process-specific tags for hardware-managed TLBs,
in a software-transparent manner. By tagging the TLB entries with process-specific
identifiers, multiple processes can share the TLB, thereby avoiding TLB flushes that are
triggered during context switches. Using the TMT reduces the TLB miss rates by 65%
to 90% and the TLB-induced delay by 50% to 80% compared to a TLB without tags,
thereby improving workload performance by 4.5% to 25%. The effect of various factors
including the TLB and TMT design parameters, the workload characteristics and the
TLB miss penalty on the benefit of using the TMT is explored. The use of the TMT in
enabling shared Last Level TLBs is also investigated. Furthermore, the use of the TMT
to tag I/O TLBs, in scenarios where address translation services and TLBs in the I/O
fabric allow I/O devices to operate in virtual address space, is also explored.
While the TMT enables multiple processes to share a TLB, this results in the
TLB becoming a potential source of contention. The third part of this dissertation
investigates the performance implications of such TLB contention and proposes the
CShare TLB architecture to isolate the TLB behavior of virtualized workloads from one
another using a TLB Sharing Table (TST) along with the TMT. The use of the CShare
TLB in increasing the overall performance of consolidated workloads involving streaming
applications with poor TLB usage as well as in selectively increasing the performance
of a high priority workload by restricting the TLB usage of low priority workloads is
explored. It is observed that the increase in the performance of a high priority workload
due to using the TMT without controlled sharing can be further improved by 1.4× using
such TLB usage restrictions. The use of dynamic usage control policies to achieve this
selective performance increase while minimizing the performance reduction of the low
priority workloads is also investigated.
CHAPTER 1
INTRODUCTION
The current paradigm of computing in the server industry is undergoing rapid
changes. On one hand, the demand for computing resources has been growing,
especially in the server segment. This growth is driven by the expansion of online
service providers including cloud computing and social networking services, in addition
to the traditional server-oriented high-performance computing and banking sectors.
Facebook, a major social network provider, has increased the number of servers it uses
from 10,000 to 30,000 in 2009 [1]. Cloud service providers including Amazon EC2,
Rackspace and GoGrid have experienced increasing computing requirements in the
past year [2].
On the other hand, Chip Multi Processor (CMP) architectures, with an ever-increasing
number of processors on a single die, have emerged as the architectural
solution for powerful servers [3]. Processors with 8 hardware threads are already being
used and 16-thread processors have been demonstrated [4].
Virtualization has emerged as one of the key technologies for tapping the
power of CMPs to meet the computing demands of the server segment in a flexible
manner [5]. By encapsulating applications with their Operating System (O/S) and
software stack in virtual machines (VMs), multiple applications can be consolidated on
a single physical platform. Moreover, with the rising emphasis on "green" server rooms
and low-cost autonomic management, virtualization has emerged as a convenient way
to manage Quality of Service (QoS) and resource sharing among the consolidated
applications. Virtualization is also being explored for ensuring application portability in
High Performance Computing (HPC) systems by virtualizing different HPC systems with
disparate architectures into a standardized platform abstraction [6, 7].
Estimates by Gartner [8, 9], predicting that the Hosted Virtual Desktop market will
surpass $65 billion in 2013 and the Software as a Service (SaaS) model using
virtualization will account for 20% of email services by 2012, clearly highlight the
importance of virtualization in the server domain. Similarly, the recent virtualization
of Sandia National Lab’s Red Storm supercomputer using a specially designed
hypervisor [10] is a testament to the applicability of virtualization in HPC domains.
However, the benefit of using virtualization for performance critical server and HPC
applications is accompanied by two significant challenges.
1.1 Hardware-Managed TLBs in Virtualized Environments
Full-system virtualization may be viewed as providing an environment where
multiple applications, each belonging to different users and having different requirements
(such as the software stack on which it runs), can coexist [11]. Typically, the application
running in a VM is unaware that it is virtualized and behaves in exactly the same way as it
would on a real machine except for timing considerations. In this scenario, it is important
to shield the state of one VM from the actions of another VM which runs on the same
physical platform. To ensure this, Popek and Goldberg [12, 13] mandated that attempts
to execute privileged instructions inside the VM should trap to the Virtual Machine
Monitor (VMM).
Satisfying this requirement causes a performance overhead for virtualized
workloads. Specifically, an entry into the VMM or an exit from the VMM involves
changing the CPU mode to the privileged mode and saving and restoring state related
information. Apart from the apparent overhead, these switches also pose an additional
demand on the CPU caches and thereby pollute them, posing further performance
overheads. Reducing such performance degradation is a significant challenge in the
area of platform virtualization. Another challenge on virtualized platforms is the need to
shield the performance of the workload running in one VM from the "noise" or variation
due to the resource consumption of other VMs which share platform resources. While
this is not strictly required to maintain correctness of virtualization, performance isolation
is imperative to achieve predictable performance and for ensuring that the performance
of a high priority workload is not reduced due to the resource requirement of low priority
workloads.
When considering these performance-related challenges, the CPU cache with the
greatest performance impact is the Translation Lookaside Buffer (TLB) [14]. The TLB caches the
translations from the virtual to the physical address space and is in the critical path
of memory operations. Hardware-managed TLBs, which are typical in most virtualized
platforms [15], are flushed on every context switch to ensure that one process’s TLB
entries are not used for other processes. This flushing, however, means that every
process which is switched into context experiences a large number of TLB misses until
the required entries are brought back into the TLB. Thus, the flushing of the TLB and
the subsequent TLB misses and page walks to service these misses constitute a delay
which slows down the performance of the process.
While typically tolerable in the case of non-virtualized systems, this performance
slowdown is quite high in virtualized consolidated scenarios due to the large number of
address spaces and the frequent switches between these address spaces as well as
the switching between the VM and the VMM. It is vital to reduce this TLB-induced delay
in virtualized platforms especially for performance critical applications. Many solutions
attempt to reduce this TLB-induced performance penalty, as explained in Chapter 4, by
sharing the TLB amongst multiple address spaces across context switch boundaries.
This, however, makes the hardware-managed TLB a shared resource and yet another
source of performance noise, necessitating TLB performance isolation solutions. Solving
these performance improvement and performance isolation challenges in the context of
the hardware-managed TLB forms the focus of this dissertation.
1.2 Contributions of the Dissertation
This dissertation makes three major contributions towards solving the challenges
outlined in Section 1.1. A brief outline of the contributions is presented here.
1.2.1 Simulation-Based Analysis of the TLB Performance on Virtualized Platforms
In order to understand the performance degradation caused by the high-frequency
TLB flushing on virtualized platforms and to investigate the impact of various schemes
that are proposed to reduce the TLB-induced delay, simulation frameworks supporting
detailed and customizable performance and timing models for the TLB are needed.
In fact, most works studying hardware-managed TLBs have used miss rates as the
metric for measuring the impact of the TLB [16–18] due to the lack of suitable simulation
frameworks supporting TLB timing models. While the reduction in miss rate is a suitable
initial metric, the true impact of the TLBs on the system performance can be obtained
only by using timing-based metrics.
In addition to satisfying the requirement for TLB models, simulation frameworks that
are used for studying virtualized scenarios should be full-system and execution-driven
to capture the interaction between the hardware, VMM, VM and applications. Moreover,
such simulation frameworks should support the simulation of x86 ISA since that is one of
the most popular virtualized platforms [15]. However, simulating x86 is difficult due to the
complex architecture and the fact that every x86 instruction is broken down into micro
operations (µops) which have to be simulated.
A survey of currently available simulators, as conducted in Chapter 3, clearly shows
that there are few academic simulators that satisfy all these requirements. To address
this issue, a full-system simulation framework supporting x86 ISA and TLB models is
developed, validated and used to experimentally evaluate the performance implications
of the TLB in virtualized environments. This framework uses two existing simulators
(Simics and FeS2) as its foundation and incorporates a TLB timing model. This is the
only academic simulation framework that provides a detailed timing model for the TLB
and simulates the walking of page tables on a TLB miss. Moreover, this framework is
capable of simulating multiprocessor multi-domain workloads, which makes it uniquely
suitable for studying virtualized platforms. Using this framework, the TLB behavior
of I/O-intensive and memory-intensive virtualized workloads is characterized and
contrasted with their non-virtualized equivalents. It is shown that, unlike non-virtualized
single-O/S scenarios, the adverse impact of the TLB on the workload performance is
significant on virtualized platforms. Using the developed simulation framework, it is
shown that this performance reduction for virtualized workloads is as much as 35%
due to the TLB misses which are caused by the repeated flushing of the TLB and the
subsequent page walks to service these misses.
1.2.2 Tag Manager Table for Process-Specific Tagging of the TLB
To address this issue of TLB-induced performance reduction, this dissertation
proposes a novel microarchitectural approach called the Tag Manager Table (TMT). The
TMT approach involves tagging the TLB entries with tags that are process-specific, thus
associating them with the process which owns them. By tagging the TLB entries, TLB
flushes can be avoided during context switches, as well as during switches between
the VMM and the VM. This results in a reduction in the TLB miss rate. The TMT is a
small, fast, fully associative cache which is implemented at the same level as the TLB.
Every TLB has an associated TMT. Each entry in the TMT captures the context of a
process and stores a unique tag associated with this process which is used to tag the
TLB entries of this process. The TMT is designed to generate and manage these tags in
a software-transparent fashion while ensuring low latency of TLB lookups and imposing
a small area overhead.
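As a rough illustration of this design, the following C sketch models the TMT's role in a context switch. The entry layout, the 8-entry capacity, the round-robin replacement policy and the helper function are illustrative assumptions for exposition, not the exact hardware design evaluated in Chapter 4.

#include <stdint.h>
#include <stdbool.h>

#define TMT_ENTRIES 8            /* assumed capacity; Chapter 4 evaluates an 8-entry TMT */

/* One TMT entry: the context of a process (its page-table base, i.e. CR3
 * on x86) plus the short tag used to label that process's TLB entries. */
typedef struct {
    uint64_t cr3;
    uint8_t  tag;
    bool     valid;
} tmt_entry_t;

static tmt_entry_t tmt[TMT_ENTRIES];

/* Invalidate only the TLB entries carrying `tag` (hardware would do this
 * with a parallel tag match; a stub stands in for it here). */
static void tlb_invalidate_tag(uint8_t tag) { (void)tag; }

/* On a context switch the new CR3 is looked up in the fully associative
 * TMT. A hit returns the existing tag and no flush is needed; a miss
 * recycles a victim entry, so only the victim's tagged TLB entries are
 * invalidated rather than the whole TLB. */
static uint8_t tmt_context_switch(uint64_t new_cr3)
{
    static uint8_t next_victim = 0;          /* simple round-robin replacement */

    for (int i = 0; i < TMT_ENTRIES; i++)
        if (tmt[i].valid && tmt[i].cr3 == new_cr3)
            return tmt[i].tag;               /* hit: reuse tag, no TLB flush */

    uint8_t v = next_victim;
    next_victim = (uint8_t)((next_victim + 1) % TMT_ENTRIES);
    if (tmt[v].valid)
        tlb_invalidate_tag(tmt[v].tag);      /* selective, not full, flush */
    tmt[v] = (tmt_entry_t){ .cr3 = new_cr3, .tag = v, .valid = true };
    return v;
}

The key property is visible in the miss path: only the TLB entries carrying the victim's tag are invalidated, so the entries of the other processes still captured in the TMT survive the switch.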
The benefit of using the TMT and process-specific tagged TLB in virtual platforms is
estimated using the developed simulation framework. It is found that using process-specific
tags reduces the TLB miss rate by about 65% to 90% for typical server workloads
compared to using no tags. This reduction in miss rate effectively reduces the
TLB-induced delay by about 50% to 80% which, depending on the TLB miss penalty,
translates into a 4.5% to 25% improvement in the performance of the workloads. The
effectiveness of the TMT approach depends on microarchitectural factors including
the size of the TLB and TMT, the page walk latency and the workload characteristics,
including the number of processes and the working set size of the workload. On the
other hand, the associativity and replacement policy of the TLB play little role in deciding
the impact of the TMT. These various architectural and workload-related factors are
prioritized according to their impact on the benefit obtained from using the TMT.
The primary motivation for the Tag Manager Table is avoiding TLB flushes by
tagging the contents of the TLB with process-specific identifiers and thereby enabling
multiple processes to share a TLB. Since the tags are generated at a process-level
granularity and are not tied to any virtualization-specific aspect, the TMT may be used
to avoid TLB flushes in non-virtualized scenarios as well. In addition, sharing across
multiple per-core private TLBs using a hierarchical design with a shared Last Level TLB
(LLTLB) in order to exploit inter-TLB sharing [19], is made possible on platforms with
hardware-managed TLBs using the Tag Manager Table. This dissertation also shows
that, even for two unrelated workloads with little scope for inter-TLB sharing, shared
LLTLBs reduce the miss rate by 15% to 28% compared to private LLTLBs occupying the
same on-chip area, due to better usage of the TLB space. Another scenario
in which the TMT may be used is in tagging I/O TLBs, in scenarios where address
translation services and TLBs in the I/O fabric allow I/O devices to operate in virtual
address space, and synchronizing the I/O TLB flushes with the core TLB flushes. These
scenarios are investigated in this dissertation.
1.2.3 Mechanisms and Policies for TLB Usage Control
One of the advantages of virtualization is that, by consolidating applications which
stress different parts of the system, the average utilization of the entire system can be
increased. However, even completely disparate applications will share core platform
resources and influence the performance of one another depending on the consumption
of these core resources. Since the TMT enables the sharing of the TLB among multiple
workloads, it makes the TLB one such shared resource and renders the performance of
an application in one VM susceptible to variations due to the TLB usage of other VMs
sharing the TLB. This necessitates mechanisms and policies for controlling the use of
the TLB.
The third part of this dissertation addresses this need. First, the TLB space
utilization of consolidated workloads, with more than one VM running on the same
physical platform, is characterized in order to understand the performance noise due to
shared TLBs and to motivate the need for explicitly controlling the usage by different
workloads sharing the TLB. Then, the CShare TLB architecture, consisting of the TMT
with a TLB Sharing Table (TST) to control the usage of the shared TLB, is proposed. It
is shown that the TLB behavior of a workload running in a VM can be isolated from the
TLB usage of other VMs running on the same platform by assigning fixed slices of the
shared TLB space using the TST to the various VMs. The use of the TST in improving
the overall performance of consolidated workloads or in selectively improving the
performance of a high priority workload by restricting the TLB usage of other low priority
workloads is explored. This dissertation shows that the performance improvement for
the high priority workload that is achieved by using the TMT without usage control can
be further increased by 1.4× by restricting the TLB usage of low priority workloads
using the TST. The cost of such selective performance enhancement for various types
of workloads and the use of dynamic usage control policies for minimizing this cost are
also investigated in this dissertation.
1.3 Outline of the Dissertation
The remaining part of this dissertation is organized as follows. Relevant background
information about virtual memory, TLBs and memory management in virtualized
systems is presented in Chapter 2. The design and validation of the full-system
simulation framework with the TLB timing model is described in Chapter 3 along with
an analysis of the TLB-induced performance degradation in virtualized workloads. The
architecture and functionality of the Tag Manager Table and the performance benefit
of using it is presented in Chapter 4. The use of the TMT in enabling shared LLTLBs is
also discussed in this chapter. The need for usage management policies in the TLB is
motivated in Chapter 5 and the use of the CShare TLB for achieving usage control with
static and dynamic policies is discussed in depth. The leveraging of the TMT to tag I/O
TLBs is proposed, simulated and validated in Appendix C. The conclusions from this
dissertation are summarized in Chapter 6.
CHAPTER 2
BACKGROUND: VIRTUAL MEMORY AND PLATFORM VIRTUALIZATION
Virtualization can be viewed as the successor to emulation [20]. In the case
of computer systems, emulation is the process of duplicating the functions of a
target system using a different source system, so that the source system behaves
like the target system. The target system is usually emulated at the functional level.
Virtualization takes this concept to the next level by allowing a host system to behave
like multiple different guest systems [20].
Platform virtualization or full-system virtualization, one of the common types of
virtualization, is defined as the hiding of the physical characteristics of a computing
platform from users and showing an abstract computing platform. The abstraction thus
exposed is called a Virtual Machine. The virtual machine monitor (VMM) or hypervisor
acts as the control and translation system between the VMs and the physical platform
hardware. A VM behaves in the exact same way as a physical machine and, except for
timing considerations, is indistinguishable from a physical machine. The software stack
running inside a VM is unaware that it is not directly running on a physical machine.
Since the level at which the abstraction is provided tends to be the Instruction Set
Architecture (ISA), such virtualization is also known as full-system virtualization or ISA
level virtualization.
In addition to server consolidation for harnessing the power of CMPs, as mentioned in
Chapter 1, virtualization has many advantages:
• In a server environment, virtualization reduces the cost of infrastructure by maximizing the utilization of the resources and enhancing the management capabilities.
• Desktop Virtualization [21], the concept of using a thin and inexpensive client to access a virtual desktop running on powerful backend servers, enables simpler and inexpensive provisioning of desktops and lowers the costs for managing security and deploying new software by the system administrator.
• Hosted virtual machines, wherein the VM runs as an application on the host platform along with several host-level non-virtualized applications, can be used to provide an effective isolated sandbox for software testing and development [22].
• Virtualization enables utility computing and cloud computing [23]. Using service models such as Infrastructure as a Service (IaaS) and Applications as a Service (AaaS), virtualization can provide economical and secure utility computing with guarantees of privacy and isolation of data and performance.
• Virtualization enables computing grids spanning widely distributed resources. By providing different users with virtual machine images [24, 25] which can scavenge computing cycles from their resources, it becomes possible to create a pool of computing power which can be used for large-scale computing.
While virtualization provides better resource utilization and new paradigms of
computing, virtualizing a computer system is challenging. Specifically, in the case of the
memory subsystem, it is important to realize that memory is already virtualized even
on non-virtualized single-O/S systems. Platform virtualization adds yet another layer of
abstraction to this already-virtualized memory. Creating and managing these levels of
abstraction makes memory virtualization challenging. Since the work in this dissertation
lies in the domain of memory virtualization, some relevant background about memory
virtualization in non-virtualized platforms as well as virtualized platforms is presented in
this chapter.
2.1 Virtual Memory in Non-Virtualized Systems
Memory virtualization is a concept whereby an application is provided with an
abstraction of an address space that is different from the actual physical memory. This
abstracted address space is termed as virtual memory, virtual address space or linear
address space. By virtualizing memory and providing processes with unique virtual
address spaces, multiple processes can share the physical memory [20]. Using this
abstraction, applications can be written assuming a contiguous address space without
the programmer having to consider issues such as the size of the physical memory
and the range of addressable locations. Using virtual memory, a program can use
absolute addressing modes and can be easily ported from one machine to another
without needing any change. Memory virtualization may also be used for providing the
application with a memory space that may be in excess of the actual physical memory
available. Moreover, virtual memory may be used to enforce memory isolation amongst
multiple processes and restrict the type of accesses allowed on different memory
locations based on the semantics of the data stored at those locations.
2.1.1 Implementing Virtual Memory Using Paging
Memory is typically virtualized by paging. Here, the available physical memory is
partitioned into multiple fixed-size blocks called page frames. The virtual memory is
composed of blocks termed as pages, whose size is the same as the frames. Whenever
a certain virtual address needs to be accessed, the page containing that address is
fitted onto a page frame in physical memory by mapping the virtual page to the physical
frame address. The page table stores the details of the virtual to physical mapping.
The process of converting a virtual address to a physical address, in order to access
memory, is known as address translation. Since address translation is a high frequency
operation in the critical path of all memory accesses, it is usually implemented in the
Memory Management Unit (MMU) hardware.
Address translation
Address translation consists of looking up the virtual to physical address mapping
from the page tables and this process is termed as the page walk. Since the page
table also contains information such as the types of operations permitted on the page,
address translation also provides some measure of isolation and protection. If the page
is not currently mapped in memory, a page fault is raised and handled by the system
software by mapping the virtual address page onto a free physical memory frame
or evicting an existing page from its frame and reusing the frame for the new page.
The page table entries for the new page as well as the evicted victim page are updated. The
contents of the page which has been evicted from physical memory are maintained in the
virtual memory disk cache.
A flat page table, which stores all the page mapping information in a single-level
table, is conceptually simple. But, the physical memory requirements for such a flat
table make it prohibitively expensive. Hence, multi-level page tables are used. Here,
the starting address of the first level page tables is usually stored in a register called
the Page Table Base Register (PTBR). In conjunction with the PTBR, a part of the
virtual address is used to index the first level of page tables. The contents of the indexed
location in the first level page table points to the start of the second level of page tables.
Along with the next part of the virtual address, this is used to index the second level
page table. This process is continued until the last-level page table is indexed and the
physical address corresponding to the virtual address is obtained.
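The indexing scheme described above can be summarized by a short C sketch of a generic multi-level walk. The phys_read64 helper, standing in for the MMU's memory accesses, and the parameterization by shift amounts and index widths are assumptions for illustration; the attribute bits carried in each entry are simply masked off.

#include <stdint.h>

#define LEVELS 3                  /* depth of the hierarchy (illustrative) */

/* Stands in for the MMU's read of one page-table entry from physical
 * memory; a simulator would model this as a memory access. */
extern uint64_t phys_read64(uint64_t addr);

/* Generic multi-level walk: at each level the PTBR (or the previous
 * entry) gives the table base and a slice of the virtual address gives
 * the index into that table; the final entry is the frame base, to
 * which the page offset is appended. The low 12 attribute bits of each
 * entry are masked off for brevity. */
uint64_t page_walk(uint64_t ptbr, uint64_t va,
                   const int shift[LEVELS], const int bits[LEVELS],
                   int offset_bits)
{
    uint64_t table = ptbr;
    for (int level = 0; level < LEVELS; level++) {
        uint64_t index = (va >> shift[level]) & ((1ULL << bits[level]) - 1);
        table = phys_read64(table + index * 8) & ~0xFFFULL;
    }
    return table | (va & ((1ULL << offset_bits) - 1));
}

For the PAE mode described in the next section, this corresponds to three levels with shift amounts {30, 21, 12}, index widths {2, 9, 9} and a 12-bit page offset.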
The set of hierarchical page tables may also be paged, i.e., parts of the hierarchical
page tables may reside in disk and can be brought into physical memory when needed.
In such cases the upper levels of the hierarchical page table are always maintained
in memory to avoid deadlocks. It should be noted that most systems allow the
existence of more than one page size. By using large pages, where a larger block of
contiguous physical memory is mapped to a single page, the size of the page tables can
be reduced. Such large pages are also termed as super pages or big pages.
2.1.2 Address Translation in x86 with Page Address Extension Enabled
Since x86 is the most popular virtualized architecture, the details of the address
translation process on x86 warrant a close examination. Specifically, since the
system simulated in this work uses PAE addressing mode and most virtualization
solutions on 32-bit x86 use PAE addressing mode, the address translation in PAE mode
is described in detail in this section.
32-bit x86 has several different modes of paging, one of which is Physical Address
Extension (PAE) virtual addressing mode. With a 32-bit physical address, the maximum
addressable physical space is 4GB (2^32 bytes). PAE is a feature of
the x86 architecture that allows access to more than 4 GB of RAM, if the operating
system supports it. In the PAE mode, a virtual address belonging to a 4KB small page
is translated in a four-step process, as shown in Figure 2-1. The CR3 register is the
PTBR for the x86 architecture and points to the Page Directory Pointer Table. The two most
significant bits (MSBs) of the virtual address (VA) are used as an offset from the starting
address of the Page Directory Pointer Table and the Page Directory Pointer Table Entry
(PDPTE) is obtained, as shown in step 1 of Figure 2-1. The PDPTE points to the base
of the Page Directory Table, which is the next level in the multi-level page table.
The 9 Least Significant Bits (LSBs) of the PDPTE contain attributes of all the
pages belonging to that Page Directory Table such as the Read/Write attributes and the
CPU privilege requirement for accessing these pages. These PDPTE attribute bits are
masked and replaced by an offset composed of bits 29 to 21 from the virtual address.
The resulting address is used to read the Page Directory Table entry (PDE) for this
virtual address, as shown in step 2.
Similar to the PDPTE, the 9 LSBs of the PDE are also attribute bits, which are
masked and replaced by the next 9 significant bits of the virtual address. This resulting
address is used and the Page Table Entry (PTE) is read, as in step 3. The PTE points
to the starting location of the physical memory page frame where the page containing
the virtual address is fit. Hence the PTE is sometimes referred to as the Physical
Frame Number (PFN) or Physical Page Number (PPN). The final step, step 4, consists
of accessing the page that is pointed to by the PTE and adding the 12 LSBs to get
the physical address (PA) corresponding to the virtual address. Since these 12 bits
indicate a byte within a page, they are termed the Page Offset and the remaining 20
MSBs the Virtual Page Number (VPN). It should be noted that the attributes of a page are
determined as the logical AND of the attributes from the PDPTE, the PDE and the PTE.
Figure 2-1. Page walk for a 4KB page with PAE enabled

In the PAE mode, large pages of size 2MB are identified by bit 7 of the PDE being
set. Until the PDE is determined, the page walk is identical for large and small pages.
But once the PDE is read and the page is found to be a large page, the base address of
the large page, rather than the address of a PTE, is obtained from the PDE. Then, the
remaining 21 bits of the virtual address are used as an offset into the large page to access
the physical address corresponding to the virtual address.
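The four-step walk of Figure 2-1, including the 2MB large-page shortcut, can be made concrete with the following C sketch. The phys_read64 helper is again an assumed stand-in for physical memory reads, and the masking of attribute and reserved bits (including the alignment of the PDPT base held in CR3) is simplified; permission checks and the AND-ing of attributes across levels are omitted.

#include <stdint.h>

extern uint64_t phys_read64(uint64_t addr);  /* simulated physical memory read */

#define ADDR_MASK      0xFFFFFFFFF000ULL     /* strip the 12 low attribute bits */
#define PDE_LARGE_PAGE (1ULL << 7)           /* bit 7 of the PDE marks a 2MB page */

/* PAE walk of a 32-bit virtual address, split as 2 + 9 + 9 + 12 bits,
 * following steps 1-4 of Figure 2-1. */
uint64_t pae_walk(uint64_t cr3, uint32_t va)
{
    /* Step 1: VA[31:30] indexes the Page Directory Pointer Table. */
    uint64_t pdpte = phys_read64((cr3 & ADDR_MASK) + ((va >> 30) & 0x3) * 8);

    /* Step 2: VA[29:21] indexes the Page Directory. */
    uint64_t pde = phys_read64((pdpte & ADDR_MASK) + ((va >> 21) & 0x1FF) * 8);

    /* 2MB large page: the PDE holds the page base; VA[20:0] is the offset. */
    if (pde & PDE_LARGE_PAGE)
        return (pde & 0xFFFFFFE00000ULL) | (va & 0x1FFFFF);

    /* Step 3: VA[20:12] indexes the Page Table, yielding the PTE (the PFN). */
    uint64_t pte = phys_read64((pde & ADDR_MASK) + ((va >> 12) & 0x1FF) * 8);

    /* Step 4: append the 12-bit page offset to the physical frame base. */
    return (pte & ADDR_MASK) | (va & 0xFFF);
}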
2.2 Translation Lookaside Buffer
To speed up the page walk process, a small associative cache called the Translation
Lookaside Buffer (TLB) is used for caching the translations for the recently accessed
pages. The structure of a typical TLB is shown in Figure 2-2. Every entry in the TLB
contains three fields:
• The Virtual Page Number (VPN).
• The Physical Page Number (PPN) corresponding to the VPN.
• The attributes of the page indicating the write permissions for the page (R/W), the CPU mode required to access the page (S/U), the cacheability of the page and the type of physical memory (MTRR, PAT), as well as the accessed and dirty state for the page table entries corresponding to this translation.
Whenever an address translation is required, the TLB is first looked up to check if the
translation is cached, as shown in Figure 2-2. If the lookup hits in the TLB, the page
offset from the virtual address is used along with the PPN from the TLB entry to get the
physical address without having to go through the entire page walk. On a TLB miss,
however, the page tables are walked and the address translation is obtained. Depending
on the replacement policy, a victim is evicted from the TLB and that slot is populated
with the VPN, PPN and attributes obtained from the page walk.
Figure 2-2. Translation Lookaside Buffer for caching the recently used virtual to physical address translations.
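A minimal C sketch of the lookup just described is shown below, assuming a small set-associative TLB for 4KB pages; the geometry and field widths are illustrative, and the refill on a miss (the page walk followed by victim replacement) is left to the caller.

#include <stdint.h>
#include <stdbool.h>

#define TLB_SETS 16
#define TLB_WAYS 4                            /* a 64-entry, 4-way TLB (illustrative) */

typedef struct {
    uint32_t vpn;                             /* virtual page number  */
    uint32_t ppn;                             /* physical page number */
    uint16_t attr;                            /* R/W, S/U, cacheability, A/D bits */
    bool     valid;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_SETS][TLB_WAYS];

/* Look up a virtual address: the VPN selects a set, and all ways in the
 * set are compared (in parallel in hardware). On a hit the PPN is
 * concatenated with the 12-bit page offset; on a miss the caller must
 * perform the page walk, evict a victim and refill the TLB. */
bool tlb_lookup(uint32_t va, uint64_t *pa)
{
    uint32_t vpn = va >> 12;                  /* 4KB pages */
    uint32_t set = vpn % TLB_SETS;

    for (int way = 0; way < TLB_WAYS; way++) {
        tlb_entry_t *e = &tlb[set][way];
        if (e->valid && e->vpn == vpn) {
            *pa = ((uint64_t)e->ppn << 12) | (va & 0xFFF);
            return true;                      /* TLB hit */
        }
    }
    return false;                             /* TLB miss: walk the page tables */
}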
TLBs can be broadly classified into Software-Managed TLBs or Architected TLBs,
such as in SPARC and ALPHA [26, 27] and Hardware-Managed TLBs, such as in
x86 [28], depending on the behavior on a TLB miss. In software managed TLBs,
the TLB raises a fault on a TLB miss which is handled in a fashion similar to any
general interrupt. The pipeline gets flushed [29] and the page walk is performed by
the O/S. Once the page walk is completed, the TLB is populated and then the pipeline
is restarted. The advantage of the software managed TLB is that the O/S may use
intelligent schemes to populate the TLB and redefine the organization of the page table
to suit the new schemes. However, the time taken for the page walk is significantly
higher than in hardware-managed TLBs and the page walk process may pollute the
instruction cache.
In hardware-managed TLBs, the structure of the page table and the format of the
page table entries are defined by the ISA and are fixed. When a TLB miss occurs, a
hardware state machine walks the page tables, determines the translation and populates
the TLB. This mechanism is much faster than a software managed TLB [30], since the
page walk happens entirely in hardware. Moreover, it does not stall the pipeline and
instructions which are not dependent on this particular translation can be executed out
of order [31]. The disadvantage of hardware-managed TLBs arises during a context switch.
When there is a context switch from one process to another, the hardware-managed
TLB gets flushed to avoid using the TLB entries of the first process for the second
process. In software managed TLBs, however, most operating systems tag the contents
of the TLB with some ID which relates the entries to the process to which they belong
and thereby avoid flushing the TLB on context switches. Thus, with hardware-managed
TLBs, every process which is switched into context experiences a large number of TLB
misses until the required entries are brought back into the TLB.
2.3 Virtual Memory in Virtualized Systems
As seen in the previous section, a non-virtualized system has two levels of memory:
the physical memory and the virtual memory which is an abstraction of the physical
memory and which gets exposed as a unique address space to every process. With
platform virtualization, the virtual memory is abstracted by the VMM and is presented as
physical memory to the VM. This memory is further virtualized by the guest O/S running
on the VM. To avoid ambiguity, this level of memory is referred to as "real memory".
The three different levels of memories in a virtualized platform are clearly indicated in
Figure 2-3.
In the three-level memory architecture of a virtualized platform, the page tables
maintained by the guest O/S contain translations between virtual memory and real
memory. Similarly, the page tables maintained by the VMM contain the mapping
between real memory and physical memory. It is this abstraction of the physical
Figure 2-3. Memory virtualization in a virtualized platform
memory into real memory that achieves the goal of virtualizing memory at the VM-VMM
interface. Because of this three-level memory abstraction, the virtual address seen by an
application inside a VM has to be translated to the real memory domain using the page
tables of the VM. Then, this real address has to be translated by the VMM to physical
memory and the required data can then be accessed. However, while maintaining two
sets of page tables is conceptually simple, this approach is rarely used due to the cost
involved. Rather, this is handled in one of the following three
ways.
2.3.1 Full-System Virtualization and Shadow Page Tables
Full-system virtualization solutions such as VMware use the concept of shadow
page tables [32]. The VMM maintains a set of shadow page tables (SPTs), one for every
process in every guest VM. These SPTs are invisible to the guest O/S and map the
virtual memory pages directly to physical memory. By using the SPTs, one set of page
walks can be eliminated, thereby making the address translation process faster.
To achieve this, the Page Table Base Register (PTBR) is virtualized. When starting
a guest, the VMM populates the physical PTBR with the location of the shadow page
tables and the virtual PTBR with the real memory location of the guest O/S’s page
tables. Whenever the guest attempts to read or write the PTBR, the instruction traps to
the VMM. If this is a write attempt, which may be caused by a context switch inside the
guest, the virtual PTBR is updated with the real memory address pointing to the page
tables of the new process. The physical PTBR is then updated by the VMM to point to
the physical memory location which contains the shadow page table of the new process
of the guest VM. If the attempt is a read attempt, the VMM returns the virtual PTBR
value to the guest O/S.
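The trap-and-emulate handling of the PTBR described above can be sketched in C as follows. The vm_state_t layout and the helper functions are hypothetical names used only to make the control flow concrete; a real VMM would also allocate and synchronize the shadow page tables here.

#include <stdint.h>

/* Per-VM state for the virtualized PTBR: the guest sees vcr3 (pointing at
 * its own page tables in real memory) while the hardware CR3 points at
 * the shadow page tables that map virtual pages directly to physical. */
typedef struct {
    uint64_t vcr3;          /* value the guest reads back                */
    uint64_t shadow_cr3;    /* what is actually loaded into hardware CR3 */
} vm_state_t;

/* Hypothetical helpers: locate (or build) the shadow page table for a
 * given guest CR3, and load the hardware CR3 register. */
extern uint64_t shadow_table_for(vm_state_t *vm, uint64_t guest_cr3);
extern void     hw_load_cr3(uint64_t phys_addr);

/* Trap handler for a guest PTBR access; both the read and the write of
 * CR3 are privileged instructions and trap to the VMM. */
uint64_t vmm_handle_cr3_access(vm_state_t *vm, int is_write, uint64_t value)
{
    if (is_write) {
        /* Guest context switch: remember the guest's CR3 and point the
         * physical CR3 at the matching shadow page table. */
        vm->vcr3       = value;
        vm->shadow_cr3 = shadow_table_for(vm, value);
        hw_load_cr3(vm->shadow_cr3);
        return 0;
    }
    return vm->vcr3;        /* read: return the virtual, not physical, CR3 */
}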
While the SPT effectively eliminates one level of memory indirection, it introduces
the need to maintain consistency between SPTs and guest page tables. For instance,
if a certain virtual page is not mapped to the real memory according to the guest page
tables, then the shadow page tables for that process should not contain a mapping. This
is needed in order to ensure that the occurrence of page faults is consistent, irrespective
of whether the application is running in a guest O/S or on a non-virtualized platform.
Thus, page table management becomes a source of virtualization overhead.
2.3.2 Paravirtualization and Page Tables
In a traditional VMM, the virtualized abstraction that is exposed as VM is identical
to the underlying physical machine [33, 34]. Hence, operating systems need not be
modified to run in a guest VM. However, the cost of maintaining this abstraction of
identical hardware is high.
Xen [35] takes the approach of presenting the guest with a similar but nonidentical
abstraction of the real hardware using a technique called paravirtualization. Due to
the differences between real and virtual hardware, the O/S has to be patched to run in
the paravirtualized VM (referred to as a domain or dom in Xen terminology).
However, only the O/S requires patching and unmodified binaries can still be run on this
patched O/S inside the doms.
Xen handles memory virtualization by allowing guests to directly view the physical
memory and thereby eliminating the intermediate real memory [35]. The configuration
file for a user domain (domU) includes a request for a certain amount of memory. If
sufficient physical memory is available, Xen allocates the requested amount of physical
memory and reserves it for domU. Such a reservation allows the guests to directly
view their allocated physical memory and imposes strong isolation from other domains.
Whenever a modified guest O/S needs memory, it allocates a page from its reserved
pool of physical memory and registers this allocation with the Xen hypervisor.
The per-process page tables, although maintained by the guest, are write-protected
from the guest. Whenever the guest O/S needs to update a page table, it
does so by issuing a hypercall. Xen verifies that the write request from the guest O/S
is valid and makes the requested changes in the page tables. To improve performance,
multiple such hypercalls may be batched and issued by the guest O/S to avoid frequent
switching between the VM and the hypervisor.
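A sketch of such a batched update is shown below, based on the classic Xen paravirtual interface (struct mmu_update and the HYPERVISOR_mmu_update hypercall from Xen's public headers); the wrapper function and the batch size are illustrative.

#include <stdint.h>

typedef uint16_t domid_t;
#define DOMID_SELF 0x7FF0U        /* "this domain", from Xen's public headers */

/* One PTE update request, as defined in Xen's public interface. */
struct mmu_update {
    uint64_t ptr;                 /* machine address of the PTE to update */
    uint64_t val;                 /* new contents for the PTE             */
};

/* Hypercall stub provided by the paravirtualized kernel; success_count
 * receives the number of requests Xen validated and applied. */
extern int HYPERVISOR_mmu_update(struct mmu_update *req, int count,
                                 int *success_count, domid_t domid);

#define BATCH 16                  /* illustrative batch size */

/* Queue up to BATCH page-table writes and submit them in a single
 * hypercall, paying for one VM-to-hypervisor switch instead of one per
 * update; Xen validates every request before applying it. */
int update_ptes_batched(const uint64_t pte_maddr[], const uint64_t val[], int n)
{
    struct mmu_update req[BATCH];
    int done = 0;

    if (n > BATCH)
        n = BATCH;
    for (int i = 0; i < n; i++) {
        req[i].ptr = pte_maddr[i];
        req[i].val = val[i];
    }
    return HYPERVISOR_mmu_update(req, n, &done, DOMID_SELF);
}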
Eliminating the real memory removes the need to maintain shadow page tables.
However, this poses a conflict with the contiguous physical address space model that is
assumed by most guest operating systems. Xen handles this by providing a pseudo-physical memory,
which may be thought of as an analog to real memory, and by rewriting the parts of the
guest O/S which depend on physical memory contiguity to use this pseudo-physical
memory.
2.3.3 Hardware Virtualization and Two-Level Page Tables
While Xen avoids the overhead of shadow page table management, which may
be as high as 75% of the total execution time of an application [36], it still does not
completely eliminate the memory virtualization overheads. The need for the hypervisor
during page table updates and for providing pseudo-physical memory are two instances
31
of virtualization overhead in Xen. To avoid these overheads associated with software
methods of virtualizing the memory, both Intel [37] and AMD [36] have developed
hardware solutions by extending the MMUs of their respective x86-64 architectures.
These solutions, involving two levels of page tables, are known as Nested
Page Tables (NPT) and Extended Page Tables (EPT) by AMD and Intel respectively.
NPTs and EPTs provide two levels of page tables. The first level of page tables,
called guest page tables (GPTs), are similar to regular page tables and are used to map
virtual addresses to real addresses. The second level of page tables, called host page
tables, are maintained by the VMM and contain the mappings between real and
physical address spaces. Both the guest and the VMM
have their own copies of the PTBR (CR3). The guest CR3 points to the start of the guest
page tables and the host CR3 points to the base of the EPT/NPT.
When a virtual address has to be translated to a physical address, a two-dimensional
page walk takes place. The guest CR3, along with the MSBs of the virtual address,
indicates the address of the first-level page table entry in real memory. This address is
translated to the physical memory domain by walking the host page tables using the
host CR3. The translated physical address is used to read the first-level page table
entry of the guest page tables, whose contents are in turn translated from real to
physical memory. By repeating this process, the physical address corresponding to the
linear address is obtained.
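The structure of this two-dimensional walk can be summarized in a short sketch.
This is a minimal illustration, assuming hypothetical helpers (host_walk, read_phys,
index_for_level) and a four-level guest page table; a real walk would also handle
permissions, large pages and faults.

#include <stdint.h>

/* Hypothetical helpers, assumed for illustration. */
uint64_t host_walk(uint64_t host_cr3, uint64_t real_addr); /* real -> physical */
uint64_t read_phys(uint64_t phys_addr);                    /* load one PTE     */
unsigned index_for_level(uint64_t vaddr, int level);       /* VA bit slice     */

#define LEVELS 4

/* Two-dimensional page walk: every guest page-table access is itself
 * translated through the host (nested/extended) page tables. */
uint64_t nested_translate(uint64_t guest_cr3, uint64_t host_cr3, uint64_t vaddr)
{
    uint64_t real = guest_cr3;             /* guest tables live in real memory */
    for (int level = LEVELS - 1; level >= 0; level--) {
        uint64_t entry_real = real + 8 * index_for_level(vaddr, level);
        uint64_t entry_phys = host_walk(host_cr3, entry_real); /* host walk */
        uint64_t pte = read_phys(entry_phys);
        real = pte & ~0xFFFULL;            /* next-level table (real address) */
    }
    /* The leaf names a real page frame, which must itself be translated
     * to a physical frame through the host tables. */
    return host_walk(host_cr3, real) | (vaddr & 0xFFF);
}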
By allowing the guests to manage their own page tables, the need for trapping
MMU-related instructions is avoided. This reduces the overhead of memory virtualization.
It should be noted that, even with nested page tables, the TLB still caches virtual-to-physical
address translations rather than virtual-to-real address translations. Moreover,
the cost of a TLB miss increases significantly compared to non-nested page tables when
NPTs/EPTs are used, further increasing the need to reduce TLB misses.
2.4 Summary
The background information about memory management in non-virtualized and
virtualized systems presented in this chapter clearly demonstrates the complexities of
virtualizing memory. In addition to this complexity, many of the strategies that have been
used to reduce the latency of page table management, such as using EPT/NPT, as well
as the switches between the VM and the VMM necessitated by page table management
operations, have implications on the behavior of the Translation Lookaside Buffer. These
implications, the performance delay caused by the TLB, and ways of avoiding this
performance delay form the focus of the remainder of this dissertation.
CHAPTER 3
A SIMULATION FRAMEWORK FOR THE ANALYSIS OF TLB PERFORMANCE
The growing use of virtualization for server consolidation on CMP platforms [5,
24, 38] has emerged as a new paradigm in the high-end server computing industry.
However, one issue with such virtualization-based resource consolidation is the
performance degradation of virtualized workloads. In fact, improving the performance of
virtualized workloads to near-native levels has been the focus of much research [6, 39–
45]. The x86 architecture, which is one of the most popular virtualized platforms [15],
has also been modified with hardware virtualization extensions to improve the
performance of virtual machines. Starting with the VT extensions [46], there have
been many changes in this direction including Intel VT for Connectivity and Intel VT
Directed I/O [47]. Similar developments from AMD include the AMD-V virtualization
technology [36] and the Direct Connect Architecture [48].
As mentioned in Chapter 1, the TLB is critical in determining the performance of
virtualized workloads [14]. Hence, it is no surprise that the most recent virtualization
extensions to the x86 architecture have focused on the TLB. Specifically, the TLB
architecture has been modified by the addition of tags as a part of the TLB entry and
by providing hardware primitives for rapid tag comparison [36, 37, 48]. Due to these
changes in the TLB architecture, there is a need for reexamining and understanding the
TLB behavior of workloads in virtualized settings in order to solve issues involving tag
generation and management. Furthermore, the optimum tagged TLB architecture, in
terms of size and associativity, should be explored.
One way of obtaining this understanding is by conducting a simulation-based study
wherein the effect of various architectural and workload related parameters on the TLB
performance can be explored. Moreover, using such a simulation-based approach
will facilitate understanding the impact of the TLB on the performance of virtualized
workloads and will allow the comparison of various TLB-related performance-enhancing
ideas.
3.1 Survey of Simulation Frameworks Used in TLB-Related Research
The Translation Lookaside Buffer has been the target of many research works.
TLB prefetching [49–51] has been explored to increase the TLB hit ratio. Chadha et
al. [52, 53] have used functional models with SoftSDV [54] simulator to study the TLB
behavior of I/O-intensive virtualized workloads. Tickoo et al. [18] have explored TLB
tagging in their qTLB approach. Ekman et al. [55] estimate the TLB to be responsible
for up to 40% of the power consumption in caches. Various circuit-level and architectural
techniques [16, 56–59] as well as compiler-level code transformation [60] have
been explored to reduce the TLB power consumption. However, these previous
studies involving the hardware-managed TLB (such as the x86 TLB) have used
SimpleScalar [61] or custom-built trace-driven simulators [62] and not TLB timing models
in a full-system environment, thereby ignoring the interaction of the workload with the
O/S/VMM. Even in cases where full-system simulation has been used, the TLB timing
has not been modeled [52, 53] or the x86 architecture has not been simulated [50, 51].
A possible reason why the studies involving hardware-managed TLBs on x86 have
not used timing-based metrics, or use simplified simulators which are not full-system
simulators and tend to ignore hypervisor effects, may be the lack of simulator support.
Commonly used x86 simulators are either not full-system simulators or do not model the
timing behavior of the TLB. Zesto [63], which supports cycle-accurate simulation for x86
and models the TLB, cannot boot an O/S and does not support full-system simulation.
PTLSim/X [64] is a full-system simulator for x86 that can simulate an entire O/S and the
binaries running inside it, by running the O/S as a guest on top of a modified version
of Xen. However, it is not capable of simulating the hypervisor itself, which makes it
unsuitable for full-system studies on virtualized platforms. SimOS [65] supports the x86
architecture, but it does not support running a virtual machine monitor. M5 [66], while
providing full-system support and timing models, does not support the x86 architecture.
Simics [67] is a full-system simulator that is capable of booting and running Xen and
multiple guest O/S, but requires extensions to support timing studies. GEMS [68]
provides one such timing framework, however it does not support the simulation of
the x86 ISA. FeS2 [69] is an accurate execution-driven timing model that includes a
cache hierarchy, branch predictors and a superscalar out-of-order core. It supports x86
and can be plugged into Simics. COTSon [70] is a similar timing simulator that can be
plugged into AMD SimNow [71]. But neither FeS2 nor COTSon provide timing models
for the TLB.
Thus, there is a clear need for a simulation framework for simulating the behavior of
hardware-managed TLBs on virtualized platforms that meets the following requirements:
• The framework should support configurable TLB functional and timing models. Since recent hardware-managed TLBs incorporate tags as a part of the TLB entry, the functional TLB model should support the simulation of tagged TLB functionality as well.
• As x86 is the most common virtualized platform, the simulator should support the simulation of the x86 ISA. It is also desirable that the framework simulates the x86 ISA at the micro-operations (µops) granularity.
• To capture the interaction between the hardware, the VMM, the VM and the application, it is imperative that the simulator be a full-system execution-driven framework.
Developing such a simulation framework forms the focus of this chapter.
3.2 Developing the Simulation Framework
The full-system simulation framework developed for analyzing the TLB behavior
on virtualized platforms uses Simics [67] and FeS2 [72] as foundations. The basic
functional TLB model in Simics is replaced with a generic tagged TLB model. TLB
timing models are also developed and incorporated into the timing flow of FeS2. These
components of the simulation framework are described in this section.
3.2.1 Using Simics and FeS2 as Foundation
The simulation framework, shown in Figure 3-1, consists of Virtutech Simics [67]
(version 3.0.1), a full-system simulation platform capable of simulating high-end
target systems with sufficient fidelity and speed to boot and run operating systems
and workloads. Simics uses a functional CPU model with atomic and sequential
execution of instructions, wherein the execution of every instruction takes exactly one
cycle. The processor model is non-pipelined and only x86 CPUs without hardware
virtualization support are modeled. Simics also provides a rich set of microarchitectural
components including the cache and TLB which can be incorporated with the CPU. In
such simulations, the execution time for an instruction is increased by any stalls that may
be caused by the memory subsystem for that instruction, but the execution model is still
sequential. Moreover, only the caches and the memory can stall an instruction and the
hit and miss latencies associated with the TLB are ignored.
Simics also provides the capability to install callback functions and associate them
with the occurrence of specific events such as TLB misses and context switches. While
Simics provides a microarchitectural interface (MAI) timing model, which emulates a
pipeline and out-of-order execution, it does not simulate at the granularity of x86
micro-operations (µops).
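For instance, installing a callback on a simulator event might look like the sketch
below. It assumes the Simics 3.x C API entry point SIM_hap_add_callback; the hap
name and the callback payload are illustrative, since the exact parameters depend on
the specific hap being registered.

/* Sketch: installing a callback on a simulator event (Simics-style).
 * The hap name below is illustrative; actual callback signatures
 * vary per hap. */
#include <simics/api.h>   /* assumed Simics module header */

static void on_mode_change(void *user_data, conf_object_t *cpu)
{
    /* e.g., record a context switch or notify the tagging framework */
}

void install_callbacks(void)
{
    SIM_hap_add_callback("Core_Mode_Change",
                         (obj_hap_func_t)on_mode_change,
                         NULL /* user data */);
}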
To support timing-based analysis, a timing model based on the FeS2 [69] simulator
is used. FeS2 works on a timing-first methodology, where the functional correctness is
provided by Simics and the timing information by FeS2. An x86 instruction is fetched
and decoded into µops, using the decoder from PTLSim [64], which are then executed
and retired. During the retirement phase, the corresponding x86 instruction is allowed to
execute in Simics. Then, the state of the system maintained by FeS2 is compared to the
functionally-correct state maintained by Simics. If these states do not match, the FeS2
pipeline is flushed and restarted at the next instruction. FeS2 relies on Simics to supply
functional data such as the contents of a given memory location and the translation for
a given virtual address. Thus FeS2 provides an effective "timing plugin" to the Simics
simulator. Coupling FeS2 with Simics creates a framework which satisfies all the
requirements for simulation studies involving virtualized workloads, except for the lack
of advanced TLB functionality (such as tagging) and timing models for the TLB.
[Figure 3-1 appears here: the Simics-simulated physical machine (3GB memory, functional CPU) runs Xen 3.1.0/2.6.18-Xen with dom0 and two user domains (1GB memory and one VCPU each, running Workload 1 and Workload 2); the FeS2 timing model with the TLB timing model plugs into Simics, and the tagged TLB comprises the GMT (process/VM tags), an extended TLB (tag, VPN, PPN), a TagCache and a tag comparator.]
Figure 3-1. Simulation framework for analyzing TLB performance. The framework is built using Simics and FeS2 as foundations. A generic tagged TLB functional model as well as a TLB timing model is incorporated.
3.2.2 TLB Functional Model
The x86 processor model in Simics [73] has a functional TLB model consisting
of four 64-entry 4-way associative TLBs. These TLBs are organized as two DTLBs
and two ITLBs, one each for 4KB small pages and for large pages. A First In First Out
(FIFO) replacement policy is used in these TLBs. As this TLB functional model does not
support storing tags as a part of the TLB entry or incorporate tag checking as a part of
the TLB lookup, a generic tagged TLB functional model is created.
The tagged TLB model consists of four components, as shown in Figure 3-1: 1. the
Generation and Management of Tags (GMT) module, 2. the extended TLB, which stores
a tag as a part of every entry, 3. the TagCache, which stores the current tag, and 4. a
tag comparator for comparing tags during TLB lookup. Depending on the details of the
specific tagged TLB solution being modeled, one or more of these components may not
be needed. For instance, when modeling a tagged TLB solution where the assignment
of tags is done by the system software, the GMT need not be simulated. However,
creating models for all these components makes this tagged TLB model flexible enough
to simulate any tagging solution.
To add the tagging functionality, the GMT, TagCache and comparator are added
as model extensions to Simics, similar to the AntFarm extension by Jones [74]. The
GMT is implemented in such a manner that it is capable of examining the state of
the CPU of which it is a part. The Simics TLB model is extended by adding a tag to
the data structure for every entry. In addition to the FIFO replacement policy, an LRU
replacement policy with timestamps based on the Simics clock is added. The TagCache
is modeled as a register which is wide enough to cache one entry of the GMT. The
comparator functionality is implemented by looking up the current tag from the
TagCache and using it as a part of the TLB lookup logic. APIs to facilitate communication
between the GMT and the TLB are also implemented. Every time a TLB flush is
triggered by writing a new value to the CR3 register, the extended TLB module
communicates this new value to the GMT module using these APIs. The GMT makes
the appropriate changes and updates the TagCache. The GMT then, depending on the
functionality being simulated, indicates whether the TLB flush can be avoided. If the TLB
flush cannot be avoided, the extended TLB's contents are flushed.
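A minimal sketch of the extended-entry lookup and the CR3-write path described
above is given below; the entry layout and the GMT policy hooks (tag_for_new_cr3,
flush_avoidable) are illustrative names, not the framework's actual API.

#include <stdint.h>
#include <stdbool.h>

/* Extended TLB entry: a translation plus the tag of its address space. */
struct tlb_entry {
    uint64_t vpn, ppn;
    uint32_t tag;
    bool     valid;
    uint64_t ts;         /* Simics-clock timestamp for LRU replacement */
};

static struct tlb_entry tlb[1024];
static uint32_t tag_cache;   /* current tag (the "TagCache" register) */

/* Lookup: hit only if both the VPN and the current tag match. */
bool tlb_lookup(uint64_t vpn, uint64_t *ppn)
{
    for (unsigned i = 0; i < 1024; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].tag == tag_cache) {
            *ppn = tlb[i].ppn;
            return true;
        }
    return false;
}

/* Illustrative GMT hooks (assumed, not the framework's real API). */
uint32_t tag_for_new_cr3(uint64_t cr3);
bool     flush_avoidable(uint64_t cr3);

/* CR3 write: consult the GMT; flush only when the tags cannot
 * disambiguate the old and new address spaces. */
void on_cr3_write(uint64_t new_cr3)
{
    tag_cache = tag_for_new_cr3(new_cr3);  /* GMT updates the TagCache */
    if (!flush_avoidable(new_cr3))
        for (unsigned i = 0; i < 1024; i++)
            tlb[i].valid = false;
}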
3.2.3 Validation of the TLB Functional Model
The validation of the TLB functional model consists of verifying that the TLB is
functionally correct when the tags are used to avoid TLB flushes. Any error in the
functionality will result in retaining stale entries which are inconsistent with the page
tables. Hence, verifying the consistency of the TLB entries serves to validate the
tagged TLB implementation. For this, a Functional Check mode is implemented. In
this mode, whenever there is a hit in the tagged TLB, a page walk is performed to
obtain the translation TransPW, consisting of the physical address corresponding to the
linear address and all the page attributes such as the read/write bit, the global bit, the
page mode bit, and the PAT and MTRR bits. This translation is then compared to the
translation TransTLB present in the tagged TLB. If these translations do not match, an
inconsistency is declared. It should be noted that the Functional Check mode severely
slows down the simulation and is used only for validation of the TLB functional model.
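The check itself amounts to a field-by-field comparison of the two translations; the
sketch below illustrates it, with the attribute layout and the declare_inconsistency
reporting hook chosen for illustration.

#include <stdbool.h>
#include <stdint.h>

/* Translation with the attributes the check compares (illustrative layout). */
struct xlat {
    uint64_t phys;
    unsigned rw : 1, global : 1, page_mode : 1, pat : 3, mtrr : 3;
};

void declare_inconsistency(uint64_t linear);  /* assumed reporting hook */

/* On every tagged-TLB hit, redo the page walk and compare the results. */
void functional_check(uint64_t linear, struct xlat trans_tlb, struct xlat trans_pw)
{
    bool same = trans_pw.phys      == trans_tlb.phys
             && trans_pw.rw        == trans_tlb.rw
             && trans_pw.global    == trans_tlb.global
             && trans_pw.page_mode == trans_tlb.page_mode
             && trans_pw.pat       == trans_tlb.pat
             && trans_pw.mtrr      == trans_tlb.mtrr;
    if (!same)
        declare_inconsistency(linear);
}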
3.2.4 TLB Timing Model
[Figure 3-2 appears here: the functional simulator (functional memory, CPU, ITLB and DTLB) is coupled to the timing simulator, whose pipeline stages are fetch-and-decode, rename, execute, complete and commit; the timing ITLB and DTLB models and the tagging framework (GMT, TagCache) sit between the two, with numbered arrows (1-4) showing the timing flow and lettered arrows (A-E) showing the functional data flow for the TLB model and for FeS2.]
Figure 3-2. Timing flow in the simulation framework. FeS2 plugging into Simics and the TLB timing models plugging into FeS2 are shown. The flow of timing during a TLB lookup is illustrated.
FeS2 does not implement either the instruction or the data TLB. Whenever an
address translation is needed, FeS2 queries Simics using a Simics-provided API. This
API returns the translation irrespective of whether it is present in the Simics functional
TLB or not. If the functional TLB does not contain the needed translation, Simics walks
the page table, computes the translation, populates the TLB and returns the translation,
completely transparently to FeS2. Moreover, the details of any cache misses caused by
the page walk are also not communicated to FeS2 by this API. Thus, FeS2 is unable
to account for different execution times for a µop depending on whether the lookup it
triggered hit or missed in the TLB and, in case of a miss, whether there were any cache
misses.
This behavior of FeS2 is modified by implementing timing models for the ITLB and
DTLB and integrating them into FeS2, as shown in Figure 3-2. After the addition of
these models, the fetch-and-decode stage queries the timing model, instead of using
the Simics API, whenever an address translation is needed. This path is shown by the
arrow labeled 1 in Figure 3-2. The timing model queries the functional TLB model as
shown by arrow A. If the translation is not present in the functional TLB, the timing
model reads the CR3 value and calculates the first address to be looked up in the page
walk process. It then inserts a lookup for this address in the cache hierarchy maintained
by FeS2. Once this lookup returns, the actual value stored at this address is obtained
from the Simics functional memory, as shown by arrow B, and used to calculate
the next address in the page walk process. This process is repeated until the entire
translation is computed. Once computed, the functional TLB is populated using this
translation, as shown in Figure 3-2 by arrow A. If the functional model is simulating a
tagged TLB, the populated entry is tagged with the corresponding tag and timestamped.
Then, the instruction which was stalled during this process is released, as shown by
arrow 2.
Similarly, the DTLB timing model is queried if an address translation is needed for
a Load or a Store instruction in the execute stage, and it returns after a certain latency,
as shown by arrows 3 and 4 respectively. The flow of functional data between the
DTLB timing model and the functional TLB is shown by arrow C. In case of the lookup
missing in the DTLB and triggering a page walk, the data flow between the DTLB timing
model and the memory is shown by arrows D in Figure 3-2. After this lookup returns,
the execution of the µop which was stalled is allowed to continue.
The latency of a TLB lookup depends on whether the required information is found
in the functional TLB. If it is a miss, then the page walk latency (PW) determines the
time for which the corresponding instruction or µop is stalled. This page walk latency
(also referred to in this dissertation as the TLB Miss Penalty) is the minimum number of
stall cycles experienced by a µop due to a TLB miss whose page walk does not miss in
the L1 cache. If there are any cache misses in the page walk, the µop will be stalled
for the latency of those misses in addition to this page walk latency. Thus, a proper
choice of the page walk latency is important. To determine these values for the TLB, the
RightMark Memory Analyzer (RMMA) [75] is utilized. RMMA allows the estimation of
vital low-level system characteristics including the latency and bandwidth of the RAM
and the average and minimal latencies, along with the size and associativity, of different
levels of cache and the TLB. The RMMA test suite is run on a 64-bit Intel Core2 Duo
CPU running 32-bit Windows XP. From the results of this experiment, a default page
walk latency of 60 cycles is chosen.
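In other words, the stall charged to a µop on a TLB miss is the fixed page walk
latency plus any cache-miss latencies incurred by the walk. A one-line sketch of this
accounting, with illustrative names, is:

/* Stall cycles charged to a µop on a TLB miss (illustrative accounting):
 * the fixed page-walk latency plus any cache misses the walk incurred. */
unsigned tlb_miss_stall(unsigned pw_latency,
                        unsigned walk_cache_misses, unsigned miss_latency)
{
    return pw_latency + walk_cache_misses * miss_latency;
}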
3.2.5 Validating the TLB Timing Model
As described in Section 3.2.1, this simulation framework is built on top of
well-documented and established simulators, i.e., Simics and FeS2. Hence the validation
process is confined to the TLB timing model that has been developed in this work.
Validation of the timing part of the simulation framework consists of ensuring that
the behavior of the TLB timing model is as expected. For this validation, a simplified
pipeline, with the width of every stage set to one, is considered. This ensures that a
stall in one particular µop will stall the entire pipeline and no out-of-order execution is
possible. It should be noted that this simplification is only for the validation process;
an un-simplified pipeline with out-of-order execution capability is used for the
experiments discussed in the remainder of this dissertation. The sizes of the L1 and the
L2 caches are set to a large value of 2 MB each, thereby ensuring that the page tables
are cached and the stalls due to page-walk-related cache misses are minimized. Thus,
in this simplified scenario, the primary cause of memory subsystem stalls is the TLB
misses and the ensuing page walks.
Table 3-1. Pseudocode of the micro benchmark for TLB timing model validation
/*
 * Micro-benchmark with well-defined TLB behavior, here fleshed out from
 * the original pseudocode into compilable C (4KB page size assumed).
 */
#include <stdlib.h>

#define PAGE_SIZE 4096
#define N_PAGES   64

int main(void)
{
    /* Step 1: allocate a contiguous block of N pages */
    char *pages = aligned_alloc(PAGE_SIZE, N_PAGES * PAGE_SIZE);
    if (!pages)
        return 1;

    /* Step 2 - Warmup: touch the first byte of every page to warm up
     * the TLB and caches with the required page table entries */
    for (int i = 0; i < N_PAGES; i++)
        pages[i * PAGE_SIZE] = 1;

    /* Step 3 - TLB Miss Producing Section: the number of misses
     * produced here is a function of the TLB size and the number
     * of pages touched, T */
    int T = N_PAGES;   /* varied from 16 to 64 across experiments */
    for (int i = 0; i < T; i++)
        pages[i * PAGE_SIZE] = 2;

    return 0;
}
Then, a micro-benchmark with a well-defined TLB behavior, for which the number of
TLB misses for a given TLB size is predictable, is created. Its pseudocode is shown
in the listing in Table 3-1. The micro-benchmark consists of three steps. In the first step,
a contiguous block of N pages, each of size 4KB, is allocated. In step 2, the first byte of
each of these N pages is accessed to warm up the TLB and cache with the necessary
page table entries. Then, in step 3, the first T of these N pages are accessed and
some value is written into these pages.
If the TLB is large enough to hold all the N translations (along with the required
O/S/VMM translations) which were looked up in step 2, then step 3 will not cause any
misses in the TLB. On the other hand, a smaller TLB will result in about T misses.
Thus, the time for executing step 3 depends on the number of TLB misses, which in
turn is decided by the TLB size. In such a scenario, the execution time for step 3 in the
simplified pipeline can be theoretically estimated for various TLB sizes. Comparing
these estimations to the values obtained from simulations using the TLB timing model
serves to validate the TLB timing model.
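As a concrete illustration (under the idealized assumption that the smaller TLB
misses on all T pages in step 3 while the larger one misses on none, and ignoring
walk-related cache misses), the estimated difference reduces to

    D_Est ≈ T × PW

so that, for example, T = 64 and PW = 60 cycles give roughly 3840 cycles, which is
consistent with the scale of the values in Figure 3-3.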
[Figure 3-3 appears here: simulator clock cycles (0 to 6000) versus the number of pages touched T (16 to 64), with D_Est and D_Sim curves for page walk latencies of 30, 60 and 90 cycles.]
Figure 3-3. Validation of the TLB timing model. The estimated value (D_Est) and simulated value (D_Sim) of the difference in the execution time for step 3 of the micro-benchmark in Table 3-1, with 64-entry and 256-entry TLBs, are obtained and compared. The simulated values match the estimated values quite closely.
Two fully-associative TLBs of sizes 64 entries and 256 entries are considered. By
ensuring that the TLBs are fully associative, the TLB size becomes the only determinant
of the number of misses. Since the number of TLB misses for a given TLB size can
be predicted, the time for executing Step 3 with these two TLB sizes is estimated and
the difference in these times, D_Est, is calculated. Then, the micro-benchmark is
simulated with fully-associative 64-entry and 256-entry TLBs using the developed TLB
timing model. The execution time for Step 3 is noted and the difference between the
execution times obtained from the simulations, D_Sim, is calculated. This experiment
is repeated for different values of T and different values of the page walk latency, and
the comparison of the obtained D_Sim and D_Est values is shown in Figure 3-3. From
this, it can be seen that the difference as obtained from the simulator tracks the
theoretically estimated difference quite closely. The maximum deviation between D_Est
and D_Sim is about 6.59%, for T = 64 and a small page walk latency of 30 cycles. For
larger page walk latencies the deviation drops to less than 3.5%. This verifies that the
behavior of the timing model is as expected.
3.3 Selection and Preparation of Workloads
The advantage of a full-system simulation framework, such as the one described in
Section 3.2, is that it allows the complete system and application software stack to be
run on the simulated platform. This section describes the software stack used in this
dissertation.
For the single-O/S scenario, a Debian Linux 2.6.18 kernel with PAE support is
booted on the Simics-simulated "physical machine" and the workload applications
are launched as processes in this Linux environment. For the virtualized scenario,
Xen [35] is selected. Xen is an open-source hypervisor which can support paravirtual
guests running modified versions of operating systems (XenoLinux), or hybrid virtual
machines running unmodified O/S (if the processor has virtualization support built in).
Since virtualization extensions are not supported by the Simics x86 CPU models, the
paravirtual version of Xen is used. On top of the Simics-simulated "physical machine",
the Xen-3.1.0/2.6.18-xen kernel, with PAE support and with HAPs compiled in to trigger
various functions during inter-domain switches, is booted. On booting, Xen starts up
a control VM or domain called dom0. From this domain, user domains or domUs are
created and the workload applications are launched inside the user domains.
3.3.1 Workload Applications
One common workload which is used to benchmark virtualized platforms is
VMMark [76]. Here, common server applications including Outlook Mail server, Apache
webserver, Oracle database server, SPECjbb and dbench are put together to form
a consolidated workload. Due to licensing issues in using VMMark, a similar suite of
applications is created in order to have varied workloads. The applications included in
this suite are:
• TPCC-UVa [77], an open-source implementation of the TPC-C benchmark standard, which represents typical database transaction processing server workloads. It uses the PostgreSQL database system and a simple transaction monitor to measure the performance of systems, and forks off one client process per warehouse. In all the simulations in this dissertation, the number of warehouses is set to 4.
• dbench [78], a disk I/O-intensive file server workload. Similar to TPCC-UVa, dbench is an I/O-intensive workload. However, the I/O component in dbench is much larger than in TPCC-UVa.
• SPECjbb 2005 [79], another OLTP-class workload. SPECjbb differs from TPCC-UVa as it emulates only the server side of an OLTP system [79], whereas TPCC-UVa emulates both client and server operations. Moreover, SPECjbb 2005 has a significantly larger memory requirement [80, 81] as its entire database is held in memory, whereas TPCC-UVa stores its database on disk and accesses it as needed. In all the simulations conducted in this research, the heap size of the JVM in which SPECjbb runs is set to 256MB.
• Vortex [82, 83], a database manipulation workload from the SPEC CPU 2000 benchmark suite. This workload, similar to SPECjbb, also uses a significant amount of memory.
3.3.2 Consolidated Workloads
Consolidated workloads consist of multiple applications constituting the effective
workload. On Linux, consolidated workloads are created by running the applications
as separate processes. To generate such consolidated workloads on Xen, the first
application is run in its domain and paused, using the Xen management tools [84],
when the point of interest is reached. The point of interest is the phase where the
warmup phase, like reading the database into memory for typical database transaction
processing workloads, is completed and the long-running service phase begins
execution. By repeating this process for all the applications, multiple virtual machines
with the applications running inside them are brought up. All the paused VMs are then
resumed at the same time, ensuring that all applications contribute to and influence the
behavior of the consolidated workload.
3.3.3 Multiprocessor Workloads
Both uniprocessor and multiprocessor simulations may be performed using the
developed simulation framework. In multiprocessor scenarios, Xen allows pinning [84] of
virtual CPUs (VCPUs) to physical CPUs. Pinning is a concept wherein a certain VCPU
is associated with one or more physical CPUs. This restricts the scheduling of the VCPU
to one of the physical CPUs to which it is pinned. By an intelligent use of the pinning
mechanism, long-running domains can be given their own CPUs to ensure uninterrupted
performance.
The terminology used in this dissertation to describe pinned configurations is
illustrated with an example setup in which the simulated "physical" machine has
two x86 CPUs. Xen is booted on this machine and dom0 is started with two VCPUs.
In addition to dom0, two virtual machines with one VCPU each, dom1 and dom2,
are created. The workload running on both user domains is TPCC-UVa. This
configuration is termed "TPCC-TPCC-nopin", as no domain is explicitly pinned to any
CPU.
Then, using the pinning commands of Xen, the dom0 VCPUs are restricted to run
only on "physical" CPU0 and the VCPUs of dom1 and dom2 are bound to "physical"
CPU1. Since only dom0 can be scheduled on CPU0 and only dom1 and dom2 can be
scheduled on CPU1, this pinning configuration is termed "TPCC-TPCC-0012". In the
case of uniprocessor simulations, pinning makes no difference, as there is only one
CPU on which all the VCPUs are scheduled. Hence the nopin/pin annotation is omitted
for single-processor scenarios.
3.3.4 Checkpointing Workloads
A typical usage model for low-throughput high-fidelity simulators is checkpointing. In
such cases, the simulation is run to a certain point in a mode where the simulation
throughput is quite high. Invariably, the data obtained during this phase is limited
and is ignored. Once the point of interest has been reached, the simulation state is
checkpointed. Then the simulation is restarted in a low-throughput mode where the
fidelity of the simulation and the quality of the data obtained are high. Such a usage
scenario is possible using the developed framework. Simics [67] supports checkpointing,
wherein the entire state of the system, including the memory and I/O subsystems, is
saved in the form of compressed files. These files can be copied from one machine to
another and used without any loss of data. Using this method, checkpoints of the
single- and multi-domain workloads are prepared. A screenshot of a simulated machine
running 6 domains is shown in Figure 3-4. Further details of using these checkpoints for
long-running parametric-sweep-type simulations in batch mode are discussed in
Appendix B.
3.4 Evaluation of the Simulation Framework
One of the biggest disadvantages of a full-system simulation framework is that the
speed of simulation is much lower compared to trace-driven simulators. This is indeed
one reason why trace-driven simulators are preferred when only one subsystem is under
consideration. In this section, the speed of the simulation framework with and without
timing models is examined. The speed of various simulation modes is characterized
by the throughput of the simulation framework, calculated, as shown in Equation 3-1,
as the number of x86 instructions simulated using the framework in a given second
of wall clock time. The results from these investigations will help understand the time
requirements involved in simulation-based analysis and plan accordingly. Such an
understanding is important when simulations are performed on shared resources
using schedulers such as Maui [85] and Torque [86], where the user has to provide the
anticipated time for the simulation to aid in scheduling the jobs properly.
Figure 3-4. Screenshot of the simulation framework in use. The uniprocessor simulated machine has six user domains (domU) and one control domain (dom0). Five of the six user domains are paused, while dom1 is running TPCC-UVa.
Throughput = (Simulated x86 instructions) / (Wall clock time)    (3–1)
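As a worked example (using the throughput figures reported later in this section), a
1-billion-instruction run at roughly 23,000 simulated instructions per second takes about
10^9 / 23,000 ≈ 43,500 seconds, i.e., roughly 12 hours of wall clock time.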
To evaluate the speed of simulation when the TLB timing model is used, a 3GB
1-CPU x86 "physical machine" is simulated using Simics. The TLB is configured to be
fully associative with a size of 1024 entries and a page walk latency of 60 cycles. Xen is
booted on this "physical machine" and TPCC-UVa, running on a domU, is simulated in
three different simulator configurations: 1. just Simics, without FeS2 or the TLB timing
model (Only Simics), 2. Simics with FeS2 plugged in, but without the TLB timing model
(Simics + FeS2), and 3. Simics with FeS2 and the TLB timing model (Simics + FeS2 +
TLB Timing Model). The length of the simulation is varied from 1 million to 1 billion x86
instructions. These simulations are run on an IBM System x tower server with two Intel
Xeon 2GHz cores and 7GB memory running 32-bit Linux 2.6.22.6 with PAE support. On
this machine, the simulations are run till the specified number of x86 instructions are
committed and the wall clock time for the run is noted. From these, the throughput of
the simulation framework is computed and used to quantify the speed of the simulation
framework; the results are presented in Figure 3-5.
[Figure 3-5 appears here: throughput in KIPS (log scale, 1 to 1000) for runs of 1M, 10M, 100M and 1B instructions under each of the three configurations: Only Simics, Simics + FeS2, and Simics + FeS2 + TLB Timing Model.]
Figure 3-5. Throughput of the simulation framework for uniprocessor simulations with virtualized TPCC-UVa workload. The throughput of the simulation is measured as the x86 instructions retired per second of wall clock time and is presented as Kilo Instructions per Second (KIPS). The speed of simulation is reduced by 10× when FeS2 and TLB timing models are used compared to the throughput in purely functional mode with Simics.
As discussed in Section 3.2.1, Simics is primarily a functional-level simulator and
does not provide timing models for the TLB. Hence, the throughput achieved by using
just Simics is quite high, of the order of 0.1 million simulated instructions per second,
as seen from Figure 3-5. Moreover, the throughput increases with the total number of
x86 instructions simulated. This increase is caused by the amortization of the startup
costs of the simulation (such as setting up the data structures representing various
microarchitectural components), which do not contribute towards the throughput, over
the longer runs. For simulations involving 1 billion instructions and more, the throughput
achieved is close to 0.7 million instructions per second.
The slowdown in throughput caused by using FeS2 is considerable, even when the
TLB timing model is not used (Simics + FeS2), as can be seen from Figure 3-5. Even
for long-running simulations of 1 billion instructions, this slowdown is as much as 30
times, and the throughput achieved is only about 23,000 x86 instructions per second.
This slowdown is further compounded when the TLB timing model is used, which
lowers the throughput to about 12,000 x86 instructions per second.
The throughput of the simulation framework for multiprocessor simulations, where
the simulated machine has more than one CPU, is also examined by simulating an
x86 machine with 3 GB of memory and two user domains running TPCC-UVa and
Vortex. The number of CPUs in this simulated machine is varied between 2 and 8.
For greater fidelity of simulation, Simics is set to simulate 1 x86 instruction on a CPU
before switching to the next in round-robin fashion. The throughputs of these simulations
are presented in Table 3-2. For brevity, only the speeds for long runs (1 billion x86
instructions) are presented.
The high-frequency switching between the simulated CPUs causes a high overhead,
degrading the throughput as the number of CPUs increases, even when FeS2 and the
TLB timing model are not used. For instance, the speed of a 2-CPU simulation is only
about a third of that of the 1-CPU simulation. When FeS2 and the TLB timing model are
used, the simulation speed reduces further and is about 10× smaller than the speed
without FeS2.

Table 3-2. Throughput of the simulation framework for multiprocessor x86 simulations
# CPUs in simulated machine   Simulator configuration             Simulated KIPS
1                             Only Simics                         667.99
                              Simics + FeS2                       23.08
                              Simics + FeS2 + TLB Timing Model    13.84
2                             Only Simics                         260.41
                              Simics + FeS2                       5.24
                              Simics + FeS2 + TLB Timing Model    3.68
4                             Only Simics                         208.33
                              Simics + FeS2                       4.23
                              Simics + FeS2 + TLB Timing Model    3.23
8                             Only Simics                         98.32
                              Simics + FeS2                       1.95
                              Simics + FeS2 + TLB Timing Model    1.69

3.5 Using the Framework to Investigate TLB Behavior in Virtualized Platforms
One of the motivating factors for developing the simulation framework is to
understand the TLB behavior in virtualized scenarios and quantify the impact of the
TLB on the performance of virtualized workloads. To achieve this, consolidated and
unconsolidated workloads consisting of the applications described in Section 3.3.1
are simulated using the framework described in Section 3.2. Three different metrics
are used to characterize the TLB behavior of a workload: 1. the number of flushes,
2. the ITLB and DTLB miss rates, and 3. the impact of the TLB misses on the workload
performance. Each of these metrics characterizes the TLB behavior at a different
granularity, and they are used to illustrate key insights into the behavior of the TLB in
virtualized scenarios.
The Simics-simulated machine in all the experiments in this chapter is configured
to have one CPU and an untagged TLB. In these simulations, the values of parameters
not related to the TLB, such as the pipeline width and cache sizes, are maintained at
FeS2's default values, shown in Table 3-3. The TLB size is selected to cover both the
range of existing TLB sizes found in modern x86 processors and larger sizes. As
mentioned in Section 3.2.4, the value of the page walk latency is determined to be 60
cycles based on RMMA experiments on an Intel Core2 Duo processor. However, since
the page walk latency (PW) will have an effect on R_IPC, a range of latencies from 30
cycles to 90 cycles is used for the simulations.
Table 3-3. Simulation parameters for investigating TLB behavior on virtualized platforms
Parameter                      Values
Number of Processors           1
TLB Size                       64, 128, 256, 512, 1024
TLB Associativity              8
TLB Page Walk Latency (PW)     30 - 90 cycles
L1 Cache Size                  8 MB
L1 Cache Miss Latency          8 cycles
L2 Cache Size                  32 MB
L2 Cache Miss Latency          100 cycles
Pipeline Fetch Width           4
Pipeline Rename Width          4
Pipeline Execute Width         4
Pipeline Retire Width          4
Memory Width                   2
Length of Simulation           1 billion x86 instructions
3.5.1 Increase in TLB Flushes on Virtualization
The disadvantage of virtualization, with respect to the Translation Lookaside
Buffer, is that it increases the number of processes which share the TLB, which
raises the number of context switches between these address spaces. By the very
nature of hardware-managed TLBs, consistency is maintained during these context
switches by flushing the TLB, resulting in a large number of TLB flushes and subsequent
TLB misses. The increase in the number of flushes is further compounded by the
virtualization requirement that certain privileged instructions (such as I/O and page
table updates) have to be trapped and executed by the hypervisor or the virtual machine
monitor (VMM), even though they are issued by the virtual machine (VM). Conforming
to this requirement causes switches between the VM and the VMM, which further
increases the TLB miss rate.
The comparison of the number of flushes obtained for the virtualized and non-virtualized
workloads is shown in Figure 3-6. As explained in Section 3.3.1, TPCC-UVa consists of
many processes, and the context switches between these processes flush the TLB quite
frequently. Hence, TPCC-UVa exhibits a large number of flushes per instruction, even
when it runs in a non-virtualized system. But the frequency of TLB flushes increases
by almost 10× on virtualization. A similar behavior is seen for the dbench workload, as
it is I/O-intensive in nature as well. This behavior is due to the I/O component of these
benchmarks, which requires switching between the dom on which the application runs
and dom0, which contains the I/O back-end drivers. On the other hand, when SPECjbb
and Vortex are considered, the number of flushes, while still larger on virtualized
platforms than on Linux, is smaller compared to the I/O-intensive TPCC-UVa or dbench.
In these applications, the ratio of the flushes in virtualized and non-virtualized scenarios
is smaller than for the I/O-intensive benchmarks.

[Figure 3-6 appears here: TLB flushes per million instructions (0 to 8) on Linux and Xen, for each of TPCC, dbench, SPECjbb and Vortex.]
Figure 3-6. Increase in TLB flushes on virtualization. Comparing the TLB flushes in non-virtualized and virtualized platforms reveals a 7× to 10× increase in the number of flushes for virtualized workloads.
3.5.2 Increase in TLB Miss Rate on Virtualization
The effect of the TLB being flushed more frequently is that the lifespan of the TLB
entries reduces to the order of a few hundred thousand cycles, posing a big barrier to
improved VM performance [14]. The impact of this increased number of flushes can
be understood by examining the miss rates for the non-virtualized applications and
contrasting them with their virtualized counterparts. In Figure 3-7, the number of TLB
misses per thousand instructions (MPKI) for all four workloads, in both virtualized and
non-virtualized scenarios, is presented.
[Figure 3-7 appears here, with four panels plotting misses per thousand instructions (MPKI) on Linux and Xen at TLB sizes of 64, 128, 256, 512 and 1024 entries: A) DTLB miss rate for I/O-intensive workloads (TPCC, dbench); B) DTLB miss rate for memory-intensive workloads (SPECjbb, Vortex); C) ITLB miss rate for I/O-intensive workloads; D) ITLB miss rate for memory-intensive workloads.]
Figure 3-7. Increase in TLB miss rate on virtualization. Comparing TLB miss rates in non-virtualized and virtualized platforms shows significantly larger miss rates for the virtualized workloads.
When the change in miss rates with increasing TLB size is observed, it is seen that
the DTLB miss rate for TPCC-UVa on Xen reduces until about the 256-entry TLB and
then becomes constant at 0.5577 misses per thousand instructions. On the other hand,
the virtualized SPECjbb and Vortex show a constantly reducing trend in the DTLB miss
rates with increase in TLB size. It is also clear that the DTLB miss rate on Xen is 1.5×
to 5× larger than on Linux for a large TLB of size 1024 entries. This virtualization-driven
increase in ITLB miss rates is even larger and, for SPECjbb and Vortex, is as large as
70× for the 1024-entry TLB. Thus, this experiment clearly shows the significantly larger
TLB miss rates on a virtualized platform. Depending on whether the page walk hits or
misses in the cache, the cost of every TLB miss may be the time taken for a few RAM
accesses, i.e., upwards of a few hundred cycles.
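For reference, the MPKI metric used above is simply

    MPKI = (TLB misses / instructions retired) × 1000

so that, as an illustrative example, 500,000 DTLB misses over a 1-billion-instruction run
correspond to an MPKI of 0.5.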
3.5.3 Decrease in Workload Performance on Virtualization
To estimate the impact of the TLB on the performance of a workload, the workload
is simulated in two different configurations. In the first configuration, Simics is configured
to use FeS2 but not the TLB timing model, thereby capturing the behavior of an
"ideal TLB" with zero latency for TLB lookups and a 100% TLB hit rate. The workload
Instructions per Cycle (IPC) obtained from this simulation corresponds to an "ideal IPC"
where the TLB is not realistically modeled. This IPC value represents the maximum IPC
that could potentially be achieved by any improved TLB design. Then, the framework is
configured to run Simics with FeS2 and the "regular TLB timing model" (finite capacity
and non-zero page walk latency) and the "realistic IPC" of the workload is obtained. The
difference in the IPC values of these two configurations gives an estimate of the TLB's
influence in determining the performance of the workload.
The metric R_IPC, shown in Equation 3-2, which is the ratio of the difference
between the realistic and ideal IPCs to the ideal IPC, expressed as a percentage, is
used to gauge the impact of the TLB timing model. The higher the value of R_IPC, the
farther the IPC obtained using the realistic TLB timing model deviates from the ideal
IPC value. Hence, R_IPC captures the TLB-induced delay in the performance of the
workload. Any improvement in the TLB architecture which reduces the TLB-induced
delay will lower the R_IPC value and therefore R_IPC may also be used as a figure of
merit to compare various TLB improvement schemes. Moreover, R_IPC may also be
used as an estimate of the deviation of the IPC from a realistic IPC when simulation
frameworks are used to study the characteristics of virtualized workloads without
accounting for TLB timing. Thus, a large R_IPC for a workload emphasizes the criticality
of modeling the TLB behavior for accurately characterizing the performance of the
workload. The R_IPC values from these simulations are shown in Figure 3-8 and
Figure 3-9 for single and consolidated workloads respectively.
R_IPC = 100 × (1 − IPC_RegularTLB / IPC_IdealTLB)    (3–2)
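For instance, if the regular TLB model yields an IPC of 0.9 against an ideal IPC of 1.0,
then R_IPC = 100 × (1 − 0.9/1.0) = 10%, i.e., the TLB costs the workload 10% of its
ideal performance. (These numbers are illustrative, not measured values.)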
3.5.3.1 I/O-intensive workloads
TPCC-UVa uses a PostgreSQL database, which it reads from the disk as needed.
Thus, TPCC-UVa causes some disk I/O activity. The I/O drivers in Xen use a split
architecture, where the front-end driver on the domU uses the privileged back-end
driver on dom0 to perform I/O. As a result, there are a large number of flushes caused
by the context switches between the domains. These I/O-related context switches
cause TPCC-UVa on Xen to have a high TLB miss rate and, therefore, a large R_IPC,
as seen from Figure 3-8A. The R_IPC is especially high at smaller TLB sizes: 8.5% and
12% for a 64-entry TLB with PW values of 60 and 90 cycles respectively.
[Figure 3-8 appears here, with four panels plotting R_IPC (%) on Linux and Xen for page walk latencies of 30, 60 and 90 cycles at TLB sizes of 64, 128, 256, 512 and 1024 entries: A) R_IPC for TPCC-UVa; B) R_IPC for dbench; C) R_IPC for SPECjbb; D) R_IPC for Vortex.]
Figure 3-8. Decrease in single-domain workload performance on virtualization. Performance is expressed using IPC and the decrease in performance using R_IPC. The R_IPC for virtualized workloads is significantly larger, especially at larger TLB sizes.
One advantage of a full-system simulator, such as the one presented in this work, is
that the O/S and the software stack are simulated in addition to the workload application.
Hence, access to performance monitoring tools like top [87, 88] is readily available.
Using top, the memory usage of TPCC-UVa is estimated to be about 50MB.
Thus, in addition to the TLB misses driven by the I/O-related context switches, some
of the TLB misses are also caused by this memory footprint and the lack of sufficient
space in the TLB to accommodate all the entries. When the change in R_IPC with TLB
size is observed, it can be seen that increasing the TLB size from 64 to 256 entries
reduces R_IPC from 8.5% to 5.5%, a 35% reduction in the TLB-induced delay. This
is because a larger TLB is able to accommodate more entries, thereby avoiding the TLB
misses and page walk delays which arise due to the lack of TLB capacity. However,
increasing the TLB size beyond 256 entries does not reduce R_IPC significantly.
Beyond 256 entries, the dominant cause of TLB misses is the repeated flushing of the
TLB and not TLB size limitations.
On comparing the R_IPC values for Linux and Xen, a nine-fold increase in R_IPC is
observed in the virtualized scenario, primarily due to the I/O activity of TPCC-UVa.
At a TLB size of 1024 entries, the R_IPC for TPCC-UVa on Linux is between 0.35%
and 0.9%, depending on the value of the page walk latency. Corresponding values for
the virtualized TPCC-UVa lie in the range of 3.6% to 8%. This clearly underlines the
increased impact of the TLB on workload performance in virtualized scenarios
and the importance of modeling the TLB timing behavior when simulating virtualized
workloads.
The trend of very large R_IPC values for the virtualized version of the workload is
observed for dbench also, as dbench is another I/O-intensive workload. In fact, since
dbench is the most I/O-intensive of all four workloads, it exhibits the highest increase
in R_IPC as a result of virtualization. The R_IPC on Xen for a 64-entry TLB is more than
10× that of dbench running on Linux. Similar to TPCC-UVa, increasing the TLB size to
256 entries lowers the TLB misses, and thereby the R_IPC of virtualized dbench, to some
extent. Beyond this point, however, the reduction in R_IPC is limited, for the same reason
as for TPCC-UVa, i.e., flushing of the TLB becomes the dominant cause of TLB misses.
While the trend is similar to TPCC-UVa, the actual R_IPC values are larger than those of
TPCC-UVa for any given TLB size and page walk latency. It is also instructive to see that,
on increasing the size of the TLB, the R_IPC values for Linux do not change significantly
and reduce only by 9%. This is in contrast to the 60% reduction shown by dbench on Xen.
3.5.3.2 Memory-intensive workloads
The values and trends of R_IPC for SPECjbb on Linux are quite different from both
TPCC-UVa and dbench, as seen in Figure 3-8C. Even when SPECjbb runs on Linux,
it runs inside a Java virtual machine which has a large heap size of 256MB. Moreover,
as explained in Section 3.3, SPECjbb caches its database in memory, causing a wide
spread in the pattern of the memory pages it accesses. Both these factors cause
SPECjbb to exhibit high TLB miss rates, as reported by Shuf et al. [80]. Thus, even in
the non-virtualized scenario, there is a significant R_IPC for SPECjbb. In fact, at smaller
TLB sizes, the R_IPC in Linux is close to that in Xen. For instance, with a 64-entry TLB,
the R_IPC in Linux is almost 80% of that in Xen for a 60-cycle page walk latency.
On increasing the TLB size, however, the additional R_IPC due to virtualization
becomes pronounced. This is due to the inability of increasing TLB sizes to cope with
virtualization-related context switches and the resulting TLB flushes. Even for a workload
like SPECjbb, which is not predominantly I/O-intensive, the R_IPC for a 1024-entry TLB
and a 60-cycle page walk latency increases twofold compared to Linux.
Compared to TPCC-UVa and dbench, another notable difference is the scaling of
the R_IPC values on increasing the TLB size, even beyond the 256-entry TLB size. In
fact, compared to the value at 256 entries, the R_IPC value for the virtualized SPECjbb
reduces
by 11% for a 512-entry TLB and by 16% for a 1024-entry TLB. This behavior is due to
the memory-intensive nature of the workload. At these large TLB sizes, the contribution
of the TLB misses due to a lack of capacity in the TLB is still quite significant and
the workload is able to benefit from the increased space in the TLB. Moreover, since
virtualization-driven TLB flushes are not present in Linux, it can be observed that the
reduction in R_IPC is greater for Linux than for Xen.
Vortex, in spite of being a part of a CPU-intensive benchmark suite, also has a
significant memory usage of about 75MB. While the amount of memory it uses is
smaller than SPECjbb's, its spread-out pattern of accessing pages causes it to have a
miss rate comparable to many Java-based workloads [83]. Hence, the trend of the
R_IPC values is similar to SPECjbb. The impact of virtualization is small at small TLB
sizes, as seen from the R_IPC value on Linux, which is about 90% of the value on Xen.
As the TLB size increases, the reduction of R_IPC on Linux is much steeper than on
Xen. At a TLB size of 1024 entries, Vortex on Xen has an R_IPC which is almost four
times that of Linux, for a 60-cycle page walk latency.
While the trends in R_IPC values are similar for SPECjbb and Vortex, one notable
difference is the magnitude by which they reduce on scaling up the size of the TLB.
For virtualized Vortex, the R_IPC reduces by 70% on scaling the TLB size from 64
entries to 1024 entries, compared to the 40% reduction for virtualized SPECjbb.
3.5.3.3 Consolidated workloads
To study the TLB behavior for consolidated multi-domain workloads, two consolidated
workloads, TPCC-UVA SPECjbb and TPCC-UVA dbench, are created using the method
outlined in Section 3.3.2. In these workloads, both component applications time-share a
single CPU for the length of the simulation, i.e., 1 billion instructions. These workloads
are simulated using FeS2, and the R_IPC values are plotted, as shown in Figure 3-9.
[Figure 3-9 appears here, with two panels plotting R_IPC (%) on Linux and Xen for page walk latencies of 30, 60 and 90 cycles at TLB sizes of 64, 128, 256, 512 and 1024 entries: A) R_IPC for TPCC-UVa consolidated with SPECjbb; B) R_IPC for TPCC-UVa consolidated with dbench.]
Figure 3-9. Decrease in consolidated workload performance on virtualization. Performance is expressed using IPC and the decrease in performance using R_IPC. The R_IPC for virtualized workloads is significantly larger. The trend in R_IPC for a consolidated workload is a combination of the values and trends of the R_IPC of the component applications.
From Figure 3-9A, it can be seen that the values and trends of the R_IPC for
TPCC-UVA SPECjbb are a combination of the individual values and trends for
TPCC-UVa and SPECjbb. As an example, at a TLB size of 64 entries and a page walk
latency of 60 cycles, the increase in R_IPC due to virtualization is 1.45×, 8.66× and
1.26× for the consolidated workload, TPCC-UVa and SPECjbb respectively. This
behavior is due to the fact that, because of equal priorities in the scheduler, these
applications time-share the TLB, causing the resulting behavior to be a combination of
that of both individual applications. It can also be seen that the actual values of the
R_IPC lie between those of the component applications for all TLB sizes and page walk
latencies. A similar behavior is seen for TPCC-UVa dbench, as shown in Figure 3-9B.
Since all the component workloads are I/O-intensive, the increase in R_IPC due to
virtualization is quite large, irrespective of TLB size. In fact, the ratio of R_IPC values on
Xen and Linux is in the range
62
Table 3-4. Impact of Page Walk Latency on TLB-induced performance reduction RIPC
PW RIPC (%)Latency TPCC-UVa on Xen SPECjbb on Xen(Cycles) 64 256 1024 64 256 102430 4.63 3.18 3.06 13.49 9.79 8.3960 8.48 5.60 5.41 21.54 14.94 12.4890 12.14 7.99 7.73 28.48 19.69 16.28180 21.67 14.53 14.05 43.92 31.78 26.58270 29.33 20.21 19.58 54.00 40.85 34.69
of 10× to 6× for the consolidated workloads. The trend of the RIPC on Xen, when scaling
up the TLB sizes, also exhibits the behavior of the component applications and tapers
off beyond 256 entries.
From these observations, it is clear that

• Independent of whether the virtualized workload is I/O-intensive or memory-intensive, the TLB plays a significant role in determining the performance of virtualized workloads. The impact of the TLB ranges from as low as 1% to as much as 35% depending on the TLB size.

• The importance of the TLB in determining workload performance in a virtualized scenario is significantly larger than in non-virtualized environments. In fact, for I/O-intensive workloads, the influence exerted by the TLB on performance can be as much as 9 times greater in virtual than in non-virtual settings.

• For consolidated workloads, the RIPC trends are a combination of those of the individual workloads and exhibit a significantly larger RIPC on virtualized platforms than in single-O/S scenarios.

Thus, not using TLB timing models will cause the IPC values to have large deviations from
realistic values.
3.5.4 Impact of Architectural Parameters on TLB Performance
One of the virtualization extensions to the x86 hardware is the introduction of
Nested Page Tables (NPT) [36] or Extended Page Tables (EPT) [37], where the VMs
can handle page table updates without the help of the hypervisor. While this approach
reduces the overhead of switching between the hypervisor and VM, it increases the cost
of a TLB miss significantly, as described in Section 2.3.3. To investigate the impact of
the larger PW values on RIPC , TPCC-UVa running on the domU of a 1-CPU machine
is simulated with the ideal as well as regular TLB model for page walk latencies of
180 and 270 cycles and the RIPC values are calculated. Similarly, the RIPC values for
memory intensive SPECjbb are also determined for these large PW values. From these
RIPC values tabulated in Table 3-4, it can be seen that the impact of the TLB on the
workload performance is significantly larger at large PW values. RIPC for virtualized
TPCC-UVa increases by about 6.3× on increasing the PW from 30 cycles to 270
cycles. A similar increase of about four times is observed in the case of virtualized SPECjbb.
This underscores the importance of the TLB and of incorporating detailed TLB timing
models when characterizing virtualized workloads for modern platform architectures with
multi-level page tables.
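These scaling factors follow directly from the 64-entry column of Table 3-4:

\[
\left.\frac{RIPC_{PW=270}}{RIPC_{PW=30}}\right|_{\text{TPCC-UVa}} = \frac{29.33}{4.63} \approx 6.3,
\qquad
\left.\frac{RIPC_{PW=270}}{RIPC_{PW=30}}\right|_{\text{SPECjbb}} = \frac{54.00}{13.49} \approx 4.0
\]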
Figure 3-10. Impact of the pipeline fetch width (FW) on TLB-induced performance reduction. Performance reduction is expressed using RIPC. The interaction between the TLB and architectural components such as the pipeline can be captured only by using a TLB timing model as in this simulation framework.
Another advantage of having timing models for the TLB is the ability to study the
impact of various architectural changes on workload performance and RIPC, even when
the change in question is not in the TLB. To investigate the effect of one
such parameter, i.e., the width of the fetch stage of the pipeline, virtualized TPCC-UVa
is simulated with two different fetch widths of 2 and 4 for multiple TLB sizes and page
walk latencies. The IPCs from these simulations are used to determine the RIPCs which
are shown in Figure 3-10. From this, it can be seen that narrowing the fetch part of
the pipeline reduces RIPC quite significantly. With a narrower stream of instructions,
there is reduced pressure on the TLB, and thereby a smaller number of TLB-related
stall cycles. For instance, the RIPC for a 64-entry TLB and 60-cycle page walk latency is
almost a third smaller for a 2-wide fetch stage than for a 4-wide fetch stage. This trend
is seen irrespective of the TLB size. It is also interesting to note that the reduction in
RIPC from narrowing the fetch stage is less pronounced at larger page walk latencies,
since at large PW values the TLB-induced delay grows relative to the stall cycles caused
by the rest of the system. Thus, it is clear that using a timing model helps in understanding
the impact of various non-TLB architectural parameters on the TLB behavior of
workloads.
3.6 Summary
In this chapter, a full-system simulation framework based on Simics and FeS2
and incorporating detailed TLB functional and timing models is developed and used to
investigate the TLB-induced delay for I/O-intensive, memory-intensive and consolidated
workloads. The impact of the TLB on workload performance is found to depend on the
TLB size as well as the value of the page walk latency. For typical server workloads,
the performance of the workloads is reduced by 8% to 35% due to the increased
TLB flushes and misses on virtualized platforms. It is also seen that the TLB-induced
performance degradation, especially for TPCC-UVa and dbench, is as much as 7× to
8× larger for the virtualized workloads compared to non-virtualized scenarios.
CHAPTER 4
A TLB TAG MANAGEMENT FRAMEWORK FOR VIRTUALIZED PLATFORMS
While virtualization based server consolidation offers advantages such as
effective, flexible and controllable use of server resources, the workloads running in
such virtualized platforms experience lower performance than their non-virtualized
counterparts. One significant source of this performance degradation, as seen in
Chapter 3, is the high frequency context switch-related flushing of the Translation
Lookaside Buffer which increases the TLB miss rate and page walks to service these
misses, thereby reducing the performance of the virtualized workloads. Reducing this
TLB-induced performance degradation is an important challenge in virtualization.
4.1 Current State of the Art in Improving TLB Performance
Hardware-managed TLBs, such as the x86 TLB, are completely flushed on context
switches to ensure consistency of the entries and to prevent the entries of one process's
address space from being used for another process. This repeated flushing causes TLB
lookups to miss, necessitating high-latency page walks and thereby reduces the
workload performance. However, if the TLB entries are identified as belonging to a
specific address space by using a tag, then the TLB need not be flushed on context
switches.
Avoiding TLB flushes by tagging the entries with address space identifiers is a
well-established technique in software-managed TLBs [26, 27, 89]. The use of TLB tags
on Itanium [90] as well as on PowerPC [91–93] has also been investigated. Prior to
the advent of virtualization, however, tagging of the entries in the hardware-managed
x86 TLB was not exhaustively studied, primarily because the hardware-managed x86
TLB is flushed only infrequently in non-virtualized cases, about once every million
cycles, and is therefore not a major source of performance degradation.
On the other hand, TLB flushes and the resultant TLB-induced delay cannot be
ignored on virtualized systems as evident from Section 3.5. The introduction of TLB
tags and a hardware-based tag-checking mechanism as a part of the virtualization
extensions, such as AMD-SVM [36, 48] and Intel VPID [37], is clearly a nod to the
importance of the TLB on virtualized platforms. In AMD-SVM [48], each TLB entry has
a 6 bit Address Space ID (ASID) as a part of its entry. Currently, Xen on AMD-SVM [94]
uses ASID 0 for the hypervisor or the Host mode. As long as the CPU is in Host mode,
the TLB entries are tagged with ASID 0. When the CPU switches to Guest mode, the
TLB is not flushed, but the ASID is changed from 0 to the ASID of the guest VM. Thus,
any TLB entry belonging to the hypervisor will not be declared a hit for a Guest-initiated
TLB lookup, as the ASID tags will not match. Avoiding TLB flushes using ASID tags
is found to reduce the overall runtime of kernbench, a kernel-compiling workload, by
about 11% [94]. Similarly, in the Intel Nehalem [37], the TLB entries are
tagged with a per-VM Virtual Processor Identifier (VPID). Intel platforms such as the
Westmere [28] support PCID, a process-specific tag which is assigned and managed by
the system software. Tickoo et al. [18] also explore TLB tagging in their qTLB approach,
where the TLB entries belonging to the hypervisor, which are global within a domain, are
not flushed during a switch from one domain to another.
The primary intent of these efforts is to make the switching between VMs more
efficient by avoiding a TLB flush. However, using VM-specific tags1 can avoid only a
subset of the context switch related TLB flushes compared to using process-specific
tagging. In addition to this, while a software-transparent hardware-only scheme is
desirable for hardware-managed TLBs to keep in line with the ”hardware-managed”
1 In this dissertation VM-specific tags are alternatively referred to as domain-specifictags, dom-specific tags, per-VM tags or per-domain tags
67
design philosophy, the system software is involved in avoiding TLB flushes in all these
approaches including the PCID architecture [28].
To meet these requirements, the Tag Manager Table (TMT) is proposed in this
dissertation. The TMT is a low-latency architecture which derives a tag from the
PTBR (CR3 register in x86) in a software-transparent manner and uses it to tag the
TLB entries. Such an approach significantly reduces TLB miss rates and the number
of TLB flushes, compared to using only VM-specific tags. The impact of the TMT is
investigated, in terms of the reduction in TLB flushes, TLB miss rate and TLB-induced
performance reduction using the full-system simulation framework developed in
Chapter 3. The influence of various hardware design parameters and workload
characteristics on this impact of using the TMT is analyzed. The use of the TMT in
enabling shared Last Level TLBs is also presented.
4.2 Architecture of the Tag Manager Table
VM-specific TLB tagging, as seen in qTLB [18], is aimed at preventing the hypervisor
entries from being flushed when there is a context switch between two VMs, termed an
Inter-VM switch. However, these tags do not prevent the TLB from being flushed if there
is a context switch between two processes within the same VM, i.e., an Intra-VM switch.
By choosing tags that associate the TLB entries with a particular process’s address
space rather than a particular VM, it is possible to avoid TLB flushes triggered due
to all types of context switches. Furthermore, it is important that the tagging solution
for hardware-managed TLBs preserves the hardware-based TLB management with
minimal or no software involvement. These two requirements dictate the design of the
Tag Manager Table.
One potential tag which conforms to these requirements is the Page Table Base
Register (PTBR) which is stored in a hardware register (CR3 register in the case of
the x86 architecture). Since every process has a unique set of page tables, the value
in the CR3 register is unique for every address space, and the contents of the CR3 can be
obtained without a high-latency interaction with the system software stack. Hence, the
TLB entries may be tagged with the CR3 value to identify the process or virtual address
space to which they belong. However, the size of the CR3 register is quite large (32 or
64 bits); tagging the TLB entry with the CR3 increases the die area as well as the energy
expenditure for the TLB lookup. Hence, the Tag Manager Table (TMT) is proposed to
achieve this software-transparent process-specific tagging with minimal overheads.
The TMT, shown in Figure 4-1, is a small, fast cache implemented at the same level
as the TLB. Every TLB in the platform has an associated TMT. Each entry in the TMT
represents the context of a process and has three fields:
• The CR3 field, which contains the value of the CR3 register, a per-process unique pointer to the page tables for the process.

• The Virtual Address Space Identifier (VASI), which stores a unique identifier associated with the address space of the process. The VASI is generated as a function of the CR3 in a software-transparent manner. Any function which guarantees that all entries in the TMT have different VASIs, such as a perfect hash or the CR3 masked with an appropriate bitmask, can be used. In the work presented here, the position of the entry in the TMT is used as the VASI. For instance, the VASI for the first entry in the TMT is 0, the second entry is 1 and so on. This simple scheme eliminates the need for a complex hash function or a bitmask while guaranteeing a unique VASI for every TMT entry.

• The Sharing ID (SID) field, which stores the identifier of the sharing class to which the process belongs. The SID is needed only for controlling the sharing of the TLB and can be left unassigned in the case of uncontrolled sharing, as in all the experiments performed in this chapter. The selection of the sharing classes and the use of the SID are discussed in detail in Chapter 5.
The SID and the VASI together constitute the "CR3 tag"2. Tagging the TLB entries
with the CR3 tag instead of the CR3 itself results in a lower area overhead. For instance,
with an 8-entry TMT and a 3-bit SID, the CR3 tag is only 6 bits compared to the 32- or
64-bit CR3. The TMT architecture also consists of a Current Context Register (CCR).
The CCR is a register of the same size as a Tag Manager Table entry, which caches the
CR3, SID and VASI for the current context.

2 In this dissertation the CR3 tag is also referred to as process-specific tag or per-process tag.
Figure 4-1. TLB flush behavior with the Tag Manager Table. In step 1, a value is written in CR3, prompting a flush. In step 2, the TMT is searched for the new CR3; simultaneously the new CR3 is compared to the current CR3 in the CCR. The TLB and the TMT are flushed if the new CR3 matches the CCR, or if the new CR3 is inserted into the TMT after evicting an existing entry, as shown in step 3.
4.2.1 Avoiding Flushes Using the Tag Manager Table
Whenever there is a context switch from process P1 to P2, a TLB flush is triggered
by the "MOV CR3" instruction, which updates the value of the CR3 register, as shown
in step 1 of Figure 4-1. On the triggering of the TLB flush, the TMT is searched
to determine if the CR3 value of P2 already exists, as shown in Figure 4-1, step 2.
Simultaneously, the new value being written into the CR3 is compared with the current
CR3 value from the CCR. If the new CR3 value is different from the current CR3 value,
it is deduced that the TLB flush was triggered by a context switch. The TMT is searched
for the new CR3 value. If it exists in the TMT, that TMT entry is copied into the Current
Context Register. On the other hand, if the CR3 value of P2 is not found in the TMT, it
is inserted into a free slot in the TMT and a VASI is assigned to it. Then, this TMT entry is
copied into the CCR. Once the CCR is populated with the CR3 and the tags of P2, any
TLB lookup will hit only if the TLB entry belongs to P2 and matches the tags in the CCR.
Thus, in both these cases, updating of the CCR is equivalent to flushing the TLB and the
actual TLB flush can be avoided.
A situation may arise during a context switch from P1 to P2 where the CR3 of P2 is
not in the TMT and, due to limited capacity, there are no free entries in the TMT. In this
case a victim TMT entry, (CR3_3, SID_3, VASI_3) belonging to P3, is selected and evicted
in accordance with a First In First Out (FIFO) replacement policy. The CR3 and SID values
of P2 replace CR3_3 and SID_3, while VASI_3 is reused for P2. To avoid the TLB entries
of P3 being used for P2, the entries with the tag VASI_3 are flushed, as seen in Figure 4-1,
step 3. This flush, caused by the lack of capacity in the Tag Manager Table, is termed a
Capacity Flush.
Since the latency for examining every TLB entry and flushing only those entries
with the tag VASI_3 may be prohibitive, the capacity flush is implemented as a full TLB
flush. However, the downside of such an implementation is the eviction of TLB entries
whose tags are not VASI_3, thereby potentially increasing the TLB miss rate. Moreover, ISA
extensions [28] for flushing entries with a specific tag and the hardware to implement
this instruction without a prohibitive latency are being introduced in modern processors.
With such extensions, the capacity flush may be implemented as a selective flush and
not result in the entire TLB being flushed.
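The difference between the two implementations can be sketched as follows, with a hypothetical TlbEntry type standing in for the TLB side; the selective variant is what tag-selective ISA support would enable, while the full flush is what is modeled in this chapter's experiments.

#include <cstdint>
#include <vector>

// Hypothetical TLB entry for a functional model (not hardware RTL).
struct TlbEntry {
    uint64_t vpn = 0;       // virtual page number
    uint64_t ppn = 0;       // physical page number
    uint8_t  vasi = 0;      // CR3-tag VASI of the owning address space
    bool     global = false;
    bool     valid = false;
};

// Selective capacity flush: invalidate only the victim's entries. Scanning
// every entry may be prohibitive without ISA/hardware support for
// tag-selective invalidation.
void selectiveFlush(std::vector<TlbEntry>& tlb, uint8_t victimVasi) {
    for (auto& e : tlb)
        if (e.valid && e.vasi == victimVasi) e.valid = false;
}

// Full flush: what the modeled capacity flush (and a Forced Flush) does.
void fullFlush(std::vector<TlbEntry>& tlb) {
    for (auto& e : tlb) e.valid = false;
}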
Apart from context switches, TLB flushes may also be triggered by changes in the
page tables. Whenever page tables are modified, any entry cached in the TLB which
is affected by this change should be flushed from the TLB to maintain consistency
between the TLB and the page tables. In both non-virtualized (Linux) and virtualized
(Xen) systems, consistency is maintained by flushing the entire TLB. On examining the
source code of both Linux and Xen, it is found that this flush is effected by a two step
process. The current value in the CR3 register is read in the first step and the same
value is written into the CR3 register using a "MOV CR3" instruction in the second step.
Even though no change of context is involved, this "MOV CR3" instruction still triggers a
flush of the TLB. Such flushes are called Forced Flushes.
The TMT is designed to recognize these Forced Flushes. As seen in step 2 of
Figure 4-1, whenever a new value is written into the CR3, it is compared with the current
CR3 value from the CCR. If both are the same, this flush is deduced to be a
Forced Flush and the TLB is flushed completely, as depicted in Figure 4-1, step 3.
Whenever the TLB is force flushed, the TMT is also flushed, freeing the slots
occupied by contexts which no longer have any entries in the TLB. This behavior is shown
in Figure 4-1.
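Putting the pieces together, below is a minimal sketch of the MOV CR3 handling with the TMT, combining the context-switch and Forced Flush cases described in this section. It builds on the illustrative types and flush helpers from the earlier sketches; re-insertion of the current context into the TMT after a Forced Flush is omitted for brevity.

// Invoked on every "MOV CR3": decide whether the flush can be avoided.
void onMovToCr3(TagManagerTable& tmt, std::vector<TlbEntry>& tlb,
                uint64_t newCr3) {
    // Forced Flush: the same CR3 is rewritten to synchronize the TLB with
    // modified page tables; both the TLB and the TMT are flushed.
    if (tmt.ccr.valid && newCr3 == tmt.ccr.cr3) {
        fullFlush(tlb);
        for (auto& e : tmt.entries) e.valid = false;
        return;
    }
    // Context switch: if the new CR3 is already in the TMT, updating the
    // CCR stands in for the TLB flush, which is thereby avoided.
    if (auto hit = tmt.find(newCr3)) {
        tmt.ccr = *hit;
        return;
    }
    // New context: use a free slot if one exists (flush still avoided).
    for (int i = 0; i < TagManagerTable::kEntries; ++i) {
        if (!tmt.entries[i].valid) {
            tmt.entries[i] = {newCr3, 0, static_cast<uint8_t>(i), true};
            tmt.ccr = tmt.entries[i];
            return;
        }
    }
    // Capacity Flush: evict the FIFO victim and reuse its VASI; modeled
    // here, as in the experiments, as a full TLB flush.
    int victim = tmt.fifoNext;
    tmt.fifoNext = (tmt.fifoNext + 1) % TagManagerTable::kEntries;
    fullFlush(tlb);
    tmt.entries[victim].cr3 = newCr3;   // the SID would be replaced likewise
    tmt.ccr = tmt.entries[victim];
}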
4.2.2 TLB Lookup and Miss Handling Using the Tag Manager Table
Figure 4-2. TLB lookup behavior with the Tag Manager Table. In step 1, a possible match is found in the TLB by comparing the VPN of the virtual address with the TLB entries. In step 2, the VASI from the TLB entry and the VASI from the CCR are compared. The TLB lookup results in a hit only if both the VPNs and the VASIs match, as in step 3.
The TLB lookup happens as shown in Figure 4-2. The TLB is searched for any
entry which has the same VPN as the virtual address. Simultaneously, the VASI of the
current context is looked up from the CCR. The entry is declared as a hit only when its
VASI matches the VASI in the CCR and the VPN is the same as the VPN in the virtual
address being looked up. Since the CCR is a dedicated register, the VASI can be looked
up with minimal latency. It should be noted that the comparison of the VASI happens in
parallel with the VPN comparison, as shown in Figure 4-2. Thus no additional latency is
imposed by the TMT in the critical TLB lookup path. If the lookup results in a miss, the
page walk proceeds to determine the physical address from the page tables. Once this
translation is obtained, it is added to the TLB along with the CR3 tag (SID and VASI) of
the current context.
One issue with enabling TLB sharing, as with caches which are indexed using
virtual addresses, is aliasing [95]. Aliasing is the situation where the same translation
may be cached once for every process’s address space, thus creating multiple copies of
the same entry. Such situations arise typically with Global entries which are translations
for virtual addresses in the 3GB-4GB range belonging to the kernel. For instance, the
entries corresponding to the high memory range in Linux are valid in all process address
spaces and are marked using the Global bit in the TLB entries. To avoid multiple copies
of such Global entries with different VASI tags, the TLB lookup logic is modified
to "hit" when either the VASI of the entry matches the VASI in the CCR or the
Global bit in the entry is set. This ensures that only one copy of each Global entry is
cached in the tagged TLB.
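The resulting hit condition, including the Global-bit exception, can be summarized in a short sketch using the same illustrative types; a fully associative scan is shown for simplicity, whereas a real TLB would index a set first.

// Tagged TLB lookup: a hit requires a VPN match plus either a VASI match
// with the CCR or the Global bit; the tag compare is modeled as happening
// in parallel with the VPN compare, adding no lookup latency.
const TlbEntry* lookup(const std::vector<TlbEntry>& tlb,
                       const TagManagerTable& tmt, uint64_t vpn) {
    for (const auto& e : tlb) {
        if (e.valid && e.vpn == vpn &&
            (e.global || e.vasi == tmt.ccr.vasi))
            return &e;   // hit
    }
    return nullptr;      // miss: the page walk fills the TLB, and the new
                         // entry is tagged with the CCR's SID and VASI
}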
While the preceding explanation of the TMT is for x86 processors without hardware
virtualization support, it can be used for processors with Extended/Nested page tables
(EPT/NPT) [36, 37, 48] with minor modifications. In a processor with EPT/NPT support,
as described in Section 2.3.3, both the guest and host CR3 values are cached in the
TMT entry, ensuring that the CR3 tag remains unique per process address space.
4.3 Modeling the Tag Manager Table
The TMT and the process-specific tagged TLB are modeled using the generic
tagged TLB simulation model described in Section 3.2. The functionality of the TMT
is mapped to the GMT module. Thus, the TLB flush on every MOV CR3 instruction is
intercepted by the GMT module which performs the necessary changes in the TMT and
uses the TMT functionality to decide whether this flush should be carried out or avoided.
Similarly, the CCR is mapped to the TagCache which gets updated on every MOV CR3
instruction. Since the TMT is designed to perform the tag comparison without imposing
any additional delay during the TLB lookup, the TLB lookup latencies are maintained at
the same values when simulating the regular TLB and the tagged TLB. The modeling of
the TMT is validated using the Functional Check mode described in Section 3.2.
4.4 Impact of the Tag Manager Table
In this section, the benefit of using the Tag Manager Table is evaluated using three
metrics, similar to those used in Section 3.5, namely: (1) the number of flushes, (2) the
ITLB and DTLB miss rates, and (3) the increase in workload performance.
4.4.1 Reduction in TLB Flushes Due to the TMT
In a generic cache memory, the size of the cache is the main determinant of the
miss rate. When a workload begins to execute, there will be a few misses as the data is
brought into the cache for the very first time. Such misses are termed cold misses.
Beyond this warmup phase, for an infinitely large cache, all the required data will be
contained in the cache and the hit rate will asymptotically reach 100%. However, the
situation is not the same for TLBs.
Apart from the size, one of the main determinants of the TLB hit rate is the
frequency at which the TLB is flushed. In TLBs where no tags are used, the hit rate
is limited because of the shortened lifespan of the entries. Even in the case of
TLBs with unlimited size, the hit rate is still limited due to the periodic purging of the TLB.
The benefit of using an identifier to tag the TLB is in avoiding flushes and lowering the
miss rate in the TLB. If more flushes are avoided, the increased lifespan of the TLB
entries will result in higher hit rates. Thus, the reduction in the number of TLB flushes
compared to an untagged TLB is a coarse, yet intuitive figure of merit for understanding
the impact of the TMT.

Figure 4-3. Reduction in TLB flushes using an 8-entry TMT. (A) Flush profile of different applications running on a 1-CPU simulated machine; the intra-VM flushes are high when there is more than one process in a domain. (B) Reduction in TLB flushes on using an 8-entry TMT; more than 90% of the flushes are eliminated in cases where the Forced Flushes do not dominate.
TLB flushes occurring in virtualized scenarios can be classified, based on the cause
for the flush, into three types. The reason that the TLB has to be flushed can be either
a context switch or that the page table has been modified, as described in Section 4.2.
If the cause is a context switch, the two processes between which the switch happens
could be within the same VM or could be part of different VMs. Based on this, a flush
can be classified into three categories: Intra-VM flushes caused by an intra-VM context
switch, Inter-VM flushes caused by a domain-to-domain or inter-VM context switch, and
Forced Flushes. This classification is called the flush profile and is a good indicator of
the gain that can be achieved by using a tagged TLB. For instance, if the forced flushes
dominate, then, irrespective of whether process-specific tags or domain-specific tags
are used, the TLB will still be frequently flushed. In such cases, the number of flushes
that can be avoided will be small, leading to smaller gains from using tagged TLBs. On
the other hand, using tags will reap significant benefits when the context switch flushes,
which can be avoided, dominate.
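As an illustration, the classification could be computed in a simulator as below; the domain-identifier arguments are hypothetical, standing in for whatever mechanism the simulator uses to attribute a CR3 value to a VM.

#include <cstdint>

// Illustrative flush classification for the flush profile.
enum class FlushType { IntraVM, InterVM, Forced };

FlushType classifyFlush(uint64_t oldCr3, uint64_t newCr3,
                        int oldDomainId, int newDomainId) {
    if (newCr3 == oldCr3) return FlushType::Forced;      // consistency flush
    if (newDomainId != oldDomainId) return FlushType::InterVM;
    return FlushType::IntraVM;                           // switch within one VM
}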
The flush profiles of the four workloads mentioned in Section 3.3.1, running on a
simulated x86 machine with one CPU and one domU, are presented in Figure 4-3A. It
can be observed that TPCC-UVa, which is a typical server workload, has a significant
number of context switch flushes, about 92%. Out of these, the number of intra-VM
and inter-VM context switch flushes are almost equal. However, in the case of
single-process workloads such as SPECjbb and Vortex, inter-VM flushes dominate
the profile compared to intra-VM flushes. Moreover, since the only activity performed by
dbench is file reads and writes, it is more I/O-intensive than TPCC-UVa. Hence, most of
the flushes it experiences are due to the transitions between domU and dom0 for access
to the privileged device drivers residing in dom0, and due to the forced flushes resulting
from the actual transfer of data to/from the disk. Thus, the intra-VM flushes constitute
only 2.5% of the total flushes for dbench.
The advantage of using the TMT and process-specific tags is that both inter-VM
and intra-VM flushes can be avoided. From the reduction in the TLB flushes for these
workloads using an 8-entry TMT as shown in Figure 4-3B, it is seen that about 96%
of the flushes for SPECjbb and Vortex are avoided, even though the inter-VM flushes
dominate for these workloads. In cases where there are a substantial number of
intra-VM flushes, as in TPCC-UVa, almost 90% of the flushes are eliminated. If, on the
other hand, domain-specific tags were used, only about 50% of the flushes would have
been eliminated. The reduction in the TLB flushes is smaller only for dbench, where 35%
of the TLB flushes are forced flushes and are unavoidable. Even for this workload, the
elimination of context switch flushes reduces the total number of flushes by 65%.
Effect of the Tag Manager Table size
While the composition of the flushes determines the number of flushes that can
be avoided, the size of the TMT (the number of entries in the TMT) also influences this
reduction and is an important design parameter. The TMT size decides the number
of processes or address spaces that can concurrently share the TLB. If the size is
increased, additional processes can be represented in the TMT and the number
of capacity flushes (context switch flushes which could not be avoided due to lack
of capacity in a smaller TMT) can be reduced. On the other hand, increasing the TMT
size causes the VASI to have a larger size and increases the die size as well as the
energy required for tag comparison. If the number of capacity flushes is already small,
increasing the TMT size will not result in commensurate reduction of the TLB miss rate.
Moreover, in cases where the size of the TLB entry tag is fixed, such as the 6 bits for the
AMD SVM [48], a smaller TMT results in a smaller VASI leaving free bits which may be
used to store metadata for TLB usage management. Hence determining the appropriate
TMT size is quite important.
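The area side of this tradeoff is simple arithmetic: since the VASI is the TMT entry index, an N-entry TMT requires ceil(log2 N) VASI bits, so for the configuration used here:

\[
\text{VASI bits} = \lceil \log_2(\text{TMT entries}) \rceil = \lceil \log_2 8 \rceil = 3,
\qquad
\text{CR3-tag bits} = 3\ (\text{VASI}) + 3\ (\text{SID}) = 6,
\]

compared to the 32 or 64 bits required to store the raw CR3 in every TLB entry. With a fixed 6-bit tag field, as in AMD-SVM, a smaller TMT (for example 4 entries, giving a 2-bit VASI) would leave a free bit for TLB usage metadata.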
To study the size tradeoffs for the TMT, the TPCC-UVa application is run on a
simulated x86 uniprocessor machine which has 256-entry, 8-way TLBs with CR3
tagging. The size of the TMT is varied from 0 entries (representing a situation with no
CR3 tagging) to 16 entries. For each TMT size, the number and type of flushes as well
as the reduction in TLB miss rates is observed. The results are shown in Figure 4-4.
Over 10 billion instructions of TPCC-UVa, there are 64738 flushes. Out of these,
5100 are forced flushes and the remaining 59638 are due to inter-VM and intra-VM
context switches. When the TMT size is 0 entries, every context switch causes
a capacity flush. Hence, the TLB is flushed 64738 times as seen from Figure 4-4.
However, with CR3 tagging and a 2-entry TMT, there is a substantial reduction in the
number of flushes, from 64738 to 29484, as the number of capacity flushes reduces by
more than 50%. This reduces the miss rate by about 25% for the DTLB and 30% for
the ITLB.

Figure 4-4. Effect of Tag Manager Table size on the reduction in number of flushes. The number of flushes for TPCC-UVa over 10 billion x86 instructions is shown on the left Y axis using a log scale. The reduction in DTLB and ITLB miss rates for a 256-entry 8-way TLB is shown on the right Y axis. While increasing the TMT size up to 8 entries reduces the total number of flushes and the miss rate, further increases do not reduce the total number of flushes significantly and therefore do not reduce the miss rate.

Further scaling up the TMT size, however, gives diminishing returns and any size
beyond 8 entries does not substantially reduce the miss rate, even though the capacity
flushes are reduced. This is because, at TMT sizes larger than 8 entries, the dominant
type of flush is the forced flush and not the capacity flush. Even if the capacity flushes
are reduced by having a larger TMT, the forced flushes still periodically flush the TLB
limiting the lifetime of the entries. The simulations are repeated with SPECjbb, Vortex
and dbench. In all the cases, it is found that an 8-entry Tag Manager Table is sufficient to
ensure that the number of capacity flushes is much smaller than the forced flushes.
4.4.2 Reduction in TLB Miss Rate Due to the TMT
While the reduction in the number of flushes is a coarse metric and provides some
insight into the advantage of using process-specific tags, it is not sufficient to investigate
the benefit of the TMT thoroughly. For instance, if the flush profile for a workload is
such that it experiences no intra-VM flushes, using either process-specific tags or
domain-specific tags will avoid the same number of flushes. However, domain-specific
tagging solutions such as qTLB [18] can retain only the hypervisor’s TLB entries across
context switches. Using process-specific tags such as the CR3 tags can retain all
entries. Thus, though the same number of flushes are avoided, using process-specific
tags may result in lower TLB miss rate.
In order to capture such differences, the reduction in TLB miss rate when using the
TMT compared to using an untagged TLB is used. This metric is quantified as Reduction,
as shown in Equation 4-1, and is expressed as a percentage of the untagged TLB miss
rate. The advantage of using Reduction is its high sensitivity to any TLB or TMT related
changes and insensitivity to changes in other architectural subsystems such as the
cache.
\[
\text{Reduction (\%)} = 100 \times \left( 1 - \frac{\text{TLB miss rate with tags}}{\text{TLB miss rate without tags}} \right)
\tag{4-1}
\]
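As a small helper, Equation 4-1 can be computed as below; the MPKI values in the usage comment are hypothetical, chosen only to illustrate the formula.

// Equation 4-1 as a helper function.
double reductionPercent(double missRateWithTags, double missRateWithoutTags) {
    return 100.0 * (1.0 - missRateWithTags / missRateWithoutTags);
}
// Example: an untagged DTLB at 5.0 MPKI and a tagged DTLB at 3.0 MPKI give
// reductionPercent(3.0, 5.0) == 40.0, i.e., a 40% Reduction.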
The benefit of not flushing the TLB when switching from process P1 to P2 depends
on the amount of TLB space being used by P2. If P2 requires a large TLB space, any
of P1’s entries which survived the TLB flush will still be evicted to make space for P2’s
entries. In such cases, the reduction in the TLB miss rate due to tagging will be very
small. Thus, the maximum benefit from tagging can be obtained when the TLB is large
enough to accommodate the entries of both P1 and P2. On the other hand, a large TLB
will consume valuable chip real estate which may be utilized better by other subsystems,
such as a larger L1 cache. Thus, the TLB size should be made sufficiently large to
optimize the reduction of the miss rate due to CR3 tagging, but no larger.
Figure 4-5. Reduction in TLB miss rate using an 8-entry TMT and 8-way associativity. Larger TLB sizes allow the caching of more TLB entries across context switches, leading to a higher reduction in TLB miss rate. (A) Reduction in DTLB miss rate. (B) Reduction in ITLB miss rate.
To investigate how the benefit of using the TMT depends on the TLB size, the
I/O-intensive and memory-intensive workloads are simulated on a uniprocessor
Simics machine, and the Reduction in DTLB and ITLB miss rates is plotted, as shown in
Figure 4-5. It should be noted that the miss rate used in these calculations is expressed
in Misses per Thousand Instructions (MPKI). From this, it can be seen that all workloads,
except dbench, show an increasing Reduction with TLB size. For instance, the Reduction
trend for TPCC-UVa shows that the DTLB miss rate for a 1024-entry tagged TLB is
65% smaller than for the untagged TLB. dbench, on the other hand, shows some
increase in the Reduction in DTLB MPKI for TLB sizes up to 256 entries; beyond that
size, misses due to the lack of TLB capacity stop being the predominant source of
TLB misses, the repeated flushing of the TLB begins to dominate, and the Reduction
curve plateaus. Both Vortex and SPECjbb exhibit
Reduction curves with a high slope, even for 1024-entry TLB, indicating that further
increase in the TLB size may achieve even lower DTLB miss rate.
The Reduction trends for the ITLB miss rate, shown in Figure 4-5B, are markedly different
from the DTLB Reduction trends. The space required in the ITLB is smaller than the
DTLB space requirements, since the instruction memory footprint for these workloads
is smaller than the data memory footprint. As a result, the major difference from
the DTLB trend is that the reduction in ITLB miss rate is significantly larger, for any
given TLB size, than the reduction in the DTLB miss rate. Moreover, while the Reduction
in DTLB miss rate is low for SPECjbb and Vortex due to their memory-intensive nature, the
Reduction in ITLB miss rate is significantly higher. It can also be observed that, in spite of
the small instruction memory footprint, the repeated forced flushing of the TLB causes
the Reduction in ITLB miss rate for dbench to be limited.
TLB associativity
Another important TLB design parameter is the associativity. Increasing the
associativity will reduce the conflict misses in the TLB. However, larger associativity
values necessitate more comparators in the TLB lookup hardware to match the
VPN, thereby increasing the area and power requirements. Hence, it is important to
understand the effect that the TLB associativity has on the reduction in miss rate due to
CR3 tagging.
On simulating TPCC-UVa, with an 8-entry TMT and tagged TLB of varying
associativity values, and plotting the Reduction trend, as shown in Figure 4-6A and
Figure 4-6B, it can be observed that the associativity has little effect on the Reduction.
There is some additional Reduction in the miss rate when the associativity is changed
from 4-way to 8-way, but any further increase in the set size does not change the Reduction
significantly. This analysis is also performed for the other workloads, and a similar
response to changing associativity is observed. Thus, by setting the associativity value
at 8, the benefit of using the TMT can be obtained without a high area and power
overhead.

Figure 4-6. Effect of TLB associativity on the reduction in miss rate with an 8-entry TMT. While increasing the associativity from 4-way to 8-way shows some additional increase in the reduction in TLB miss rates, higher associativity values do not make a significant difference. (A) Reduction in DTLB miss rate. (B) Reduction in ITLB miss rate.
4.4.3 Increase in Workload Performance Due to the TMT
The most important end result of using the TMT is the improvement in the
performance of virtualized workloads. However, Reduction is not sufficient to understand
this improvement. To quantify this performance improvement, workloads are first run
on the framework described in Section 3.2 with an untagged regular TLB model, and
the Instructions per Cycle (IPC) from this simulation, IPC_{Regular TLB}, is noted. Then, the
workloads are simulated using the tagged TLB model augmented with the TMT, and
the IPC, IPC_{TMT TLB}, is noted. The reduction in TLB misses on using the tagged TLB
is reflected in IPC_{TMT TLB} being higher than IPC_{Regular TLB}, and this Increase in IPC
(I_{IPC}), as shown in Equation 4-2, gives the impact of the TMT on the performance of
the workloads. The greater the number of TLB misses avoided by the TMT, the larger is the
value of I_{IPC}.
The theoretical maximum value of I_{IPC} is obtained when the TLB behaves
like an ideal TLB, as explained in Section 3.5.3, and experiences no TLB misses. In
this case, the TLB-induced delay, i.e., the latency due to TLB misses and subsequent
page walks, is completely eliminated. By simulating the workloads with an ideal TLB
model (no TLB misses and no latency due to page walks) and observing the IPC
(IPC_{Ideal TLB}), this maximum achievable I_{IPC} can be obtained. Expressing the I_{IPC} achieved
using the tagged TLB as a percentage of this maximum achievable I_{IPC}, as shown in
Equation 4-2, gives the Impact Factor (IF) of the TMT. The IF gives an insight into the
performance benefit of the TMT architecture. For instance, an IF of 50% implies that
the TMT improves the I_{IPC} by 50% of the increase achievable by any TLB architecture
(including the ideal TLB), or that the impact of the TLB delay on overall performance has
been reduced by 50%.
\[
I_{IPC} = 100 \times \left( \frac{IPC_{TMT\,TLB}}{IPC_{Regular\,TLB}} - 1 \right)
\qquad
IF = 100 \times \left( \frac{IPC_{TMT\,TLB} - IPC_{Regular\,TLB}}{IPC_{Ideal\,TLB} - IPC_{Regular\,TLB}} \right)
\tag{4-2}
\]
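The two metrics of Equation 4-2 can be computed as below; the IPC values in the example are hypothetical, chosen only to show how I_{IPC} and IF relate.

// Equation 4-2 as helper functions.
double increaseInIpc(double ipcTmt, double ipcRegular) {
    return 100.0 * (ipcTmt / ipcRegular - 1.0);                     // I_IPC (%)
}
double impactFactor(double ipcTmt, double ipcRegular, double ipcIdeal) {
    return 100.0 * (ipcTmt - ipcRegular) / (ipcIdeal - ipcRegular); // IF (%)
}
// Example: ipcRegular = 0.50, ipcTmt = 0.53, ipcIdeal = 0.56 give
// I_IPC = 6% and IF = 50%: the TMT recovers half of the TLB-induced delay.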
Using IPC-based metrics to understand the performance impact of the TMT has the
advantage of being applicable to all types of workloads, especially when it is not feasible
to run the workload benchmarks to completion. Moreover, avoiding TLB misses will
reduce the time spent by the CPU waiting for page walks to complete and using IPC
is appropriate for estimating this reduction. However, it is also important to understand
the implications of using the TMT with user-observable performance metrics. For
this, SPECjbb is instrumented to indicate the completion of every transaction. This
number of transactions is used to estimate the throughput of SPECjbb and measure the
improvement in SPECjbb’s performance when the TMT is used.
To understand the improvement in performance due to the TMT, a single-CPU x86
machine is simulated using the framework described in Section 3.2 and the virtualized
workload is run on this x86 machine with either the ideal TLB, the regular TLB or the
tagged TLB with an 8-entry TMT. The IF and I_{IPC} values for various TLB sizes and 8-way
associativity are calculated and presented in Figure 4-7.
As seen from Section 4.4.1, TPCC-UVa experiences an approximately equal number
of inter-VM and intra-VM flushes and a much smaller number of forced flushes. Avoiding
these flushes using the TMT reduces the TLB miss rate and improves the IPC value, as
seen from Figure 4-7. Two factors which determine the TLB miss rate, and therefore the
delay due to TLB misses, are the TLB size and the frequency of TLB flushing. Figure 4-7
shows that scaling up the TLB size initially increases the IF and I_{IPC} values due to a
reduction in the capacity misses in the TLB. For instance, the IF for the 128-entry TLB is
almost four times that for the 64-entry TLB. However, the IF for the 4096-entry TLB
is almost the same as for the 1024-entry TLB. At these large TLB sizes, most of the required
translations are cached in the TLB, and the dominant cause of the TLB-induced delay
is the TLB misses due to TLB flushes, which do not change on increasing the TLB
size. Hence, the IF and I_{IPC} do not vary significantly at these sizes. It is also clear that
the trend in IF is different from the Reduction trends for ITLB and DTLB miss rates.
dbench, as seen from Figure 4-7, shows an IF trend similar to TPCC-UVa, i.e.,
increasing rapidly for smaller TLB sizes and showing smaller increments at larger TLB
sizes. The significant difference is in the actual values of the Impact Factor IF . For
instance, the IF for dbench with a 1024 entry TLB and 60-cycle page walk latency is
22.04%, which is less than half of the 49.65% seen for TPCC-UVa. The reason for
this behavior is the flush profile of these workloads. Over a simulation run of 10 billion
x86 instructions, with CR3 tagging, dbench experiences 25263 flushes, all of which
are forced flushes. On the other hand, TPCC-UVa experiences only 7686 flushes, of
which 2586 are capacity flushes and 5100 are forced flushes. This higher rate of
unavoidable flushes reduces the impact of CR3 tagging for dbench compared to
TPCC-UVa. Thus, the IF for dbench is only about 20%, even at a TLB size of 4096
entries.

Figure 4-7. Increase in workload performance using an 8-entry TMT at a PW of 60 cycles and 8-way associativity. Using the TMT eliminates a significant fraction of the TLB-induced delay, except for dbench, where the impact is limited due to the predominance of forced flushes. (A) Increase in IPC, I_{IPC}. (B) Impact Factor, IF.
SPECjbb differs from TPCC-UVa in that it has a significant number of capacity-driven
TLB misses. The delay due to TLB misses in SPECjbb is primarily caused by its
working set size and the lack of space in the TLB, rather than by the flushing of the TLB.
Hence, the benefit of larger TLB sizes is more pronounced, and the increase in IF with
TLB size is steeper than for TPCC-UVa, as seen from Figure 4-7. It is also observed
that the improvement in the transaction rate (throughput) of SPECjbb, obtained by
instrumenting SPECjbb to indicate the completion of every transaction, tracks I_{IPC}
closely. For instance, for a 60-cycle PW, the transaction rate of SPECjbb improves by
3.29% and 7.21% for TLB sizes of 1024 entries and 4096 entries respectively.

Since Vortex also has a large number of capacity-driven TLB misses, its IF trend
is closer to SPECjbb than to TPCC-UVa. The difference, as seen from Figure 4-7, lies in
the actual values of IF. At a TLB size of 64 entries, the IF values for SPECjbb and Vortex are
very similar at 0.39% and 0.42% respectively. However, at a TLB size of 1024 entries, the
IF for Vortex increases to 71%, which is about three times the IF for SPECjbb. This difference
is due to the working set size of Vortex being smaller than that of SPECjbb, so that the majority of
its translations can be accommodated in a 1024-entry TLB, unlike SPECjbb. This effect
of a large IF when the TLB becomes sufficiently large to capture the entire working
set is seen for SPECjbb as well, at a size of 4096 entries. Both workloads thus fulfill
the expectation, stated in Section 4.4.2, that the IF, i.e., the benefit of the TMT, is large
at large TLB sizes.
Sensitivity of I_{IPC} to the page walk latency
There are recent virtualization-driven enhancements such as Nested Page Tables
(NPT) [36] or Extended Page Tables (EPT) [37] that indicate that page walk latencies
can further increase.

Figure 4-8. Effect of the Page Walk Latency on the improvement in performance with an 8-entry TMT and a 1024-entry 8-way TLB. The performance improvement due to the TMT is significantly higher at larger PW values.

Unlike the one-level Shadow Page Tables being used for address
translation in processors without this extension, processors with EPT/NPT support have
two levels of page tables, both of which are used for translating a virtual to physical
address. This two-level translation increases the cost of a TLB miss significantly. To
understand the impact of this larger TLB miss cost, TPCC-UVa, dbench, SPECjbb
and Vortex are simulated on a 1-CPU x86 machine with 8-way regular and tagged TLBs
(8-entry TMT) under different values of the minimum page walk latency (PW). The I_{IPC}
values obtained from these simulations are shown in Figure 4-8. From these values, it
can be seen that using the TMT increases the IPC of TPCC-UVa by about 12% at a PW
latency of 270 cycles for a 1024-entry TLB. Similarly, the IPC of SPECjbb and Vortex
increases by about 12% and 25%, respectively. In the case of SPECjbb, it is known from
the data presented in Section 4.4.3 that a 1024-entry TLB is not sufficient to capture the
entire working set. Though not shown in Figure 4-8, at a PW of 270 cycles and a TLB
size of 4096 entries, the I_{IPC} for SPECjbb increases to about 28%.
4.5 Architectural and Workload Parameters Affecting the Impact of the TMT
The impact of the Tag Manager Table is in reducing the TLB-induced delay and
thereby improving the performance of the virtualized workload. However, this impact
depends on a few hardware parameters and workload factors. These factors and their
influence on the improvement due to the TMT are presented in this section. These
factors and parameters are also prioritized depending on the significance of their
influence. It should be noted that, for the simulations presented in this section, Reduction
is used as the figure of merit, as it is more sensitive than I_{IPC}.
4.5.1 Architectural Parameters
While the architectural parameters that affect the TLB behavior and the benefit
of using the TMT are discussed in depth in Section 4.4, they are summarized in this
section.
• The size of the Tag Manager Table decides the number of context switch related TLB flushes that cannot be avoided due to lack of capacity in the TMT.

• The size of the TLB controls the number of TLB entries of different processes that are retained across context switch boundaries when the associated TLB flushes are avoided.

• The associativity of the TLB, beyond an 8-way set size, does not have a significant impact on the benefit of using the TMT.

• The value of the minimum page walk latency (PW) influences the cost of TLB misses, and therefore the benefit that is obtained from avoiding these misses using the TMT.
4.5.2 Workload Parameters
From the discussion in Section 4.4.2, it is evident that the TLB size is an important
parameter which affects the benefit that can be obtained from tagging. A small TLB will
experience capacity misses irrespective of whether tags are used to avoid flushes or
not. However, whether the size of the TLB is "small" or "large" depends on the workload
and the number of pages that are accessed by the workload. Similarly, the number
of flushes that can be avoided by tagging, and the reduction in miss rate, both depend on
the number and type of TLB flushes experienced by the workload. Thus, the benefit of
tagging the TLB entries will depend on the workload characteristics.
4.5.2.1 Effect of larger memory footprint
To examine the impact of the working set size of the workload, the SPECjbb
benchmark is selected. The memory utilized by SPECjbb is capped by the heap size of
the Java Virtual Machine (JVM) in which it runs. By increasing the heap size of the JVM,
the working set size of the workload can be varied, thereby varying the demand exerted on
the TLB.
Four different SPECjbb-based workloads with heap sizes of 128MB, 192MB,
256MB and 320MB are prepared by launching SPECjbb in the domU of a simulated
single-processor machine. The workloads are run for 8-way TLBs of sizes varying from
64 entries to 8192 entries³ without tagging, and their miss rates and flush profiles are
observed. Then, the simulations are repeated with CR3 tagging and an 8-entry TMT and
the miss rates and flushes are observed.
From the TLB flushes for the four workloads, shown in Table 4-1, it can be seen that
varying the heap size does not change the number of flushes significantly. Both without
and with TLB tags, the flush counts for the workloads with different heap sizes all fall
within 4% of each other, a variation attributable to system noise. Moreover, there is
little correlation between increasing the heap size and the increase in the number of
flushes. Thus, varying the heap size does not affect the flush profile significantly, and
any variation in the observed TLB miss rate is due to the impact of the differing working
set sizes.
³ The TLB size is varied up to 8192 entries to illustrate the shift in the Reduction trend.
Table 4-1. Flush profile for SPECjbb-based workloads with varying heap sizes

Heap Size   Flushes        Capacity Flushes with    Forced Flushes with
(MB)        without Tags   CR3 Tags, 8-entry TMT    CR3 Tags, 8-entry TMT
128         32519          0                        1189
192         33550          4                        1205
256         32915          0                        1175
320         33012          0                        1151
When the Reduction in DTLB miss rate, shown in Figure 4-9A, is considered, it
can be seen that there is a systematic correlation between the heap size, the TLB size
and the improvement due to tagging. At very small TLB sizes, the change in heap size
does not change the miss rate improvement due to tagging: up to a TLB size of 256
entries, even the smallest heap size of 128MB is sufficient to cause a large number of
capacity misses in the TLB. Hence, the four workloads exhibit an identical, albeit small,
Reduction of 6% in the DTLB miss rate.
However, at a TLB size of 512 entries, the TLB is "large" for SPECjbb with a 128MB
heap size and "small" for SPECjbb with a 320MB heap size. Hence, the Reduction in miss
rate varies by about 4% when the heap size is changed from 128MB to 320MB. This
trend of varying miss rate improvement with varying heap size is more pronounced at
1024-entry and 2048-entry TLB sizes. For a TLB size of 2048 entries, the reduction in
miss rate for SPECjbb with a 128MB heap size is 30% more than for SPECjbb with a 320MB
heap size. Beyond the size of 2048 entries, however, the TLB becomes "large" enough
to accommodate even a 320MB heap size. Hence, the variation in the impact of tagging
reduces and eventually diminishes. Thus, it is clear that the working set size,
in combination with the DTLB size, affects the improvement that can be obtained from
tagging.
The ITLB miss rates from this experiment are presented in Figure 4-9B. It can be
seen that the reduction in the ITLB miss rate does not vary significantly with the working
set size, as increasing the heap size does not affect the instruction footprint and the
ITLB usage significantly.

Figure 4-9. Effect of scaling the memory footprint on the reduction in TLB miss rate with an 8-entry TMT. The reduction in DTLB miss rate is affected by the memory footprint of the workload when the TLB size is between 512 entries and 2048 entries; outside this range, the TLB is either too small or large enough not to be influenced by the memory footprint. The reduction in ITLB miss rate is not significantly affected by the memory footprint of the workload. (A) Reduction in DTLB miss rate. (B) Reduction in ITLB miss rate.
4.5.2.2 Effect of the number of processes in the workload
While varying the heap size changes the pressure exerted on the DTLB, it does not
stress the ITLB. However, on increasing the number of processes in a workload, each
of these processes will require a share in the ITLB and thereby increase the demand for
space in the ITLB. Thus, varying the number of processes in a multi-process application
will create different workloads which are suitable for investigating the relation between
the workload characteristics and the impact of the TMT in reducing the ITLB miss rate.
To create such workloads, TPCC-UVa is utilized. Four different TPCC-UVa based
workloads are prepared by changing the number of warehouses in the benchmark from
1 to 8. Since one client process is forked off for every warehouse, these four workloads
have differing numbers of processes, each of which utilizes a portion of the ITLB
space. These workloads are run on the domU of a simulated uniprocessor x86 machine
with 8-way TLBs of sizes ranging from 64 entries to 1024 entries. The simulations are
run, both with and without tagging, and the flush profile, miss rates and Reduction in
miss rates for the different workloads are observed.
The flush profile for the four different TPCC-UVa workloads is shown in Table 4-2.
In the untagged TLB case, the number of flushes increases by 53% when the number of
warehouses is increased from 1 to 8. A similar trend is seen even when CR3 tags are
used. At a small TMT size of 2 entries, the reduction in the number of flushes is about
60% for the 1-warehouse workload and 56% for the 8-warehouse workload. With an 8-entry
TMT, the capacity flushes become fewer than the forced flushes and stop being the
predominant source of flushes. On further scaling up the TMT size to 16 entries, the
capacity flushes reduce to 0 for all but the 4-warehouse workload.
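These percentages can be re-derived from Table 4-2: with a 2-entry TMT, the fraction of flushes avoided is one minus the remaining (capacity plus forced) flushes over the untagged count,

\[
1 - \frac{18944 + 1915}{49692} \approx 0.58 \quad (\text{1 warehouse}),
\qquad
1 - \frac{29327 + 4654}{76338} \approx 0.55 \quad (\text{8 warehouses}),
\]

consistent with the quoted reductions of about 60% and 56%.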
The impact of the varying number of processes on the Reduction in ITLB miss rate,
with a 2-entry TMT, is presented in Figure 4-10A. The behavior of the reduction in
ITLB miss rate for TLB sizes between 64 entries and 512 entries mimics the DTLB
miss rate reduction behavior between TLB sizes of 512 entries and 2048 entries for the
SPECjbb workloads from Figure 4-9A. The difference in the TLB size range where this
behavior is exhibited is due to the smaller working set size of the individual processes of
the TPCC-UVa workload.
Another interesting difference is that the spread in the improvement curves is much
higher than the spread in the DTLB improvement curves for the SPECjbb workloads.
At the widest point of separation, i.e., at a TLB size of 128 entries, the reduction in TLB
miss rate for the one-warehouse workload is almost twice that of the eight-warehouse
workload.

Table 4-2. Flush profile for TPCC-UVa based workloads with varying number of processes (warehouses) and varying TMT sizes

    TMT Size   Warehouses   Flushes        Capacity Flushes   Forced Flushes
               (processes)  without Tags   with CR3 tags      with CR3 tags
    2          1            49692          18944              19152
    2          2            54521          20876              26724
    2          4            63480          24222              36708
    2          8            76338          29327              4654
    4          1            49692          4302               19152
    4          2            54521          4928               26724
    4          4            63480          6134               36708
    4          8            76338          4654               4654
    8          1            49692          400                19152
    8          2            54521          610                26724
    8          4            63480          959                36708
    8          8            76338          1872               4654
    16         1            49692          0                  19152
    16         2            54521          0                  26724
    16         4            63480          1                  36708
    16         8            76338          0                  4654

This is because, in addition to the variation caused by the differing
TLB demands, the number of flushes also varies significantly for the different workloads.
Thus, in addition to the TLB size, the TMT size is another parameter which may be
"large" or "small" depending on the workload.
Increasing the TMT size will result in further reduction of the TLB miss rates. This
is shown in Figure 4-10B, where the reduction in ITLB miss rate for two extreme TMT
sizes of 2 entries and 16 entries is shown. From Table 4-2, it is clear that a 16-entry TMT
eliminates all but forced flushes. This is reflected in the miss rate of the 1-warehouse
workload for a 64-entry TLB reducing by 17% with a 16-entry TMT as compared to 14%
for a 2-entry TMT. This disparity increases as the TLB size increases, and at 1024 entries
the one-warehouse TPCC-UVa's reduction with a 16-entry TMT is almost twice that with a
2-entry TMT.
Figure 4-10 [panels: A) Reduction in ITLB miss rate with 2-entry TMT; B) Reduction in ITLB miss rate on scaling the TMT size from 2 to 16 entries; Y-axis: Reduction (%); legend 1W-8W for warehouse counts; TLB sizes from 64 to 1024 entries]. Effect of the number of workload processes on the reduction in ITLB miss rate with 8-way associative TLBs. The legend nW indicates n warehouses. The effect of the number of workload processes on the reduction in ITLB miss rate for a given TMT size is pronounced at smaller TLB sizes, but reduces for larger TLB sizes. Increasing the TMT size increases the reduction in miss rate.
4.5.3 Sensitivity Analysis
In order to achieve the most benefit from using the TMT, i.e. maximize the Reduction
of TLB miss rates while minimizing the size of the TLB and TMT, the relative significance
of various parameters in determining the reduction in miss rate should be understood.
For this, a Full Factorial Experiment [96] is performed. Additional details on Full Factorial
Experiments are presented in Appendix A.
To perform this evaluation, four different types of workloads are chosen such that
they occupy different quadrants in a two dimensional space. The number of flushes
and the working set size form the two axes of this space. TPCC-UVa has a smaller
working set size, compared to SPECjbb, but a larger number of TLB flushes and
lies in the smaller-memory higher-flushes quadrant.

Table 4-3. Factors and their levels for the sensitivity analysis

    Factor                       Range of Values
    TLB Size                     64, 128, 256, 512, 1024
    TLB Associativity            4, 8, 16, 32, 64, full
    TLB replacement policy       FIFO, LRU
    TMT size                     2, 4, 8, 16
    Flushes / 10B instructions   High (≥ 30000), Low
    Memory Used                  High (≥ 100MB), Low

Vortex has a memory usage
similar to TPCC-UVa, as measured using the Linux top [87] command, but experiences
fewer flushes than TPCC-UVa and lies in the smaller-memory lower-flushes quadrant.
SPECjbb is a good candidate for the higher-memory smaller-flushes quadrant and
a consolidated workload with TPCC-UVa and SPECjbb is created to serve as the
higher-memory higher-flushes workload. These four workloads are simulated for all
possible combinations of the parameters listed in Table 4-3. It should be noted that the
factors listed in Table 4-3 are controllable design parameters and understanding the
influence of these parameters on the improvement in miss rate due to tagging will help in
design trade-offs. Page Walk latency is not included as a factor in the listing as it is not a
controllable design parameter. From these simulations, the reduction in DTLB and ITLB
miss rates for various parameter combinations are calculated.
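To make the mechanics of such an analysis concrete, the sketch below computes the allocation of variation for a hypothetical two-factor, two-level full factorial design in the spirit of [96]; the factor names and response values are illustrative assumptions and are not taken from the simulations above.

    # Minimal sketch: allocation of variation in a 2^2 full factorial design.
    # Levels are coded -1/+1; 'y' maps each (factor A, factor B) combination
    # to a hypothetical response, e.g. a reduction in DTLB miss rate (%).
    y = {(-1, -1): 20.0, (+1, -1): 55.0,
         (-1, +1): 25.0, (+1, +1): 70.0}

    n = len(y)
    mean = sum(y.values()) / n

    def effect(sign_of):
        # Effect of a term = average of (sign * response) over combinations.
        return sum(sign_of(lvl) * y[lvl] for lvl in y) / n

    q_a  = effect(lambda lvl: lvl[0])            # main effect of factor A
    q_b  = effect(lambda lvl: lvl[1])            # main effect of factor B
    q_ab = effect(lambda lvl: lvl[0] * lvl[1])   # A*B interaction

    # For a 2^2 design the total variation splits exactly across the terms.
    sst = sum((v - mean) ** 2 for v in y.values())
    for name, q in [("factor A", q_a), ("factor B", q_b), ("A*B", q_ab)]:
        print(f"{name}: {100 * n * q * q / sst:.1f}% of variation")

On these made-up numbers, factor A accounts for roughly 93% of the variation; the percentages reported in Table 4-4 are the analogous quantities for the six-factor design of Table 4-3.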
By analyzing the variation among the DTLB miss rate reductions for all these
combinations, the most significant factor in determining the reduction is identified as
the TLB size, with a 65.14% significance. The other dominant factors in determining
the DTLB miss rate improvement are from workload characteristics (memory size and
number of flushes) as seen from Table 4-4. These two factors and their interaction have
a relative influence of almost 20% in determining the impact of tagging. The interaction
between TLB size and memory utilization, i.e. having a larger TLB for workloads using
more memory, is also significant.
Table 4-4. Factors with significant influence on the Reduction in TLB miss rates due to CR3 tagging

    S.No   Factor                       Influence on DTLB      Influence on ITLB
                                        miss rate reduction    miss rate reduction
    1      TLB Size                     65.14%                 70.92%
    2      Flushes / 10B instructions   3.66%                  12.89%
    3      Memory Used                  14.85%                 1.85%
    4      TMT Size                     1.42%                  1.94%
    5      TLB Size*Flushes             1.45%                  5.02%
    6      TLB Size*Memory              5.75%                  1.47%
On performing a similar analysis for the ITLB, the relative significance of the workload's
memory utilization in determining the ITLB miss rate reduction is found to be only 1.85%,
whereas the number of flushes exerts 12.89% influence as shown in Table 4-4. The
primary factor which determines the ITLB improvement is the TLB size with 70.9%
influence. It is also verified from the Full Factorial Experiment that the associativity of the
TLB and the replacement policy used in the TLB play only minor roles in deciding the
impact of CR3 tagging for both ITLB and DTLB.
4.6 Comparison of Process-Specific and Domain-Specific Tags
To compare the performance benefit of using process-specific tags using the TMT
and domain-specific tags as in the qTLB [18], the generic tagged TLB model developed
in Chapter 3 is used to model the qTLB solution by mapping the domain-specific tag
generation functionality to the GMT module and maintaining the current VM’s tag in
the TagCache. Then, TPCC-UVa and Vortex are simulated using both process-specific
and domain-specific tagging strategies, and the IIPC values of the workload with both
types of tagging are observed. Comparing these values, as shown in Figure 4-11, it is
clear that the improvement in IPC is much higher when TMT is used. For TPCC-UVa,
using the TMT results in increasing the performance by more than 10× compared
to domain-specific tags. Moreover, the dependence of IIPC using qTLB on the TLB
size is less marked than IIPC from using the TMT, as only the hypervisor mappings
are retained on domain switches in the qTLB.

Figure 4-11 [panels: A) IIPC comparison for TPCC-UVa; B) IIPC comparison for Vortex; Y-axis: IIPC (%); bars Q (qTLB) and P (8-entry TMT) at page walk latencies PW30, PW60 and PW90 for TLB sizes from 64 to 1024 entries]. Comparison of the performance improvement due to process-specific and VM-specific tagging. Process-specific tagging with an 8-entry TMT (legend P) increases the IPC significantly more than VM-specific tagging using the qTLB approach [18] (legend Q) as it can avoid all types of context switch related flushes. The advantage of process-specific tagging is even more pronounced in the non-I/O-intensive Vortex, where there are few inter-domain context switches.

Once the TLB grows large enough to
accommodate all the hypervisor entries (256 entries in the case of TPCC-UVa), the gain
from further increasing the TLB size is minimal. The ratio of IIPC values with CR3 tagging
to domain-specific tagging is even more pronounced for Vortex due to the significantly
smaller number of inter-domain switches in Vortex. These results clearly show the
benefit of using process-specific tags over domain-specific tags.
4.7 Using the Tag Manager Table on Non-Virtualized Platforms
While the TMT is motivated by the need to reduce TLB-induced performance
degradation on virtualized platforms, it achieves this by avoiding TLB flushes using a tag
to associate every TLB entry with the process to which it belongs. Since the generation
and management of the VASI is not tied to any particular aspect of virtualization, the
TMT may also be used on non-virtualized platforms without requiring any change to the
system software. As a result, the same hardware platform may be used in a virtualized
or non-virtualized manner transparent to the software stack running on it.
To estimate the performance implications of using the TMT on non-virtualized
single-O/S platforms, an x86 single-core machine is simulated using the experimental
framework developed in Chapter 3. Debian Linux 2.6.18-pae kernel is booted on this
simulated platform and I/O-intensive TPCC-UVa as well as memory-intensive vortex
are run on it. The IPC for these workloads, with either a regular 8-way TLB or a tagged
8-way TLB with an 8-entry TMT, at a 60-cycle PW, is observed. The simulations are
repeated for varying TLB sizes, and the IIPC as well as the IF for these workloads are
calculated from these simulations. These values are presented in Figure 4-12.
Figure 4-12 [chart: IIPC (%) on the left Y-axis and IF (%) on the right Y-axis for TPCC-UVa and Vortex at TLB sizes from 64 to 1024 entries]. Performance impact of TMT on non-virtualized platforms with 60-cycle PW and 8-way TLB. The IIPC is presented on the left Y-axis and the IF is presented on the right Y-axis. The TMT is quite effective at eliminating TLB-induced delays for workloads running on non-virtualized platforms even if the performance implications are not highly significant.
From Section 3.5.1, it is clear that the number of flushes is much smaller in a
single-O/S scenario than on a virtualized platform. Given this low flush rate, the
predominant cause for TLB flushes is the lack of TLB space. As expected, Figure 4-12
shows an increasing trend in the IIPC values with TLB size. It is observed that the IIPC
due to the TMT on non-virtualized platforms is quite small. For instance, even for a
1024-entry TLB and for Vortex, the IIPC is only about 1.6%, compared to the 5.9% for the
virtualized Vortex as presented in Figure 4-7A. However, since the TLB-induced delay
is small on single-O/S platforms, this improvement in IPC translates to an IF of 75%.
Similarly, using the TMT for TPCC-UVa results in an IIPC of about 0.5% and an IF of
89%. Thus, the TMT is quite effective at eliminating TLB-induced delays for workloads
running on non-virtualized platforms even if the performance implications are not highly
significant. However, the most important observation from these simulations is that the
TMT can be used with no change in design for non-virtualized scenarios.
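Why a small IIPC can coexist with such a large IF follows directly from the metric definitions (restated in Section 5.3, with the tagged TLB in place of the CShare TLB). Using the Vortex numbers above as an illustration:

IF = 100 \times \dfrac{IPC_{\text{Tagged}} - IPC_{\text{Regular}}}{IPC_{\text{Ideal}} - IPC_{\text{Regular}}} = \dfrac{I_{IPC}}{\left( IPC_{\text{Ideal}} / IPC_{\text{Regular}} \right) - 1}

so \left( IPC_{\text{Ideal}} / IPC_{\text{Regular}} \right) - 1 = I_{IPC} / IF \approx 1.6 / 75 \approx 0.021,

i.e., an ideal TLB would raise Vortex's IPC by only about 2.1% on this platform, which is why removing three quarters of that small delay still moves the IPC by under 2%.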
4.8 Enabling Shared Last Level TLBs Using the Tag Manager Table
A well known principle of data caching is that increasing the size of the cache reduces
the miss rate, and therefore the stalls due to cache misses. However, the
purpose of caching the data, i.e. reducing the time taken to access the data by finding it
in the cache rather than the main memory, is defeated when the cache size increases to
large values. For instance, it has been estimated that the "hit time" for a 1MB cache, using
35nm technology, is about 6ns [97]. A well known solution to this problem is creating a
hierarchy of caches, with the smaller and faster caches closer to the CPU and the larger
and slower caches closer to the memory. By having such multi-level caches, any miss
in the first level cache which finds the data in the second level cache pays a smaller
penalty than accessing the data from the main memory.
When such hierarchical cache organizations in current CMP platforms are
considered, the last level cache (LLC) is usually shared amongst multiple cores
and serves the cache misses from the private cache hierarchies of each of these
cores. Such shared LLCs are especially beneficial for workloads which share data.
Even in workloads that do not have significant sharing, aggregating the on-chip area
allocation for the last level caches as a shared LLC instead of multiple per-core private
LLCs has been shown to result in a lower miss rate [98] due to the better utilization
of the shared cache space, even when there is little sharing between the different
processes which share the cache. Moreover, by caching a block in the last level of a
fully-inclusive hierarchy, the need for snooping among the upper level caches of the
different processors can be avoided [99].
Due to the increasing importance of the TLB on current platforms, the hierarchical
design is being extended to TLBs as well.4 AMD Athlon processors [48] support two
levels of instruction and data TLBs, with a 512-entry L2 ITLB and a 640-entry L2 DTLB.
Similarly, Intel Nehalem processors [4] have a 512-entry L2 TLB for instructions and data.
However, these multi-level TLBs are organized as private per-core hierarchies with no
shared Last Level TLB (LLTLB). Previous work has shown that having a Shared Last
Level TLB will exploit inter-core sharing where a specific entry brought into the LLTLB
may be used by all other cores, thereby avoiding TLB misses and page walks for those
cores [19].
4.8.1 Using the TMT as the Tagging Framework
The primary requirement for sharing the Last Level TLBs, in hardware-managed
TLBs such as on the x86 platform, is the need to distinguish the TLB entries of one
process from the entries of another process. This may be achieved using process-specific
tags which are generated and managed using the Tag Manager Table.
4 Even though there are two levels, it should be noted that both levels of the TLB are used to store virtual-to-physical address translations, even in virtualized scenarios with two-level page tables, and not virtual-to-real or real-to-physical address translations.
When using the TMT, as discussed in Section 4.2, every TLB is provided with
its own TMT. As a result, the CR3-to-VASI mapping established in one TMT may differ
from the mapping established in another TMT, and the TLB entries of the same process
address space may be tagged with different VASIs in different TLBs. Such an approach
is satisfactory even in multi-level TLBs provided there are no shared TLBs. However, in
the case of shared TLBs, it is important to have a consistent process-to-tag mapping
in all TLBs to ensure that an entry in the shared TLB can be used by any core which
shares this TLB. Thus, establishing this consistent process-to-tag mapping is the second
requirement for enabling shared LLTLBs. One way of satisfying this requirement is to
have one global TMT which generates and manages the tags for all per-core private TLB
hierarchies which share the LLTLB.
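As a software sketch of this arrangement, the global table can be modeled as below; the class and method names are illustrative assumptions, not part of any proposed hardware interface. The essential property is that all cores consult the same CR3-to-VASI map, so a process keeps a single tag on whichever core it runs.

    # Sketch: a single global TMT serving all cores (illustrative names).
    class GlobalTMT:
        def __init__(self, num_vasis):
            self.map = {}                       # CR3 -> (SID, VASI)
            self.free = list(range(num_vasis))  # unassigned VASI values
            self.lru = []                       # CR3s, least recently used first

        def tag_for(self, cr3, sid):
            """Return (VASI, capacity_evicted) for this address space."""
            if cr3 in self.map:
                self.lru.remove(cr3)
                self.lru.append(cr3)
                return self.map[cr3][1], False
            if self.free:                       # free entry: no flush needed
                vasi, evicted = self.free.pop(), False
            else:                               # capacity eviction: reuse the
                victim = self.lru.pop(0)        # victim's VASI, which forces a
                vasi = self.map.pop(victim)[1]  # capacity flush of every TLB
                evicted = True                  # sharing this TMT (see below)
            self.map[cr3] = (sid, vasi)
            self.lru.append(cr3)
            return vasi, evicted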
4.8.2 Architecture of the Shared LLTLB
The architecture of the shared Last Level TLB using the Tag Manager Table
is illustrated in Figure 4-13. The platform illustrated in Figure 4-13 consists of two
processors, CPU0 and CPU1 with a two-level TLB hierarchy for each core. It should
be noted that, though the architecture is explained considering a dual-core platform, a
similar architecture may be envisioned for sharing the LLTLB among a larger number
of processors. L0-TLB0 and L0-TLB1 are the private per-core TLBs of CPU0 and
CPU1 respectively. The second level TLB, indicated as L1-TLBS in Figure 4-13, is
the LLTLB which is shared among these cores. One global TMT is used to generate
and manage tags for all three TLBs. However, every core is provided with its own CCR
register to ensure that no additional latency is imposed by the tagging framework on the
critical TLB lookup path.
TLB lookup and miss handling with shared Last Level TLBs
Figure 4-13 [diagram: CPU0 and CPU1, each with a private first level TLB (L0-TLB0, L0-TLB1) and its own CCR, a shared last level TLB L1-TLBS, and a single global Tag Manager Table mapping CR3 and SID to VASI; numbered steps trace a miss in a private TLB (1), the page walk and entry added to the LLTLB (2), the entry copied up to the private TLB (3), and a later miss in the other private TLB (4) that hits in L1-TLBS (5) and is copied over (6)]. Using the TMT for Shared Last Level TLBs. Two private per-core first level TLBs, L0-TLB0 and L0-TLB1, as well as a second level (Last Level) shared TLB, L1-TLBS, are shown. A uniform CR3-to-VASI mapping is ensured by using a global TMT for all TLBs. However, every core is provided with its own CCR register.

The TLB lookup process in the shared LLTLB scenario happens as shown in
Figure 4-13. A process P0 running on CPU0 with the tag VASI0 may require a
translation for virtual address VA0. If this translation is not available in L0-TLB0,
as shown in Step 1, this will trigger a lookup in the Last Level Shared TLB L1-TLBS.
The VASI in the CCR of CPU0 is dispatched to the LLTLB as a part of the LLTLB lookup.
Only if a translation for VA0 with this VASI tag VASI0 is found in the LLTLB will the
lookup result in a hit. If this entry is not present in the LLTLB, the TLB lookup is declared
a TLB miss and a page walk is triggered. On completion of the page walk, the entry
is cached in L1-TLBS with tag VASI0. This is shown in Step 2. After this entry is
cached in the LLTLB, to maintain the fully inclusive nature of the TLB hierarchy, the entry
is copied to L0-TLB0 as shown in Step 3. Once this entry (VA0, VASI0) is cached
in the LLTLB, it will be available to service any TLB misses from either L0-TLB0 or
L0-TLB1.

For instance, the process P0 may get scheduled on CPU1 at some point in time
after the (VA0, VASI0) entry gets cached in the LLTLB. If the translation for VA0 is
required by P0 and is not found in L0-TLB1, as shown in Step 4, the lookup will hit
in L1-TLBS and avoid an expensive TLB miss, as depicted in Step 5 and Step 6. In
addition to P0 being rescheduled on CPU1, threads of a multi-threaded workload which
share the address space will benefit from such a shared LLTLB. It should be noted that,
while this discussion focuses on fully-inclusive TLB hierarchies, the use of the TMT to
enable shared LLTLBs is equally applicable in the case of exclusive TLB hierarchies as
well.
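The lookup and fill path just described can be rendered compactly in software. The following is a minimal sketch, assuming dictionary-based TLB models keyed by (VASI, virtual page number) and a fully-inclusive hierarchy; the function and parameter names are illustrative, not a proposed hardware interface.

    # Sketch: private L0 lookup backed by a shared, tagged LLTLB (inclusive).
    def translate(ccr_vasi, l0_tlb, lltlb, vpn, page_walk):
        key = (ccr_vasi, vpn)          # the CCR supplies the tag, keeping the
        if key in l0_tlb:              # global TMT off the critical lookup path
            return l0_tlb[key]
        if key in lltlb:               # L0 miss, LLTLB hit (Steps 4-6): reuse an
            l0_tlb[key] = lltlb[key]   # entry brought in by any sharing core
            return l0_tlb[key]
        ppn = page_walk(vpn)           # true miss (Steps 1-3): walk the page
        lltlb[key] = ppn               # table, fill the shared LLTLB first, then
        l0_tlb[key] = ppn              # copy into the private L0 for inclusion
        return ppn

Because all cores use the same CR3-to-VASI mapping, a key installed by CPU0 is exactly the key CPU1 probes after P0 migrates, which is what turns the Step 4 miss into the Step 5 hit.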
TLB flush handling with shared Last Level TLBs
One implication of using a global TMT is the generation of "false" TLB flushes. A
situation may arise during a context switch from P1 to P2 (on CPU1) where the CR3
of P2 is not in the TMT and, due to limited capacity, there are no free entries in the
TMT. In this case, depending on the replacement policy, a victim entry in the global TMT,
(CR33, SID3, VASI3), belonging to P3 is chosen. The CR3 and SID values of P2 replace
CR33 and SID3, while VASI3 is reused for P2. To avoid the TLB entries of P3 being used
for P2, the per-core private TLB hierarchy of CPU1 is flushed with a capacity flush.
However, in shared LLTLB scenarios, this capacity flush will also have to flush all TLB
hierarchies that share the LLTLB with CPU1's TLB hierarchy, including the shared LLTLB.
This is required to ensure that no entry belonging to P3 remains cached in any of these TLBs
with the tag VASI3. Thus, the number of capacity flushes experienced by the shared LLTLB is the
sum of the capacity flushes experienced by the private last level hierarchies it replaces.
However, the number of slots in the global TMT can be set to the sum of the number
of slots in all the per-TLB TMTs it replaces, thereby reducing the occurrence of capacity
flushes.

Forced flushes, on the other hand, are propagated to all TLB hierarchies using
mechanisms such as inter-processor interrupts [53] even in existing platforms. Hence
the number of forced flushes experienced by the LLTLB remains constant irrespective of
whether it is shared or not.
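A minimal sketch of this flush behavior, reusing the GlobalTMT model sketched in Section 4.8.1, is shown below; the broadcast over all sharing hierarchies is the only difference from the per-core case, and all names are illustrative.

    # Sketch: context switch handling with a global TMT (illustrative names).
    def context_switch(tmt, sharing_tlbs, core, new_cr3, new_sid):
        vasi, capacity_evicted = tmt.tag_for(new_cr3, new_sid)
        if capacity_evicted:
            # The victim's VASI is being reused, so every TLB that may hold
            # entries tagged with it -- each private hierarchy sharing the
            # LLTLB, plus the LLTLB itself -- receives a capacity flush.
            for tlb in sharing_tlbs:
                tlb.clear()
        core.ccr = (new_cr3, new_sid, vasi)  # a TMT hit costs no flush at all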
4.8.3 Miss Rate Improvement Due to Shared Last Level TLBs
In addition to the benefit for workloads which share address spaces, using shared
LLTLBs will result in a better utilization of the TLB space. Thus, allocating a fixed amount
of TLB space as a shared TLB rather than as two private TLBs will result in reducing the
TLB miss rate. To understand the reduction in miss rate that can be achieved using the
shared LLTLB for virtualized workloads, a two-processor x86 machine is simulated using
the experimental framework described in Section 3.2. The tagged TLB model developed
in Section 3.2.2 is modified to include an interface to facilitate communication between
two levels of a TLB hierarchy. Using this tagged TLB model, both CPUs in the simulated
platform are configured with a two-level private per-core TLB hierarchy with no sharing
of the last level TLB.
Xen is booted on this platform and the pinned workloads TPCC-Vortex-0102 and
TPCC-SPECjbb-0102 are created.5 Pinning the workloads in this fashion ensures that
dom1 running TPCC-UVa is the only workload domain to be scheduled on CPU0 and
dom2 running Vortex or SPECjbb gets scheduled only on CPU1. These workloads
are run on the simulated platform with a 64-entry first level TLB and varying last level
TLB sizes with an 8-entry per-core TMT and the miss rates for the various TLBs are
observed. Then, the second level TLB of both the private per-core hierarchies are
replaced with a shared TLB and the 8-entry TMTs are replaced with a 16-entry global
TMT. The simulations are repeated for varying shared LLTLB sizes and the miss rates
for the various TLBs are observed.
The DTLB miss rates for the private and shared LLTLBs from these simulations
are compared in Figure 4-14. It should be noted that the miss rate for a private LLTLB
of a certain size is compared to the miss rate of a shared LLTLB of twice that size.
5 The details of creating the pinned workloads and their nomenclature are explained in Section 3.3.3.
Figure 4-14 [bar chart: LLTLB Misses per Thousand Instructions (MPKI) for private versus shared last level TLBs, for TPCC and Vortex in TPCC-Vortex-0102 and TPCC and SPECjbb in TPCC-SPECjbb-0102, at 64-, 256- and 1024-entry TLB sizes]. Reduction in DTLB miss rate due to Shared Last Level TLB. The TLB size specified on the X-axis is the size of the private per-core LLTLB and half the size of the shared LLTLB. Having a shared Last Level TLB reduces the DTLB miss rate by 0% to 35% depending on the TLB size and workload.
From this, it is observed that the shared LLTLB has a lower miss rate compared to the
private per-core LLTLB. For instance, a 64-entry private per-core LLTLB results in miss
rates of 0.23 MPKI and 2.42 MPKI for TPCC and Vortex in TPCC-Vortex-0102 workload
respectively. However, when the private LLTLBs are replaced with a shared 128-entry
TLB, the miss rate of Vortex drops to 2.11 MPKI, a 13% reduction. It is also observed
that this reduction is significantly higher for Vortex and SPECjbb which have higher data
memory footprint compared to TPCC. In the case of TPCC, the potential increase in
TLB space due to using a shared LLTLB is offset by the significantly higher usage of
that shared TLB by Vortex and SPECjbb. Having TLB usage controls may be envisioned
to increase the benefit of the shared LLTLB for TPCC-UVa. However, the miss rate
for TPCC-UVa is never larger when using shared LLTLB compared to private per-core
LLTLB. The average reduction in DTLB MPKI for SPECjbb and Vortex are about 15%
and 28% respectively. These results clearly demonstrate the benefit of using shared
Last Level TLBs.
4.9 Summary
The Tag Manager Table is proposed in this chapter to generate and manage
process-specific TLB tags in a software-independent manner for hardware-managed
TLBs. The design and working of the TMT are discussed, and the reduction in the TLB
miss rate and TLB-induced delay due to the TMT is analyzed. The various hardware
and workload-related factors that influence the benefit of the TMT are investigated and
prioritized. It is found that using the TMT for typical transaction-processing and CPU
intensive workloads reduces the delay due to TLB misses by as much as 50%-70%
compared to untagged TLBs and improves the IPC by as much as 12%-25% for large
TLB sizes and page walk latencies. The use of the TMT in non-virtualized platforms as
well as to enable shared Last Level TLBs is also explored.
CHAPTER 5
CONTROLLED SHARING OF HARDWARE-MANAGED TLB
Resource consolidation using virtualization has emerged as a viable way to share
the resources of chip multicore processors among multiple workloads which have
different operating system (O/S) requirements. By consolidating different workloads on
the same platform, the utilization of the platform resources can be increased. This has
made virtualization extremely attractive to the server industry.
In a consolidated environment, the performance of one virtual machine (VM)
will be susceptible to the utilization of shared resources by other VMs. In addition,
"system noise", i.e. the operating system carrying out vital functions such as memory
management and task scheduling, also causes variation as well as degradation in the
performance of virtualized workloads. This interference manifests as consumption of
resources by other VM or system processes, which could have been otherwise devoted
to increasing the performance of user applications, and is a major limiting factor in
the performance of applications in large-scale systems [100, 101]. Hence, there is
a need for controlling and managing the usage of shared resources. Such resource
management techniques are vital for providing scalable and deterministic performance
in future architectures such as Datacenter-on-chip [102].
Resource management in CMP platforms for providing Quality of Service, especially
in the memory subsystem, has been the focus of many research efforts. Kim, Chandra
and Solihin [103] explore the sharing of caches for providing a fair share of the cache
to different hardware threads. Iyer et al. [104] and Hsu et al. [105] present different
types of cache-sharing policies for the last level cache for varied system-level goals,
including maximizing the system throughput and ensuring uniform throughput for each
of the threads. Chang and Sohi [106] discuss adaptively increasing the cache space
allocated to a thread in the short run, while maintaining fairness in the long run. Qureshi
and Patt [107] investigate the capability of different workloads to use the cache with
varying degrees of efficiency and use this information to decide the cache allocation.
Srikantaiah et al. [108] explore the pollution in the cache due to multiple cores sharing
the last level cache and propose schemes to reduce this pollution by modifying the
cache eviction policies. Architectural support for O/S-level cache management has been
investigated by Rafique, Lim and Thottethodi [109]. Selective replication [110] to improve
the performance of selected applications has been proposed by Beckmann, Marty and
Wood.
However, since only one process could use the TLB at a given time before the
advent of tagged TLBs for reducing the virtualization overhead, research on usage
control in hardware-managed TLBs is limited to the qTLB work [18]. This assumption of
a process owning the entire TLB, however, is changed in the context of tagged TLBs.
While the TMT enables the sharing1 of the TLB among multiple workloads, thereby
improving the performance of these workloads, it also makes the TLB a shared resource
and the performance of an application in one VM will vary depending on the TLB usage
of other VMs which run on the same core. This necessitates mechanisms and policies
for managing the use of the TLB.
To address this issue, the CShare (Controlled-Share) hardware-managed TLB
is proposed in this dissertation. At the core of the CShare TLB is the use of a TLB
Share Table (TST), in conjunction with TMT-generated process-specific tags, for
sharing the TLB between multiple processes and for controlling the TLB space used
by these processes. By assigning various VMs a fixed slice of the shared TLB space
using the TST, the TLB behavior of a workload running in a VM can be isolated from the
TLB usage of other VMs running on the same platform. The TST can also be used to
1 The sharing of a single TLB by multiple processes is the main focus of this chapter. However, the architectures developed and analysis performed here are viable in the context of sharing across multiple TLBs, such as shared Last Level TLBs.
selectively improve the performance of a high priority workload by restricting the TLB
usage of other low priority workloads running on the same platform. In such scenarios,
the performance improvement for the high priority workload that is achieved using
the TMT can be further increased by 1.4× by restricting the TLB usage of low priority
workloads. The cost of this selective performance enhancement for various types of
workloads is analyzed and the use of dynamic usage control policies for minimizing this
cost and improving the overall performance of the consolidated workload is explored.
5.1 Motivation
Typical usage of virtualized platforms involves launching multiple workloads on a
platform, each in its own VM, and having these VMs share resources. Thus it is important
to investigate the behavior of the tagged TLB for such consolidated workloads, in
addition to stand-alone workloads. To understand this, consolidated workloads are
created by launching two applications, TPCC-UVa and Vortex for instance, on dom1
created by launching two applications, TPCC-UVa and Vortex for instance, on dom1
and dom2. Though no application is launched on dom0, the interactions between domU
and the physical machine (such as I/O requests for TPCC-UVa) are served by the
drivers residing on dom0 and instructions are executed on this domain as well [35].
These consolidated workloads are run on a 1-CPU x86 simulated machine, using the
framework outlined in Section 3.2, and the IIPC due to the tagged TLB (without any
explicit usage control) is observed. In addition to the IIPC for the entire consolidated
workload, the details of the domain switches are obtained by instrumenting the Xen
kernel and are used to classify executed instructions on a per-dom basis, thus enabling
the calculation of IIPC on a per-domain basis. These IIPC values are shown in Figure 5-1.
Figure 5-1 [panels: A) IIPC for TPCC(dom1)-Vortex(dom2); B) IIPC for TPCC(dom1)-Specjbb(dom2); C) IIPC for Vortex(dom1)-Specjbb(dom2); Y-axis: IIPC (%) for dom0, dom1 and dom2 at TLB sizes from 64 to 4096 entries]. Performance improvement for consolidated workloads with uncontrolled TLB sharing with 8-entry TMT and PW of 60 cycles. The performance improvement due to tagging for a domain clearly depends on the other domains which share the TLB.

From these simulation results, the following observations can be made:

• While dom0 does not run any actual workload, its IPC shows a definite benefit from increasing the TLB size. In fact, even at large TLB sizes of 512 entries, further scaling up of the TLB size results in further increasing dom0's IPC. This behavior is observed because, in all three workloads, dom0 is scheduled for less than 8% of the total running time. As a result, the TLB entries cached by dom0 get evicted by the entries of dom1 and dom2 before they can be significantly reused.
• The effect of sharing the TLB is also apparent on considering the IIPC for TPCC-UVa (dom1) in the TPCC-SPECjbb and TPCC-Vortex workloads, as seen in Figures 5-1B and 5-1A respectively. In these workloads, dom1 is scheduled for about 35% and 42% of the total execution time, for TPCC-SPECjbb and TPCC-Vortex respectively. Thus, it uses only a part of the TLB space and, unlike the IIPC trend for TPCC-UVa when it is run alone (as seen in Figure 4-7A), the increase in IIPC does not taper off with increase in TLB size beyond 256 entries. Even beyond this size, the TLB space used by TPCC is not sufficient to hold all its translations, as it has to be shared with the other workload.

• The higher TLB utilization of SPECjbb compared to Vortex, as discussed in Section 4.4.3, lowers TPCC-UVa's IIPC in TPCC-SPECjbb compared to TPCC-Vortex for any given TLB size. Thus it is clear that the shared TLB usage is heavily influenced by the nature of the workloads which share it.
These observations clearly indicate that, in the absence of explicit controls, the
amount of shared TLB space used by a domain depends on the time for which the
domain is scheduled on a CPU, the working set size of the workload running in the
domain, and the workloads running on other domains which share the TLB. Clearly,
with even more VMs sharing the TLB, "noise" in the performance of the workloads
will increase. This motivates the need for controlling the usage of the shared TLB by
different workload VMs as well as dom0.
5.2 Architecture of the CShare TLB
The CShare TLB architecture consists of the regular hardware-managed TLB with
two additional hardware tables: the Tag Manager Table (TMT), and the TLB Share Table
(TST). The TMT is responsible for enabling multiple address spaces to share the TLB
and has been discussed in depth in Chapter 4. The TST is used to control the shared
TLB usage amongst the different sharers.
The TLB Share Table (TST) is used for controlling the TLB usage, on a per-TLB
set basis, by choosing the victim during TLB replacement depending upon the current
usage of the different sharing classes. The sharing classes are the granularity at which
the TLB usage is controlled, and each class may consist of a process, a VM (as in this
work) or a collection of VMs. In this work we use the virtual machine as the sharing
class. Each entry of the TST, representing one sharing class, contains the TLB usage
restrictions for that class and has four fields:
• The SID field, which holds the identifier of the sharing class. The use of SIDs provides the flexibility of changing the granularity of the sharing classes, while including this SID as a part of the TMT entry provides a convenient mapping between the different processes and their sharing classes.

• The PRIORITY field, which indicates the priority of the sharing class and is used to determine the victim in situations where no sharing class has exceeded its usage limits.

• The SHARE field, which indicates the maximum number of entries per TLB set that can be used by the sharing class.

• The CNT field, which is used to store the number of entries in a set that are occupied by the sharing ID. Unlike the previous three fields, which are programmed by the VMM, the CNT field is updated by the hardware. (A software model of such an entry is sketched below.)
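A minimal software model of one such entry follows; the field names track the list above, while the class name and types are illustrative assumptions.

    # Sketch: one TLB Share Table entry per sharing class (here, per VM).
    from dataclasses import dataclass

    @dataclass
    class TSTEntry:
        sid: int        # identifier of the sharing class (set by the VMM)
        priority: int   # breaks ties when no class exceeds its usage limit
        share: int      # max entries per TLB set this class may occupy
        cnt: int = 0    # entries of this class found in the current victim
                        # set; recomputed by hardware on each TLB miss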
Figure 5-2 [diagram: the TLB lookup path with the Tag Manager Table, CCR and TLB Share Table (SID, SHARE, PRI, CNT); on a miss, step 1 gets the per-SID CNT for the victim set, step 2 uses the per-SID SHARE, CNT and PRI to get the V-SID, step 3 looks up the CR3 tag for the page-walk result, and step 4 selects the victim and replaces it]. Controlled TLB usage using CShare architecture. The victim evicted from the TLB is chosen depending on the allocations and current usages for the different sharing classes.
The TLB Share Table is looked up only in the case of a TLB miss, as shown in
Figure 5-2, and, similar to the TMT, is not in the critical path of TLB lookups. The virtual
address is used to calculate the TLB set (victim set) in which the translation (new
entry) will be stored. The per-SID (per-VM) usage information for this set is obtained,
as shown in step 1 of Figure 5-2, by counting the number of entries in that set and
storing the counts in the CNT fields of the appropriate sharing classes in the TST. Based on these
CNT and SHARE values for the different classes, the SID to which the victim should
belong (V-SID) is calculated, as shown in step 2 of Figure 5-2. It should be noted that,
since the CNT and SHARE information is computed on a per-set basis, the time for
selecting the V-SID is small and can be overlapped with the page walk. Once the V-SID
is determined, a victim belonging to this sharing class is chosen from the victim set
using the regular TLB replacement heuristic (e.g. LRU). On completion of the page
walk, which proceeds in parallel with the selection of the victim, the obtained translation is
tagged with the CR3 tag of the current process from the CCR, as shown in step 3 of
Figure 5-2. The chosen victim is replaced with this translation as depicted in step 4.
The actual algorithm used in selection of V-SID depends on the motivation behind
controlling the usage of the TLB (performance isolation or performance enhancement).
When performance isolation is the goal, the TLB can be effectively partitioned among
the VMs by assigning a fixed number of TLB slots (SHARE values) to each VM, such
that the sum of these SHARE values does not exceed the total number of slots in the
TLB set. With such partitioning, any VM whose CNT value for a particular victim set is
less than its SHARE value is guaranteed to find at least one free slot in the set since
other VMs would not have exceeded their allotted SHAREs. That free slot is used for
caching the new entry. On the other hand, if CNT of VM1 is equal to the SHARE of VM1,
one of VM1’s entries in the set is evicted and that slot is used for caching the new entry.
Such a strict enforcement of the SHARE for different VMs, however, may not be
suitable when the motivation behind using CShare TLB is improving the performance of
a high priority workload and is not enforcing TLB isolation through TLB partitioning. For
instance, when the VM running the high priority workload has used all of its reserved
slots, it may borrow unused slots belonging to a VM which runs a lower priority workload
in order to reduce the miss rate and increase the performance of the high priority
workload. These slots may be reclaimed by the VM running the low-priority workload
when needed. Hence, when performance enhancement of selected high priority
workloads is the goal, the algorithm for selection of V-SID allows any VM, irrespective
of its usage limits, to use any available free slots. The usage limitations and PRIORITY
values of different domains come into effect in deciding the V-SID only when no free slot
is available and some entry from the set has to be evicted to cache the new translation.
Both these algorithms are shown in Table 5-1.

Table 5-1. Algorithms for selection of victim SID

FOR PERFORMANCE ISOLATION BY TLB RESERVATION
1) Count the slots used in the victim set for CCR.SID and store in the appropriate CNT
2) If CCR.SID.CNT >= CCR.SID.SHARE, V-SID = CCR.SID
3) Else:
   3.1) Choose one of the (guaranteed) free slots and use it for caching the new translation

FOR PERFORMANCE ENHANCEMENT OF A SELECTED WORKLOAD
1) If a free slot is available in the victim set, use it
2) Else:
   Count the slots used in the victim set on a per-SID basis and store in the appropriate CNTs
   2a) If CCR.SID.CNT >= CCR.SID.SHARE, V-SID = CCR.SID
   2b) Else:
       2b.1) For each SIDi in the TST: if SIDi.CNT > SIDi.SHARE, V-SID = SIDi
       2b.2) If no V-SID yet, for each SIDi in the TST: if SIDi is low priority and SIDi.CNT > 0, V-SID = SIDi
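Both policies can be rendered compactly in software, as in the sketch below; tst is the list of TSTEntry records from the earlier sketch with CNT already recomputed for the victim set, cur is the entry whose SID matches the CCR, and the function names and LOW constant are illustrative. Returning None means a free slot (or the regular replacement heuristic) is used instead of a per-class victim.

    # Sketch: victim-SID (V-SID) selection for the two policies of Table 5-1.
    LOW = 0  # illustrative priority level for low-priority sharing classes

    def vsid_isolation(cur):
        # Strict partitioning: a class at its SHARE evicts one of its own
        # entries; otherwise a free slot is guaranteed to exist in the set.
        return cur.sid if cur.cnt >= cur.share else None

    def vsid_enhancement(cur, tst, free_slot_in_set):
        if free_slot_in_set:                      # step 1: borrow while possible
            return None
        if cur.cnt >= cur.share:                  # step 2a: over its allocation,
            return cur.sid                        # evict one of its own entries
        for e in tst:                             # step 2b.1: reclaim from any
            if e.cnt > e.share:                   # class that has borrowed
                return e.sid                      # beyond its SHARE
        for e in tst:                             # step 2b.2: otherwise evict
            if e.priority == LOW and e.cnt > 0:   # from a low-priority class
                return e.sid                      # holding at least one entry
        return None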
5.3 Experimental Framework
The CShare TLB is modeled by augmenting the TMT model, described in
Section 4.3, with the TST. The size of the TST is set to match the number of entries
in the TMT. The functionality of the TST is verified by using a Functional Check mode
wherein the number of TLB slots used by each SID is counted and ensured to be within
the specified limits during every TLB replacement.
The metrics used to study the impact of controlling TLB usage with the TST are
similar to the metrics used in Chapter 3 and Chapter 4 and are presented here for
reference.
• Number of TLB flushes
• DTLB and ITLB miss rate and the Reduction in miss rate, where
\text{Reduction (\%)} = 100 \times \left( 1 - \dfrac{\text{TLB miss rate with tags}}{\text{TLB miss rate without tags}} \right)    (5-1)
• Instructions per Cycle (IPC) and R_IPC, I_IPC and IF, where

R_{IPC} = 100 \times \left( 1 - \dfrac{IPC_{\text{Regular TLB}}}{IPC_{\text{Ideal TLB}}} \right)

I_{IPC} = 100 \times \left( \dfrac{IPC_{\text{CShare TLB}}}{IPC_{\text{Regular TLB}}} - 1 \right)

IF = 100 \times \left( \dfrac{IPC_{\text{CShare TLB}} - IPC_{\text{Regular TLB}}}{IPC_{\text{Ideal TLB}} - IPC_{\text{Regular TLB}}} \right)    (5-2)
5.4 Performance Isolation using CShare Architecture
In this section, the effect of using the CShare architecture to enforce partitions in the
TLB is investigated. The workloads used for this investigation are the TPCC-TPCC-0012
and TPCC-Vortex-0012. These workloads are created by simulating a two-processor
x86 machine using the experimental framework described in Section 3.2. Xen is
booted on this machine and two user domains (domUs) are created, with one virtual
CPU (VCPU) per domain. TPCC-UVa is run in the first domU (dom1) and, once the
application reaches its working phase, the domain is paused. Then, depending on
the required workload, TPCC-UVa or Vortex is launched in the second domU (dom2)
and allowed to reach its working phase. Then, dom1 is resumed and the VCPUs of
both dom1 and dom2 are pinned to CPU1 of the Simics simulated machine. In addition, the
VCPUs of dom0 are pinned to CPU0 of the simulated machine. Pinning the VCPUs in
this fashion ensures that only dom1 and dom2 are scheduled on CPU1 of the simulated
machine. Thus only the workloads on these domains will share the TLB of CPU1.
The performance isolation usage control policy, outlined in Table 5-1 is used to
partition TLB1 into two and allocate these partitions to dom1 and dom2 explicitly.
TPCC-TPCC-0012 is simulated using the framework described in Section 5.3 for various
TLB sizes and various TLB partition sizes. The DTLB and ITLB miss rates, expressed
as Misses per Thousand Instructions (MPKI), for 64-entry TLB and 512-entry TLB as
obtained from these simulations are presented in Figure 5-3. The miss rate is used as
the metric since it depends only on the shared TLB, which is being controlled using
CShare architecture, while the IPC depends on many other factors including the cache
and memory utilization of the workloads which are not being controlled.
From this figure, it is observed that the DTLB miss rate has a strong dependence
on the size of the TLB space allocated to the domains. For instance, when 10% of the
TLB is reserved for dom1, its miss rate is almost 8× the miss rate of dom2. The
lowest miss rate for both domains is achieved when they share the TLB equally, as
both domains run the same workload and show similar TLB usage requirements. A
similar behavior is observed in the case of the ITLB miss rates. It should be noted that,
while both the 64-entry TLB and the 512-entry TLB are insufficient to capture the working set
size of TPCC-UVa and Vortex combined, as seen from Section 4.4, the smaller size of
the 64-entry TLB causes the MPKI variation with partition sizes to be larger in magnitude
and smoother than the MPKI trends for the 512-entry TLB. Thus, it is clear that the TST
serves as a good control knob for controlling the TLB usage on a per-domain basis.
Figure 5-3 [panels: A) DTLB miss rate for 64-entry TLB; B) ITLB miss rate for 64-entry TLB; C) DTLB miss rate for 512-entry TLB; D) ITLB miss rate for 512-entry TLB; Y-axis: Misses per Thousand Instructions (MPKI); X-axis: percentage of the TLB reserved for dom1, from 0 to 100; series: dom1 (TPCC) and dom2 (TPCC)]. Effect of varying TLB reservation on miss rate is shown by plotting the TLB miss rate for TPCC-TPCC-0012 for varying allocation of the TLB space for each domain. The miss rates of the domains show a strong correlation with their allocations.
Figure 5-4 [bar chart: per-domain DTLB MPKI for TPCC (dom1) and TPCC or Vortex (dom2) in T-T and T-V as dom1's share is varied from 50% down to 10%]. Miss rate isolation using the TMT architecture is shown by plotting the per-domain miss rates on a 64-entry CShare TLB for TPCC-TPCC-0012 (T-T) and TPCC-Vortex-0012 (T-V). Despite the different demands on the TLB by dom2, the miss rate of dom1 is isolated from the influence of dom2.
To show that the usage "control knob" property of the TST can be used to isolate
the TLB miss rates of workloads, the simulations are repeated for the TPCC-Vortex-0012
workload. The per-domain DTLB miss rates for a 64-entry TLB for both TPCC-TPCC-0012
and TPCC-Vortex-0012 for a range of partition sizes, as obtained from these simulations,
are shown in Figure 5-4. When the per-domain miss rates are considered for TPCC-TPCC-0012,
since both domains run the same workload, they exhibit similar miss rates of about 0.61
MPKI when allocated equal shares in the TLB (dom1=50%). On reducing the TLB
usage limit for dom1 and allocating a larger share of the TLB for dom2, the miss rates
for these domains begin to show differing trends. At dom1=10%, with dom2 allowed
90%, the miss rate of TPCC-UVa on dom1 is 4.07 MPKI which is almost an order of
magnitude greater than the miss rate of TPCC-UVa on dom2. A similar trend is seen for
the consolidated workload TPCC-Vortex, with the main difference being that the miss
rate for Vortex is much larger than the miss rate for TPCC-UVa even when it is allocated
a larger portion of the TLB due to its memory intensive behavior.
Since Vortex is more "TLB hungry" than TPCC-UVa, the miss rate of TPCC-UVa will
be increased when it is consolidated with Vortex in the absence of any usage control.
However, from Figure 5-4, it is seen that the miss rate of TPCC-UVa running on dom1
in both the consolidated workloads is very close and depends only on the portion of
the TLB that is reserved for it. It is also seen that the miss rate of dom1 is independent
of the workload running on dom2, clearly indicating the efficacy of the CShare TLB in
isolating the TLB miss rate of one domain from the influence of other domains.
5.5 Performance Enhancement Using CShare Architecture
In addition to isolating the TLB behavior of an application running on a VM from
other VMs running on the same platform, the CShare architecture may also be used
to further improve the performance increase achieved by using the TMT. Different
applications with varying working set sizes and memory access patterns exhibit
correspondingly varying patterns in the usage of the TLB space. By controlling the
TLB space and regulating the amount of TLB space used by every VM based on its
memory access pattern, it becomes possible to achieve a lower TLB miss rate and
improve the performance of the workloads.
5.5.1 Classification of TLB Usage Patterns
Typical multimedia applications exhibit a "streaming" memory access pattern, where
the data accessed from the main memory show regularity in the stride of access [111].
In such applications, the number of data accesses per instruction is typically very high,
and there is little reuse in the accessed data. Applications which exhibit such memory
behavior are termed "streaming applications" in this dissertation.
To understand the TLB implications of streaming applications, several workload
applications are simulated on the domU of a uniprocessor x86 machine with the CShare
TLB, without explicit TLB usage control and an 8-entry TMT, using the framework
described in Chapter 3. The selected applications are:
• Vortex: a memory intensive database manipulation workload from the SPEC CPU 2000 suite of benchmarks [77].

• TPCC-UVa: an I/O intensive implementation of the TPC-C benchmark [82].

• Apsi: a weather prediction program which reads a 112 × 112 × 112 array of data and iterates over 70 timesteps [112].

• Art: a neural network program used for object recognition in thermal imagery [113].

• Lucas: a program to check the primality of Mersenne numbers of the form 2^n − 1 [114].

• Swim: a compute intensive floating point program for shallow water modeling with a 1335 × 1335 array of input data [115].
The DTLB miss rates for the domU running these applications are observed for
varying TLB sizes. These miss rates, normalized to the miss rate for a 64-entry TLB,
are presented in Figure 5-5A. From this, it can be seen that increasing the TLB size
does not reduce the TLB miss rate to the same extent in all applications. For instance,
TPCC-UVa and Vortex show significant benefit from the increase in TLB size. However,
Apsi and Art show a smaller reduction in DTLB miss rate of about 20% up to TLB sizes of
256 entries and 512 entries respectively. Beyond these TLB sizes, the TLB miss rate
rapidly reduces to less than 5% of the 64-entry TLB miss rate. Yet another trend in the
TLB miss rates is exhibited by Swim and Lucas. In these workloads, there is little benefit
from scaling up the TLB size, and even at a large TLB size of 1024 entries, the DTLB miss
rate is not greatly reduced. For instance, at this TLB size, the miss rate of Swim is 98.8%
of the 64-entry DTLB miss rate. From this trend, the applications can be classified, in a
manner similar to previous works [18], into three categories:
• Type 1 Applications such as TPCC-UVa and Vortex, which have a smaller working set size and show good reuse in the access pattern. These workloads are characterized by a concave parabolic response of the normalized DTLB miss rate to increasing TLB sizes. In such applications, increasing the TLB size reduces the miss rate even when the size is insufficient to accommodate the entire working set.

• Type 2 Applications such as Apsi and Art, which have a small to medium working set size, but show relatively less reuse in the access pattern. The normalized DTLB miss rates of these applications show a convex parabolic trend with increasing TLB size. As long as the TLB size is not sufficient to accommodate the working set size, there is little benefit to increasing the TLB size since the reuse of entries is not very high. However, once the TLB size is large enough to capture the entire working set, the DTLB miss rate reduces significantly.

• Type 3 Applications such as Swim and Lucas, which are streaming applications. Any increase in the TLB size does not significantly reduce the DTLB miss rate.

Figure 5-5 [panels: A) DTLB miss rate for the domU running the workload application; B) ITLB miss rate for the domU running the workload application; C) DTLB miss rate for dom0; D) ITLB miss rate for dom0; Y-axis: MPKI normalized to the 64-entry TLB MPKI; TLB sizes from 64 to 1024 entries; series: Swim, Apsi, Lucas, Art, TPCC, Vortex]. Classification of TLB usage patterns. Applications can be classified into one of three types depending on the reduction in miss rate upon increasing the TLB size.
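This taxonomy can be made mechanical with a simple heuristic over the normalized miss-rate curve, as in the sketch below; the thresholds are illustrative choices for the sketch, not values derived from the data above.

    # Sketch: classify a TLB usage pattern from DTLB miss rates normalized
    # to the smallest TLB size (values in [0, 1], one per TLB size step).
    def classify(curve, flat=0.9, knee=0.5):
        total_drop = curve[0] - curve[-1]
        if curve[-1] > flat:                  # Type 3: scaling barely helps
            return "Type 3 (streaming)"
        early_drop = curve[0] - curve[len(curve) // 2]
        if early_drop >= knee * total_drop:   # steady early gains: concave
            return "Type 1 (good reuse)"
        return "Type 2 (late knee)"           # gains arrive only once the
                                              # working set fits: convex

    # e.g. classify([1.0, 0.99, 0.99, 0.99, 0.988]) -> "Type 3 (streaming)"
    # (a Swim-like curve), while a Vortex-like curve would return Type 1.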
The ITLB miss rates, on the other hand, for all these applications exhibit a similar
response to increasing the TLB size, as seen from Figure 5-5B. Simply doubling the
TLB size from 64 entries to 128 entries reduces the ITLB miss rate of all the applications
by at least 40%. In the case of Vortex and Apsi, this reduction is as high as 90% and
80% respectively. Intuitively, while the instruction footprint of different applications may
vary, the behavior of the memory accesses for fetching instructions is similar across
applications. Thus, as far as the ITLB is concerned, all applications exhibit Type 1
behavior. Similarly, the DTLB and ITLB miss rates for dom0 also exhibit Type 1 behavior,
as both the code and data working set sizes on dom0, which are due to the backend
drivers, are small and show good reuse.
From these observations, it is clear that the benefit of awarding more TLB space
to an application or the penalty of withholding TLB space from an application is highly
dependent on the TLB usage pattern of the workload application.
5.5.2 Performance Improvement With Static TLB Usage Control
The idea behind improving the performance of workloads using TLB usage control
is to give a larger TLB space to those workloads which make better use of the awarded
space and to restrict the TLB space for those applications which do not make good
use of the TLB space. The TLB usage by different domains is controlled using the
TLB usage control policy for performance enhancement listed in Table 5-1. The usage
restrictions for each domain are specified as the maximum percentage fraction of the
CShare TLB that can be used by that domain. It should be noted that in this dissertation,
the notation X-Y-Z is used to represent a static TLB usage scheme where X%, Y% and
Z% of the entries in the TLB set are the usage restrictions, and therefore the SHARE
values, for dom0, dom1 and dom2 respectively. Since the usage control policy is static,
these usage control restrictions for the different domains are set at the beginning of the
experiment and are maintained constant throughout.
To demonstrate the benefit of TLB usage control in improving workload performance,
consolidated workloads TPCC-Vortex and TPCC-Lucas are run on a simulated
uniprocessor x86 virtualized platform with CShare TLBs of varying sizes and 8-way
associativity. dom1, which runs TPCC-UVa, is set to be the high priority domain and is
allowed to use 100% of the TLB space. The usage restrictions for the low priority dom0
and dom2 (running either Lucas or Vortex) are set to be either 20%, 40%, 60%, 80% or
100%. In addition to these usage control schemes, a completely uncontrolled scheme
where all domains are given equal priority and are allowed to use the entirety of the TLB
space is also investigated.
The DTLB and ITLB miss rates as well as the Impact Factor (IF ) from these
simulations are presented in Figure 5-6 and Figure 5-7. From these, it can be observed
that statically allocating a higher TLB space to TPCC-UVa and lower TLB space to
dom0 and dom2 has different effects on both the consolidated workloads. As far as
TPCC-Vortex is concerned, both the workload domains, as well as dom0 exhibit Type 1
TLB behavior. Thus, restricting the TLB usage of dom0 and dom2 results in an increase
of the DTLB miss rate as seen in Figure 5-6A. This increase is much higher at smaller
TLB sizes of 64 entries, as the TLB space is under high contention in this TLB size
range. However, on increasing the TLB size to 1024 entries, the change in DTLB miss
rate with varying usage restrictions for dom0 and dom2 becomes small. The important
point is that, under no static TLB usage control scheme is the DTLB miss rate smaller than under
[Figure 5-6 consists of four panels plotting misses per thousand instructions (MPKI) against the dom0 usage limit (20% to 100%) for 256-entry, 512-entry and 1024-entry TLBs, with one curve per dom2 usage limit (20% to 100%) and one curve for uncontrolled usage: A) DTLB miss rate for TPCC-Vortex, B) DTLB miss rate for TPCC-Lucas, C) ITLB miss rate for TPCC-Vortex, D) ITLB miss rate for TPCC-Lucas.]
Figure 5-6. Overall miss rate improvement for consolidated workload with static TLB usage control. Except for the curve marked uncontrolled usage, dom1 is set at high priority with 100% usage limit.
the uncontrolled usage scheme wherein each domain uses as much TLB space as it
needs by evicting the older entries belonging to other domains. Even when all domains
are allowed to use 100% of the TLB space, the effective replacement policy is not purely
LRU but LRU weighted with the priorities of the various domains. Thus the DTLB miss
rate at 100%-100%-100% is smaller than in the uncontrolled usage scenario. It is also
interesting to note that, at 512-entry and 1024-entry TLB sizes, increasing the usage limit
for dom0 while maintaining the limit for dom2, increases the miss rate. This is an artefact
of the usage control policy, especially Step 2b in the algorithm in Table 5-1.
A similar phenomenon of the uncontrolled usage resulting in lower miss rate than
any static usage control scheme is seen in the ITLB miss rates, as shown in Figures 5-6C and 5-6D,
since all ITLB trends exhibit Type 1 behavior. Thus, due to the trends in both the DTLB and ITLB miss rates, the IF is higher for the uncontrolled, unmanaged usage control policy
than for any other static reservation policy, as seen from Figure 5-7A. In fact, for 64-entry
TLB and 512-entry TLB, the IF , which is a measure of the TLB delay as explained in
Section 4.4.3, falls to as much as −100%, indicating that the TLB delay is doubled at
those usage control settings.
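For reference, the IF numbers quoted here behave as the fractional reduction in total TLB-induced delay relative to the baseline untagged TLB; restated in that form (the formal definition is the one given in Section 4.4.3):

    IF = 100 \times \frac{D_{untagged} - D_{CShare}}{D_{untagged}}

so that an IF of −100% corresponds to the CShare delay being twice the untagged delay, and a positive IF to that fraction of the TLB-induced delay being eliminated.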
The impact of usage control on TPCC-Lucas workload, on the other hand, is quite
different from the impact on TPCC-Vortex. As Lucas is a Type 3 streaming workload,
as far as the DTLB is concerned, withholding TLB space from it does not significantly
increase the TLB miss rate. Thus, at usage control schemes where the limit for dom2
is set to low values such as 20% and 40%, dom0 and dom1 benefit from this additional
TLB space and show a lower DTLB miss rate than the uncontrolled usage scheme
as seen from Figure 5-6B. The result of this behavior is reflected in the IF trends for
TPCC-Lucas, where setting a 20% restriction for Lucas increases the IF from
20% to 25% for a 512-entry TLB. The ITLB miss rate trends displayed in Figure 5-6D,
however, are the same as for TPCC-Vortex.
[Figure 5-7 consists of two panels plotting IF (%) against the dom0 usage limit (20% to 100%) for 256-entry, 512-entry and 1024-entry TLBs, with one curve per dom2 usage limit and one curve for uncontrolled usage: A) IF for TPCC-Vortex, B) IF for TPCC-Lucas.]
Figure 5-7. Overall performance improvement for consolidated workload with static TLB usage control. Except for the curve marked uncontrolled usage, dom1 is set at high priority with 100% usage limit.
From these simulations, the following conclusions can be drawn regarding the miss
rate and overall workload performance when the usage restrictions are statically set for
consolidated workloads.
• Independent of the type of workload, the ITLB with uncontrolled and unrestricted sharing performs better than any static usage control scheme. This suggests that, for maximum performance, the ITLB should not be managed using static usage control policies.
• The benefit of static usage control schemes depends on the composite applications which are consolidated in the workload. Specifically, restricting the usage for a Type 3 application to increase the space available for Type 1 applications results in a smaller DTLB miss rate and larger IF for the consolidated workload as a whole.
• For consolidated workloads such as TPCC-Vortex, where all domains exhibit Type 1 behavior, using priorities in the replacement policy will result in a lower miss rate, even if all domains are allowed to use the entire TLB space, compared to using pure LRU without any notion of usage control or priorities.
5.5.3 Selective Performance Improvement With Static TLB Usage Control
The previous section examined the effect of static TLB usage control on the
performance improvement of consolidated workloads. From Figure 5-7, it was evident
that, as far as the IF for the entire consolidated workload was concerned, static usage
control policies were beneficial only when one of the restricted domains was a TLB
insensitive streaming workload. However, the motivation behind TLB usage control could
be to improve the performance of one selected high priority workload domain and not
the entire consolidated workload. The use of the CShare architecture to achieve this is
explored in this section.
To examine this, the consolidated workloads TPCC-Vortex and TPCC-Lucas are
simulated on a 1-CPU x86 machine with a CShare TLB of varying sizes, with the V-SID
selection for performance enhancement algorithm shown in Table 5-1 used during TLB
misses. The same static usage control schemes explored in the previous section are
utilized here. In each of these schemes, except for the uncontrolled usage scheme,
dom0 and dom2 are set as the low priority domains while dom1 running TPCC-UVa is
set as the high priority domain. The per-domain IIPC values for the workloads are observed from
these simulations. The IIPC trends for TPCC-Vortex and TPCC-Lucas for 512-entry TLB
as well as 1024-entry TLB sizes are presented in Figure 5-8.
When the IIPC variation for dom0 is considered, there is a marked change in the IIPC
with the TLB usage limit imposed upon it. This trend in IIPC for various usage control
schemes is independent of the workload running on dom2, as dom0 mainly runs the
code for servicing TPCC-UVa’s I/O requests. When dom0’s usage is restricted to a
maximum of 20% of the total TLB space (20-100-20), the IIPC decreases to 0.83× its uncontrolled value for TPCC-Vortex and to 0.81× for TPCC-Lucas. Moreover, since the V-SID selection
algorithm is not geared for performance isolation, the impact of altering the usage
limitations on dom2 is reflected in the IIPC values of dom0, as seen from the reduction
[Figure 5-8 consists of four panels plotting the per-domain IIPC (%) for dom0, dom1 (TPCC) and dom2 (Vortex or Lucas) under usage control schemes ranging from No Control through 20-100-20 to 100-100-60: A) IIPC for TPCC-Vortex, 512-entry TLB; B) IIPC for TPCC-Lucas, 512-entry TLB; C) IIPC for TPCC-Vortex, 1024-entry TLB; D) IIPC for TPCC-Lucas, 1024-entry TLB.]
Figure 5-8. Selective performance improvement for consolidated workload with static TLB usage control with PW of 60 cycles. Except where marked as NoControl, dom1 (TPCC-UVa) is given higher priority while dom0 (backend drivers) and dom2 (Vortex and Lucas in the TPCC-Vortex and TPCC-Lucas consolidated workloads respectively) are set at lower priority.
in IIPC to 0.54× and 0.38× for control schemes 20-100-60 and 20-100-100 for
TPCC-Vortex.
The trend in the IIPC value for dom2, on the other hand, is highly dependent on
whether the workload is Vortex, which significantly reuses the cached TLB entries and
therefore is sensitive to changes in TLB size, or Lucas, which has low sensitivity to TLB
size due to the streaming nature of its memory access and little reuse of the cached TLB
entries. For instance, when Vortex is run on dom2, restricting the TLB space for Vortex
severely impacts the IIPC value. When the usage limit for dom2 is set at 20% as in usage
scheme 20 − 100 − 20, the IIPC attains a value of −8.4%, compared to the 5.1% for the
uncontrolled usage scenario. This indicates that, in spite of having the process-specific
tagging, the sheer lack of TLB space drives the performance of Vortex lower than the
performance in the case of an unshared TLB, and the effect of avoiding the TLB flushes
is nullified. In addition to the high priority dom1, when dom0 is also allowed to use the
entire TLB space (usage scheme 100-100-20), the reduction in IPC further worsens
and is almost 10% (IIPC is −10%). However, with a 60% usage limit for Vortex, the IIPC value recovers to 3.9%-3.7%. While Vortex's performance at this usage limit is
definitely less than with uncontrolled usage, it is higher than the performance that can
be obtained without CShare TLB. On the other hand, when Lucas runs as the workload
in dom2, the effect of depriving it of TLB space is markedly different from Vortex due
to its low sensitivity to TLB size. The lowest value of dom2's IIPC, occurring at usage
control scheme 100 − 100 − 20, is 0.34 compared to the IIPC of 0.47 without any usage
control. The important difference with Vortex is that, at no usage scheme does Lucas
exhibit a negative IIPC, indicating that the performance with the CShare TLB is higher than
the performance with regular TLB, even with a restricted TLB usage.
The behavior of the high priority TPCC-UVa workload on dom1 shows an interesting
trend in IIPC for different TLB usage control schemes, as seen from Figures 5-8A
and 5-8B. When run consolidated with Vortex, the IPC increases under any usage
scheme compared to the uncontrolled sharing scheme. The highest IIPC is seen
when the usage of both dom0 and dom2 is restricted to 20%. In this scheme, TPCC-UVa's IIPC increases by a factor of 1.4× compared to the uncontrolled usage scheme.
However, especially in the case of TPCC-Vortex, setting a usage control scheme of
20 − 100 − 20 proves extremely expensive on the performance of Vortex. Increasing
dom2’s usage limit to 60% reduces the penalty imposed on dom2’s performance while
ensuring that the IPC of TPCC-UVa on dom1 is still higher than uncontrolled sharing.
With TPCC-Lucas, on the other hand, TPCC-UVa's IIPC is actually smaller than under the uncontrolled sharing scheme when Lucas is allowed to use the entire TLB, due to the
streaming nature of Lucas’ memory access.
It can also be observed that the effect of usage control on the IIPC of dom1 diminishes significantly at the larger TLB size of 1024 entries for TPCC-Vortex, but is still pronounced for
TPCC-Lucas. At this TLB size, the working set size of both TPCC-UVa as well as Vortex
can be accommodated in the TLB and awarding a larger share of the TLB for dom1
does not pay significant dividends. On the other hand, even a 1024-entry TLB is not
sufficient to hold Lucas's working set when consolidated with TPCC-UVa. Restricting its TLB usage, even at a large TLB size of 1024 entries, improves the IIPC, and therefore the
performance, of TPCC-UVa.
From these simulations, it is observed that a usage control setting of 20 − 100 − 60
for TPCC-Vortex with a 512 entry TLB causes an IF of 62% for TPCC-UVa, implying
that 62% of the TLB-induced delay in TPCC-UVa can be eliminated by using the
CShare TLB. Similarly, for TPCC-UVa in TPCC-Lucas, an IF of 52% is observed for
512-entry TLB under this usage control scheme. These IFs translate to an increase in
TPCC-UVa’s IPC by about 3.5% at PW latency of 60 cycles and 16.5% at PW latency of
270 cycles.
From this analysis, the following observations can be deduced about controlled
TLB sharing using the CShare architecture for selective performance enhancement:
• The impact of usage control is pronounced as long as the TLB is insufficient to capture the working set of all the workloads which share it, i.e. when the TLB is a resource of contention.
• When the TLB behavior of the low-priority workload is dependent on the size of the TLB, as with Vortex, restricting its TLB usage reduces its IPC by a larger value than it increases the IPC of the high-priority application.
• When the low-priority application exhibits a streaming type of memory access, with low reuse of the cached TLB entries, limiting the TLB space for this application increases the IPC of the high-priority application by a larger value than the reduction in the low-priority application's IPC.
5.5.4 Performance Improvement With Dynamic TLB Usage Control
From Section 5.5.3, it is evident that the cost of selectively enhancing the performance
of a high priority workload, i.e. the reduction in the performance of the low priority
workload, depends on the nature of the workload. For workloads such as Lucas, the
cost is smaller than the increase in the high priority workload performance. However, for
TLB sensitive applications such as Vortex, the cost outstrips the performance benefit.
As a result, the overall performance of the consolidated workload reduces as seen from
Figure 5-7A.
However, the TLB usage of many TLB-sensitive applications has distinct phases: some where the pressure exerted on the TLB is quite high and others where
the TLB usage is low. Unlike a static usage control policy, as used in Section 5.5, a
dynamic usage control policy will be able to exploit these different phases by temporarily
allocating a larger share of the TLB for the low-priority application when it is in a high
TLB usage phase and restricting the TLB usage only in low TLB usage phases.
In order to implement such dynamic usage policies, a phase analyzer functionality
is added to the CShare TLB as shown in Figure 5-9. The phase analyzer architecture
consists of a bank of registers, similar to the performance monitoring units (PMUs).
These registers are used to track the miss rate of the TLB on a per-SID basis, in a
fashion similar to the PMUs for tracking cache statistics in current processors [28], as
shown in step 1. It also consists of a countdown timer which can be used to set the
[Figure 5-9 shows the CShare TLB augmented with a phase analyzer: per-SID TLB miss counters and a countdown timer feed the phase analyzer, which updates the SHARE values in the TLB Share Table (SID, SHARE, PRI, CNT); the Tag Manager Table (CR3, SID) and the VASI/CCR path are as in the base design.]
Figure 5-9. Dynamic TLB Usage Control with a Phase Analyzer. TLB misses are tracked as shown in step 1. When the phase analyzer functionality is invoked at programmed intervals, as shown in step 2, the miss rate over the past interval is calculated and used to adjust the SHARE value for the sharing classes, as shown in step 3.
frequency at which the phase analyzer functionality is invoked. This timer is set to the
desired value and is decremented on every clock tick. Once the timer reaches zero, and
the next capacity or forced flush occurs, the phase analyzer functionality is triggered.
The idea behind incorporating the phase analysis functionality as a part of the TLB flush
behavior is to avoid the gratuitous flushing of the TLB after reallocation.
On invocation, as shown in step 2 of Figure 5-9, the phase analyzer examines the
current usage of the TLB by calculating the TLB miss rate since the last invocation. It
then uses this miss rate and the past history of the miss rate to change the TLB usage limit of the low priority domain, as shown in step 3. For instance, if the trend in the miss
rate is increasing, the SHARE value of the low priority workload domain is increased
compared to the current allocation. If the miss rate of the current phase, however, is
lower than the previous phase, the usage of the low priority workload domain is further
restricted. To implement this functionality, the number of entries in the TST decides the number of registers in this bank.
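A minimal sketch of this phase-adaptive policy follows; the 20% adjustment step and the 20%-100% bounds on the low-priority domain's SHARE value are illustrative assumptions, not values prescribed by this dissertation.

class PhaseAnalyzer:
    """Sketch of the dynamic usage control loop of Figure 5-9.

    Invoked at the first capacity or forced flush after the countdown
    timer expires; compares the low-priority domain's miss rate over
    the past interval with that of the previous interval and adjusts
    its SHARE value accordingly.
    """

    def __init__(self, low_pri_sid, step=0.2, lo=0.2, hi=1.0):
        self.sid = low_pri_sid           # SID of the restricted domain
        self.step, self.lo, self.hi = step, lo, hi  # assumed tuning values
        self.prev_rate = None

    def on_flush(self, interval_misses, interval_instrs, share):
        rate = interval_misses / max(1, interval_instrs)
        if self.prev_rate is not None:
            if rate > self.prev_rate:
                # Rising miss rate: a high TLB-usage phase, so grant
                # the low-priority domain a temporarily larger share.
                share[self.sid] = min(self.hi, share[self.sid] + self.step)
            else:
                # Falling miss rate: a low-usage phase, restrict again.
                share[self.sid] = max(self.lo, share[self.sid] - self.step)
        self.prev_rate = rate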
[Figure 5-10 plots the IF (%) for dom0, dom1 and dom2 under the NoRes (no restriction), 100-100-20, 100-100-60 and Dynamic usage control schemes.]
Figure 5-10. Selective performance improvement for consolidated workload with TLB usage control for a 512-entry 8-way CShare TLB. Dynamically changing the TLB usage restrictions of the low-priority workload domain (dom2) significantly reduces the cost of selectively enhancing the performance of the high priority workload domain (dom1) and improves the overall performance of the consolidated workload.
In order to demonstrate the advantage of dynamic TLB usage control policies,
TPCC-Vortex is simulated using the same setup outlined in Section 5.5.3 with the
addition of the phase analyzer module. The countdown timer is programmed with a
value of five million cycles as this approximates the frequency of forced flushes for the
TPCC-Vortex workload. The per-domain and overall performance statistics are observed
for various CShare TLB sizes. From these observations, the IF for the dom0 as well as
the workload domains, for a 512-entry TLB, are presented in Figure 5-10.
From this figure, it can be clearly seen that dynamically managing the TLB usage
of Vortex running on dom2 significantly reduces the cost of selective performance
enhancement. For instance, at a static usage restriction of 100 − 100 − 20, where the
lower priority dom2 is restricted to use only 20% of the TLB while the higher priority
workload dom1 running TPCC-UVa as well as the driver domain dom0 are allowed to
use the entire TLB space, the IF of TPCC-UVa increases from 47% to 63%. However,
the cost of this increase is an IF of −110% for dom2. In other words, the delay due
to the TLB misses and page walks for dom2 when such a static restriction is used is
more than twice the delay of the untagged TLB. The benefit of using the tagged TLB,
which is a lowering of the TLB delay by 56% in the uncontrolled case, is more than
offset with static usage restrictions. Even at 60% usage restriction for dom2, the cost in
terms of the lowering of IF compared to the uncontrolled case is about 15%. However,
with dynamic control using the phase analyzer, the cost is reduced to 4% while the
benefit in terms of the IF for dom1 increases by 14% from the uncontrolled case.
These translate into IIPC values of 3.59% and 4.87% for dom0 and dom1 respectively,
about 1.3× and 0.96× the IIPC without explicit TLB usage controls. Moreover, while
not shown in the figure, the IF of the overall consolidated workload increases by
about 2%. Thus, with dynamic usage control it becomes possible to achieve selective
performance enhancement for TPCC-UVa running on dom1 without significantly
lowering the performance of the lower priority dom2.
5.6 Summary
In this chapter, the CShare TLB is proposed for enabling the sharing of the
TLB using process-specific tagging in a controlled manner. The TLB usage control
mechanism in the CShare TLB can be used for isolating the TLB performance of various
domains which share a TLB by explicitly reserving portions of the TLB for different
domains. Moreover, by statically partitioning the TLB space to restrict the TLB usage
for a low priority domain, the performance of the high priority domain can be increased.
This is accompanied by an increase in the overall consolidated workload performance
if the low priority domain being restricted exhibits a TLB-insensitive streaming usage
pattern. However, if the low priority domain is TLB sensitive, the cost of restricting its
TLB usage can be significant, even to the extent of reducing the overall performance
of the consolidated workload. This cost can be reduced by using dynamic TLB usage
control policies to restrict the TLB usage of the low priority domain only during phases
where the TLB usage is not high. Using such usage control, the performance increase
for a high priority workload domain achieved by using an uncontrolled process-specific
tagged TLB can be selectively increased by about 1.4×.
CHAPTER 6
CONCLUSION AND FUTURE WORK
Improving the performance of virtualized workloads and managing the sharing
of resources among the component applications of consolidated workloads are two
challenges in virtualization. Meeting these challenges, specifically in the context of
hardware-managed Translation Lookaside Buffers (TLBs), forms the theme of this
dissertation.
In order to understand the performance degradation caused by the high-frequency
TLB flushing on virtualized platforms and to investigate the impact of various schemes
that are proposed to reduce the TLB-induced delay, simulation frameworks supporting
detailed and customizable performance and timing models for the TLB are needed. To
address this issue, a full-system simulation framework supporting x86 ISA and TLB
models is developed, validated and used to experimentally evaluate the performance
implications of the TLB in virtualized environments. The tagged TLB model developed
in this work is designed to be generic enough to support the simulation of both
process-specific as well as VM-specific tagging. This is the only academic simulation
framework that provides a detailed timing model for the TLB and simulates the walking
of page tables on a TLB miss. Moreover, this framework is capable of simulating
multiprocessor multi-domain workloads, which makes it uniquely suitable for studying
virtualized platforms. Using this framework, the TLB behavior of I/O-intensive and
memory-intensive virtualized workloads is characterized and contrasted with their
non-virtualized equivalents. It is shown that, unlike non-virtualized single-O/S scenarios,
the adverse impact of the TLB on the workload performance is significant on virtualized
platforms. Using the developed simulation framework, it is shown that this performance
reduction for virtualized workloads is as much as 35% due to the TLB misses which are
caused by the repeated flushing of the TLB and the subsequent page walks to service
these misses.
This dissertation proposes a novel microarchitectural approach called the Tag
Manager Table (TMT) to reduce the TLB-induced performance delay for virtualized
workloads. The TMT approach involves tagging the TLB entries with tags that are
process-specific, thus associating them with the process which owns them. By tagging
the TLB entries, TLB flushes can be avoided during context switches. The TMT is
designed to generate and manage these tags in a software-transparent fashion
while ensuring low-latency of TLB lookups and imposing a small area overhead.
Using the simulation framework developed in this dissertation, it is found that using
process-specific tags reduces the TLB miss rate by about 65% to 90% which, depending
on the TLB miss penalty, translates into a 4.5% to 25% improvement in the performance
of the workloads. The architectural parameters and workload dependent factors that
influence the performance benefit of using the TMT are investigated and prioritized on
the basis of the significance of their influence.
Since the tags are generated at a process-level granularity and are not tied to
any virtualization-specific aspect, the TMT may be used to avoid TLB flushes in
non-virtualized scenarios as well. Moreover, the TMT may also be used to enable
TLB sharing across multiple per-core private TLBs using a hierarchical design with a
shared Last Level TLB (LLTLB), which reduces the TLB miss rate by 15% to 28% due to
a better utilization of the TLB space. The use of the Tag Manager Table in tagging I/O
TLBs is proposed and validated using a full-system simulation-based prototype.
The third part of this dissertation addresses the issue of usage control in the tagged
TLB which, because of the tagging, is shared amongst multiple processes. The CShare
TLB architecture is proposed to control the TLB sharing. The TLB usage of different
applications is analyzed and classified depending on how well they use the TLB space.
Based on this, the performance improvement due to the TMT without any explicit usage
controls is further increased by using the CShare architecture to provide a larger TLB
space to those applications which have a higher priority and to restrict the TLB usage
of TLB-insensitive applications. The use of dynamic TLB usage control policies to
provide this further performance improvement, even when the restricted workload is
TLB sensitive, is investigated. Using such control, the performance increase for a high
priority workload domain achieved by using an uncontrolled process-specific tagged
TLB can be selectively increased by about 1.4×. The use of the CShare architecture
in ensuring TLB performance isolation amongst domains which share the TLB is also
explored.
While the Tag Manager Table is motivated by the need to improve performance in
virtualized scenario, process-specific tagging of the TLB entries is key to enabling many
architectural features which are common on RISC architectures with software-managed
TLBs and which depend on the ability to associate TLB entries with the address space
for which they are valid. Using the TMT-generated process-specific tags creates these
associations in platforms with hardware-managed TLBs, like x86, and enables the
adoption of ideas such as coherent TLBs and virtual caches on these platforms. The
work presented in this dissertation forms the foundation for such future exploratory
research.
APPENDIX A
FULL FACTORIAL EXPERIMENT
A Full Factorial Experiment is an experimental technique to understand the effect
of various parameters on the output of a system. In such experiments, there are two or
more factors, each of which can take one of many discrete levels. These factors act as
the input to the system under test. One experiment is performed for each combination
of the factors. By examining the output for these different combinations of the factors, the
effect of the factors and their interactions on the response variable can be studied.
In a full factorial experiment, the response variable y_ijk for the k-th repetition of the experiment (out of a total of r repetitions), with factor A at the j-th of a possible levels and factor B at the i-th of b possible levels, is given by

    y_{ijk} = \mu + \alpha_j + \beta_i + \gamma_{ij} + e_{ijk}    (A–1)

Here µ is the mean value of the response variable, α_j the effect of factor A at level j, β_i the effect of factor B at level i, and γ_ij the effect of the interaction between A at level j and B at level i. e_ijk is the error term.
The observations from the full factorial experiment are arranged in a two-dimensional matrix of cells with b rows and a columns. The (i, j)-th cell contains the observations belonging to the r repetitions of the experiment with A and B at levels j and i respectively. Averaging the values in each cell, across columns, across rows and across all the observations produces

    \bar{y}_{ij.} = \mu + \alpha_j + \beta_i + \gamma_{ij}
    \bar{y}_{i..} = \mu + \beta_i
    \bar{y}_{.j.} = \mu + \alpha_j
    \bar{y}_{...} = \mu    (A–2)
From these equations, the effects can be calculated as

    \mu = \bar{y}_{...}
    \alpha_j = \bar{y}_{.j.} - \bar{y}_{...}
    \beta_i = \bar{y}_{i..} - \bar{y}_{...}
    \gamma_{ij} = \bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}
    e_{ijk} = y_{ijk} - \bar{y}_{ij.}    (A–3)
The variation of the output variable can be allocated among the two factors and their interaction by squaring both sides of Equation A–1 and assigning the different terms the notations shown in Equation A–4:

    \sum_{ijk} y_{ijk}^2 = abr\mu^2 + br\sum_{j}\alpha_j^2 + ar\sum_{i}\beta_i^2 + r\sum_{ij}\gamma_{ij}^2 + \sum_{ijk} e_{ijk}^2
    SSY = SS0 + SSA + SSB + SSAB + SSE    (A–4)
From these values, the percentage variation due to factors A and B, the interaction AB, as well as an unexplained part due to experimental errors, are calculated as shown in Equation A–5:

    SST = SSY - SS0 = SSA + SSB + SSAB + SSE
    \%Variation_A = 100 \times SSA / SST
    \%Variation_B = 100 \times SSB / SST
    \%Variation_{AB} = 100 \times SSAB / SST
    \%Variation_{Err} = 100 \times SSE / SST    (A–5)
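As a concrete illustration of Equations A–1 through A–5, the following Python/NumPy sketch (not part of the original experimental apparatus) computes the effects and the percentage variation allocation from a b × a × r array of observations:

import numpy as np

def full_factorial_effects(y):
    """Effects and variation allocation for a two-factor experiment.

    y has shape (b, a, r): b levels of factor B (rows), a levels of
    factor A (columns), r repetitions per cell, as in Equation A-1.
    """
    b, a, r = y.shape
    cell = y.mean(axis=2)        # cell means   \bar{y}_{ij.}
    row = cell.mean(axis=1)      # row means    \bar{y}_{i..}
    col = cell.mean(axis=0)      # column means \bar{y}_{.j.}
    mu = cell.mean()             # grand mean   \bar{y}_{...}

    alpha = col - mu                                 # factor A effects
    beta = row - mu                                  # factor B effects
    gamma = cell - row[:, None] - col[None, :] + mu  # interaction effects
    err = y - cell[:, :, None]                       # residuals e_{ijk}

    # Sums of squares from Equation A-4.
    ssa = b * r * np.sum(alpha ** 2)
    ssb = a * r * np.sum(beta ** 2)
    ssab = r * np.sum(gamma ** 2)
    sse = np.sum(err ** 2)
    sst = ssa + ssb + ssab + sse
    # Percentage variation from Equation A-5.
    pct = {name: 100 * ss / sst
           for name, ss in (("A", ssa), ("B", ssb),
                            ("AB", ssab), ("Err", sse))}
    return mu, alpha, beta, gamma, pct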
When the number of factors involved becomes large, as in Chapter 4, the significance of each factor can be estimated using statistical software such as SAS [116].
APPENDIX B
FULL FACTORIAL EXPERIMENTS USING THE SIMULATION FRAMEWORK
A typical form of simulation-based study is the parametric sweep. Such studies, similar to the experiments detailed in Section 4.4, consist of running a large number of long-running simulations with varying key parameters for each simulation run. Typically, such long-running simulation jobs are performed on dedicated cluster resources or on
such large running simulation jobs are performed on dedicated cluster resources or on
distributed grids. This appendix provides the details of setting up the simulation runs on
a typical cluster as well as on a wide area grid.
The dedicated cluster on which the simulations are run is the University of Florida
High Performance Computing Cluster [117]. The HPC consists of a centralized Linux
cluster, two large-scale shared file systems, and a dedicated high speed network.
To set up a parametric sweep in this environment, checkpoints are created using the
methods outlined in Section 3.3.4 and transferred to the $HOME directory of the user
in HPC. From here, a submission script is written for each simulation which specifies
the parameters such as the estimated time for the simulation, using the results from
Section 3.4. The script also contains commands which start the simulation in batch mode, configure the appropriate parameters such as the page walk latency, proceed with the simulation and archive the results on completion of the job.
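A minimal sketch of such a submission flow is shown below. It assumes a PBS-style batch scheduler (the qsub command) and a hypothetical run_simulation.sh wrapper that restores the checkpoint, applies the parameters and archives the results; the script name, its flags and the parameter values are illustrative only.

import itertools
import subprocess

# Hypothetical parameter sweep: page walk latencies and TLB sizes.
PW_LATENCIES = [60, 270]
TLB_SIZES = [256, 512, 1024]

for pw, tlb in itertools.product(PW_LATENCIES, TLB_SIZES):
    # One batch job per parameter combination; the walltime estimate
    # would come from the simulation time results of Section 3.4.
    job = (
        "#PBS -l walltime=48:00:00\n"
        "#PBS -l nodes=1:ppn=1\n"
        "cd $PBS_O_WORKDIR\n"
        f"./run_simulation.sh --checkpoint workload.ckpt "
        f"--pw-latency {pw} --tlb-size {tlb}\n"
    )
    # qsub accepts the job script on standard input.
    subprocess.run(["qsub"], input=job, text=True, check=True)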
To conduct large-scale simulation studies, the wide-area grid resources of Archer [25] are also used. Archer is an open infrastructure for simulation-based computer architecture
research. Archer consists of a few hundred cores, each with Simics installed in it,
connected through a wide area P2P network. It also has a cluster wide NFS which
facilitates the sharing of files on a node seamlessly throughout the cluster. Using this
infrastructure one or more nodes are populated with the checkpoints of the workloads.
Using this node as a repository for the checkpoints, many simulations are started off and
configured to run in parallel with different parameter values for each run.
APPENDIX C
USING THE TAG MANAGER TABLE FOR TAGGING I/O TLB
Power and performance considerations for high throughput computing platforms are
leading to a situation wherein simpler CPU cores are becoming the processor of choice
even for high throughput platforms. A case in point is the trend of the Intel Atom family
of processors being increasingly preferred, in spite of their lower processing capability,
in high throughput servers over power hungry but more capable processor variants
such as Xeon [118, 119]. To fill this gap in advanced and specialized functionalities, the
high throughput platforms with low-power processors need to either execute these
functionalities in software, on the main processor cores, or integrate specialized
hardware units or accelerators which offer these functionalities for offload. Various
power/performance tradeoffs dictate the latter as the approach of choice [118]. Even in
cases where more complex processor architectures are employed, there are significant
power savings to be obtained by employing specialized accelerators designed for
common compute intensive functions and offloading such functions from the complex
processor to these accelerators.
Traditional approaches for integrating such specialized accelerators and for
offloading jobs to them view the accelerator as a device and rely on a software
device driver for interfacing. This approach works well when the execution time on
the accelerator is an order of magnitude larger than the overheads incurred in offloading a task. However, for the case in point, i.e., high performance systems with very fine-grain
functionality offload, a generic interface specification that reduces performance
overheads and allows seamless portability of programs across platforms with varying
degrees of hardware support is needed [120–122]. Several approaches including
allowing the accelerator to operate in the application domain’s virtual memory space,
making applications offload aware and achieving tight integration between CPUs
and accelerators have been proposed. However, in order to allow the accelerator to
operate in the same address space as the process, the accelerator has to be aware
that the offloaded data is being specified by an address in the virtual address domain.
Moreover, the virtual address should be translated to the physical address before the
data can be accessed from memory. Thus, for performance considerations, an I/O TLB
is needed to cache the virtual to physical translations used by the accelerator. Since
multiple processes may offload jobs to the accelerator in an interleaved fashion, this
TLB should be capable of being shared by multiple processes' address spaces [120].
The Tag Manager Table may be used in this scenario. In this dissertation, one specific
accelerator interfacing scheme, Virtual Memory Accelerator (VMA) [120], is considered
and the use of the TMT in this VMA architecture is demonstrated.
C.1 Architecture of VMA
The two major objectives of VMA are 1. establishing a low-latency interface with
minimum software overheads for improved performance and 2. allowing user-mode
data offload for programmability and seamless portability of the application across
platforms with varying degrees of hardware support. VMA achieves this by allowing the
accelerators to work in the same address space domain as the processes which offload
to it and by providing an extended ISA for offloading the task to the accelerator. The
architecture of VMA, as shown in Figure C-1, has four components:
• Extended ISA for offloading: The offloading infrastructure consists of the mechanism in which the user application offloads a task to the accelerator. The information which has to be passed to the accelerator typically includes a source buffer with the data, a destination buffer to store the processed results and a command word which informs the accelerator on how the data should be processed. This is implemented by extending the ISA with two instructions, PUTTXN and GETTXN. The PUTTXN instruction provides a process an atomic method to send data and a command word to the accelerator. This instruction returns a unique transaction ID that the process can use to query the hardware for completion status. The GETTXN instruction provides a process with a method for querying the hardware for completion status for a given transaction.
• Virtual memory aware accelerators: Hardware accelerators can be made "virtual memory aware" by providing them with an application context at the time of offload, by including a "context ID" as a part of the offloaded functionality. This context id
is then provided by the accelerator as a part of every memory transaction that it issues, in order to identify the process address space in which it operates and to facilitate mapping from this address space to the physical memory space.
• IPMMU for I/O virtual to physical address translations: The IP (Intellectual Property) memory management unit (IPMMU) is provided in the interconnection fabric and offers address translation services to the accelerators so that they can execute in the virtual memory domain. This also allows the programs to access the accelerator functions directly from the user space and communicate using virtual memory addresses. When the accelerator tries to access application memory with a virtual address, the IPMMU will intercept the request and automatically translate the virtual address into the corresponding physical address. For address translation efficiency, the IPMMU contains a TLB to cache the recent address translations. This I/O TLB is similar in structure and organization to the core TLB, with the addition of a tag which identifies an entry in the TLB with the context of the application for whose address space the translation is valid.
• Page Fault Handling: Similar to page faults caused during the address translation on the core, memory accesses initiated by the accelerator and intercepted by the IPMMU may fail in the address translation. VMA implements a fault reporting mechanism which delivers this I/O page fault to the software stack running on the system and a fault handling mechanism consisting of software modules to handle these page faults.
C.2 Prototyping and Simulating the VMA Architecture
In order to model the hardware and software components of VMA, Virtutech
Simics, which has been discussed in detail in Section 3.2.1, is chosen as the simulation
framework for developing the VMA prototype. Using Simics, a platform consisting of
an Intel Xeon CPU with an X58 chipset and ICH10 Southbridge is simulated and 64-bit
Linux2.6.28 is booted on this platform. This platform, shown in Figure C-1, is used for
modeling and simulating the VMA prototype.
Extending the ISA with offload instructions
In order to simulate the PUTTXN and GETTXN instructions for enabling fine-grained
instruction based offload, the Magic Instruction capability of Simics is used. The magic
instruction, for x86 models, is the xchg bx, bx instruction. When this is executed by
the software stack running in the simulated platform, Simics stops the simulation and
[Figure C-1 shows, side by side, the VMA architecture (the application and O/S in software; the core, fabric, IPMMU, VMA accelerator and memory in hardware) and the Simics simulation framework (a Nehalem CPU on the FSB, an X58 northbridge with the IPMMU and RAM, a PCI bus hosting the accelerator, GFX and NIC, and an ICH10 southbridge with ISA devices).]
Figure C-1. Architecture and simulation-based prototype of VMA. The architecture of VMA consists of an extended ISA for offloading to the accelerator, accelerators which are virtual memory aware, an IPMMU to translate from virtual to physical addresses with a tagged TLB to cache these translations, and software handlers for IPMMU-generated page faults. These components are prototyped using the Simics full-system simulation framework.
surrenders control to a user-defined HAP script. This script may be used to examine
the architectural state of the suspended simulation and modify it, if necessary. Once the
actions specified in this script are completed, Simics resumes the simulation from the
point where it was stopped.
For the PUTTXN instruction, the appropriate arguments, such as the source and
destination buffer address are loaded into general purpose x86 registers. An instruction
identifier, which identifies that the magic instruction is used to simulate the PUTTXN
instruction, is also loaded into a register, following which the magic instruction is called.
The HAP script which is invoked on this magic instruction reads the instruction identifier
and simulates the PUTTXN instruction by copying the arguments from these registers
to the appropriate locations in the register bank of the simulated accelerator. The Tag
Manager Table is also queried and the VASI from the CCR of the CPU on which the
offloading application is executing is also provided to the accelerator as the context id.
This script also generates a transaction id and updates both the accelerator as well as
the general purpose register in which the source buffer address was specified with this
transaction id. The script also provides the Software Trigger to the accelerator to initiate
the offload. On resuming the simulation, the accelerator begins to process the offloaded
task by issuing PCI transactions for accessing the data from the source buffer. The
offloading application reads the transaction id from the general purpose register which
was populated by the HAP script.
The GETTXN instruction is simulated in a similar fashion using the magic instruction
by loading the identifier for the GETTXN instruction as well as the transaction id into
general purpose registers and then executing the magic instruction. The script invoked
on the execution of the magic instruction checks the completion bit in the hardware
accelerator and copies this value into the EAX register. On resuming the simulation, the
value which has been loaded in the EAX register is read by the user application to check
the completion status of the offloaded task.
It should be noted that the use of the general purpose registers is an artefact of
simulation. In reality, a location in memory can be used to offload the task and to read
the transaction id. The accelerator may be made aware of this memory location during
the boot-up initialization.
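As an illustration, a HAP script for simulating PUTTXN might look like the following Python sketch. The register conventions, the instruction identifier value, the accelerator object name and its attribute names are assumptions made here for illustration; only the hap registration and register access calls are standard Simics API.

# Hypothetical Simics HAP script emulating PUTTXN on the x86 magic
# instruction (xchg bx, bx). Register and attribute conventions below
# are illustrative assumptions.
from simics import (SIM_get_object, SIM_get_register_number,
                    SIM_hap_add_callback, SIM_read_register)

PUTTXN_ID = 1  # assumed identifier the application loads before the magic instruction

def on_magic(user_data, cpu, magic_value):
    def reg(name):
        return SIM_read_register(cpu, SIM_get_register_number(cpu, name))

    if reg("eax") != PUTTXN_ID:
        return  # some other use of the magic instruction
    accel = SIM_get_object("accelerator0")  # assumed device object name
    accel.src_buffer = reg("ebx")           # source buffer virtual address
    accel.dst_buffer = reg("ecx")           # destination buffer virtual address
    accel.context_id = reg("edi")           # VASI obtained via the TMT/CCR
    accel.software_trigger = 1              # initiate the offload

SIM_hap_add_callback("Core_Magic_Instruction", on_magic, None)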
Prototyping the virtual memory aware accelerator
The sample accelerator prototyped in this research is a PCI based image
processing accelerator with fine grain functionality offloads¹. A PCI based accelerator
¹ It should be noted that the fine granularity refers to the functionality that is offloaded and not to the granularity of the data size. One example of such fine-grained functionality is the SIMD extensions such as SSE and AVX, which operate on 128-bit wide and 256-bit wide data and perform fine-grained operations such as floating point arithmetic on these data.
is chosen due to the ease of modeling such devices and integrating the model with the
simulated machine in Simics.
Similar to most PCI Type 0 devices, the configuration space of the accelerator model
is implemented as a bank of registers which are programmed by the O/S during device
discovery and enumeration and can map up to six functional regions into the address
space of the CPU. The accelerator also implements a 4KB internal buffer used for the
internal computation of the accelerator which is not mapped into the processor address
space. The accelerator utilizes two of the six functional regions, FN0 and FN1. Each
of these functional regions consists of a bank of registers, which can be addressed as
Memory-mapped I/O (MMIO) addresses after device enumeration.
FN0 implements a simple Sum-of-Products (SOP) functionality. A SOP computation
can be offloaded to the accelerator by writing the address of the source buffer
which contains the elements of the row and column along with the dimension of the
row/column as well as the destination buffer to the appropriate registers in FN0. Once
these buffer addresses are provided, the computation of the SOP is initiated by writing
to the "Software Trigger" register in the FN0 register bank. On receiving the trigger, the
accelerator reads the contents from the source buffer using PCI-to-memory transactions,
computes the SOP and writes the result to the specified destination buffer. Then, it
sets a completion bit in its register bank. The completion of the offloaded task may be
notified to the software stack by either converting the setting of the completion bit to an
I/O interrupt or by polling the completion bit in this register bank at regular intervals. FN1
implements a pixel manipulation functionality. Given an image and the transformation
matrix, FN1 multiplies each of the pixels by the transformation matrix and writes the
transformed image into the specified destination buffer. A user application can offload
an image manipulation functionality to the accelerator by writing to FN1’s registers in a
manner similar to the FN0 offload. These functionalities are chosen as they are quite
important in image processing and are ideal candidates for acceleration [123].
The accelerator is made Virtual Memory Aware by providing the context information
(VASI tag) as a part of the offload. The accelerator then includes this context id as a part
of every PCI to memory transaction. In order to achieve this, the format of the PCI bus
TLP header is changed and the context information field is added to it. Moreover, by
incorporating the context id as a part of the PCI transaction, the accelerator is able to
support offloads from multiple user processes with different contexts and process these
in an interleaved and pipelined fashion.
Handling IPMMU-generated page faults
When the IPMMU walks the page tables to translate the virtual address belonging
to a particular process to its physical equivalent, this page walk may result in a page fault due to a mismatch in the access permissions for the page (Read/Write permissions or
User/Supervisor privileges) and the desired type of access. However, a more common
reason for page faults is the lack of a physical page corresponding to the virtual address
being accessed. For instance, in Linux, typical allocations of user space buffers are lazy
in nature (i.e.), the physical memory for the buffers are not assigned when the buffers
are created. When the program running on the core attempts to access the buffer, this
results in a demand page fault. The O/S page fault handler allocates the page and
updates the page table and then restarts the faulting instruction.
In the VMA architecture, since the accelerator also works in the same virtual
address space as the user application, the transactions it issues may also cause
such demand paging faults. In addition, swapping out of the physical pages
corresponding to a user-space buffer (due to memory limitations) before that buffer
can be accessed by the accelerator may also generate page faults. Whenever a page
fault is caused by the IPMMU walking the page tables, the cause of that page fault is
determined. If it is due to a mismatch in the permissions or privilege bits, this is treated
as an unrecoverable error and the PCI read/write transaction is terminated with an
explicit error indication, as mandated by the PCI standards [124]. The accelerator,
on such terminated PCI requests, waits for a certain retry period and then reissues
the transaction. This retry period can be effectively hidden by the accelerator issuing
memory requests of another offloaded task while it is waiting. After a certain number
of retries, if the PCI transaction cannot be completed, the accelerator terminates the
offloaded job by setting the completion bit and indicates the unsuccessful completion of
the task by setting an error bit.
For page faults caused by the lack of an entry in the page tables, the IPMMU
raises an interrupt using the IPMMU Fault Reporting Mechanism (FRM). The FRM
is similar to the VT-d fault reporting mechanism [125]. It consists of a bank of Fault
Recording Registers (FRR), as shown in Figure C-2, with each register having fields
for storing the faulting address and the process context in which the fault occurred. The
IPMMU populates one of these registers with the faulting information and raises an
interrupt. Then, it terminates the PCI transaction with an explicit error indication. The
IPMMU software fault handler catches the interrupt and verifies that the interrupt was
raised due to a page fault. It then reads the faulting address and context from the Fault
Recording Register, allocates physical memory and maps the faulting virtual address
to the allocated memory by updating the page tables. The IPMMU fault handler then
clears the Fault Recording Register and terminates. Subsequently, when the faulting
PCI transaction is reissued by the accelerator, the page walk results in a successful
translation of the virtual to physical address and the transaction successfully completes.
Simulating the IPMMU and the I/O TLB
The IPMMU is implemented on the Simics simulated platform in the Northbridge,
as shown in Figure C-2. It is designed to intercept all traffic between the accelerators
(I/O devices) and memory, in order to provide translation for requests from VM aware
accelerators. On intercepting a PCI-to-memory transaction, the IPMMU examines the
context id field of the TLP header. The presence of a non-zero context id indicates that
the device which originated the transaction is a VMA device and the target address
specified in the transaction is a virtual address.
[Figure C-2 shows the IPMMU implemented in the northbridge between the FSB and the PCI bus; the IPMMU contains its own TLB, Fault Recording Registers (FRR) and page walk (PW) functionality, alongside the CPU with its core TLB and the RAM.]
Figure C-2. IPMMU and I/O TLB
Using the supplied context id, the IPMMU first checks to see if the translation
is cached in the IPMMU Translation Lookaside Buffer (I/O TLB). This IPMMU TLB
is a tagged TLB, similar to the architecture described in Section 4.2. Every entry is
annotated with the context id of the offloading user application. Such a tagged design
allows the translations of multiple processes to coexist in the TLB and allows the IPMMU
to handle translation requests from multiple user applications in an interleaved fashion. If
the required virtual to physical translation for the context id of the PCI transaction being
currently processed is not found in the IPMMU TLB, the IPMMU initiates an address
translation process by walking the O/S page tables of the user application. Once the
page walk is completed, and the physical address corresponding to the virtual address
is obtained, the IPMMU reprograms the PCI transaction with this physical address and
allows it to access the data from that physical address. This translation is also added
to the TLB and tagged with the context id of the offloading user application to which it
belongs.
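The lookup path just described can be summarized by the following sketch; the function and parameter names are illustrative, and page_walk stands in for the IPMMU's page table walk (which raises an I/O page fault through the FRM when no mapping exists, as described in Section C.2).

def ipmmu_translate(context_id, vaddr, io_tlb, page_walk):
    """Translate a VMA transaction's virtual address in the IPMMU.

    io_tlb maps (context_id, virtual page number) to a physical page
    number, mirroring the tagged I/O TLB; names are illustrative.
    """
    PAGE = 4096
    vpn, offset = divmod(vaddr, PAGE)
    ppn = io_tlb.get((context_id, vpn))
    if ppn is None:
        # Miss: walk the owning process's page tables (this may raise
        # an IPMMU page fault) and cache the tagged translation.
        ppn = page_walk(context_id, vpn)
        io_tlb[(context_id, vpn)] = ppn
    return ppn * PAGE + offset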
C.3 Using the Tag Manager Table in VMA Architecture
Since multiple processes may offload tasks to the multiple accelerators, and since
these may be executed in an interleaved fashion, the IPMMU will have to perform
address translations for multiple address spaces in an interleaved fashion. Given this, it
is imperative to tag the I/O TLB entries and thereby ensure that multiple process entries
may be cached concurrently. The Tag Manager Table may be used for generating this
process-specific tag.
In this work, the I/O TLB is designed to have a separate TMT. The CR3 value of
the offloading process itself is used as the context id. The TMT establishes unique
CR3-to-VASI mappings and uses these VASIs to tag the TLB entries. The IPMMU,
on intercepting a memory access from the accelerator, uses the TMT to get the
CR3-to-VASI mapping and looks up the tagged TLB using this VASI for the required
translation. If this translation is not present, the page walk is performed and the
computed translation is annotated with the VASI and cached in the TLB. A simple
TLB synchronization scheme, wherein every core TLB flush also flushes the I/O TLB, is
used. However, it is also possible to use a core tagged TLB and a global Tag Manager
Table, as in Section 4.8, and to have the same process-to-VASI mapping used in both
the core and I/O TLB. In such a design, the I/O TLB, similar to the core TLB, will be
flushed only during capacity flushes and forced flushes. Thus, the number of I/O TLB
flushes may be significantly reduced. In addition, if the context id being used is not
the CR3 value of the offloading process, a CR3-to-context id mapping should also be
maintained for every offloading process.
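A minimal sketch of the CR3-to-VASI mapping maintained by this separate TMT is shown below; the table size and the recycling policy for a full table are simplifying assumptions (the actual TMT replacement policy is described in Chapter 4).

class TagManagerTable:
    """Sketch of the I/O TLB's CR3-to-VASI mapping (Section C.3)."""

    def __init__(self, entries=8):
        self.free = list(range(entries))  # VASI values not in use
        self.map = {}                     # CR3 -> VASI

    def vasi_for(self, cr3):
        if cr3 not in self.map:
            if not self.free:
                # Table full: recycle some victim mapping; the I/O TLB
                # entries tagged with the recycled VASI must then be
                # invalidated.
                _victim_cr3, vasi = self.map.popitem()
                self.free.append(vasi)
            self.map[cr3] = self.free.pop()
        return self.map[cr3]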
A Lena before conversion using VMA accelerator
B Lena after conversion using VMA accelerator
Figure C-3. Functional validation of the use of TMT in VMA
C.4 Functional Verification of the Use of TMT in VMA
In order to verify the working of the VMA architecture in conjunction with the Tag
Manager Table, a simple image-manipulation test application is created. This application
reads in an image from a file, allocates source and destination buffers and populates the
source buffer with the pixels from the image. It should be noted that these buffers are
created using lazy memory allocation. Since the image is read into the source buffer,
demand paging and the conventional O/S page fault handler takes care of allocating
physical memory for the source buffer. On the other hand, since the destination buffer
is not accessed by the user application before offload, there is no physical memory
allocated for this buffer. This application offloads the pixels of the image, along with a
transformation matrix for converting the image to grayscale, to the accelerator by writing
the source and destination buffers to the registers of FN1 using the PUTTXN instruction.
After this it spins in a loop polling for the completion of the offload using the GETTXN
instruction. It should be noted that the data granularity of the offload is fixed at 4KB,
resulting in the application offloading the pixels on a page-by-page basis until all the pixels
are converted to grayscale.
The image chosen for this simulation was a 512 × 512 sized version of the standard image "Lena" [126]. Converting this image to a 32 bits per pixel representation resulted
in a source buffer size of 1MB. Since no compression was used to store the grayscale
output, the destination buffer was also 1MB. Dictated by the 4KB size of the offload data
granularity, the source buffer was offloaded in 4KB chunks resulting in 256 offloads to
the hardware accelerator. Since the destination buffer was created using lazy memory
allocation, the very first PCI write to the destination buffer on each of these 256 offloads
caused an IPMMU page fault. Each of these faults raised interrupts which were caught
and handled by the IPMMU page fault handler. It was also observed that a maximum
of three retries with a 10s retry period was sufficient to ensure that the IPMMU page
fault was serviced and the PCI write transaction successfully completed. Moreover,
for this simulation, a 99.90% hit rate in the IPMMU TLB was observed. The original
and converted images are shown in Figure C-3. This validates the working of the VMA
architecture with the TMT.
C.5 Summary
While the majority of this dissertation investigates the use of the Tag Manager
Table for improving the performance of virtualized workloads, the TMT is a generic
tagging framework that uses process-specific tags and can be used for non-virtualized
scenarios as well. This appendix proposes the use of the TMT for tagging I/O TLBs
in non-virtualized platforms. Specifically, the incorporation of the TMT as a tagging
framework in Virtual Memory Accelerators, an architecture involving I/O accelerators
operating in the virtual address domain, with an IPMMU and an I/O TLB providing the
virtual-to-physical translations, is examined. Using a simulation-based prototype of VMA, the
proposed use of the TMT is functionally validated.
REFERENCES
[1] R. Miller. (2010, April) Facebook Now Has 30,000 Servers. [Online]. Available: http://www.datacenterknowledge.com/archives/2009/10/13/facebook-now-has-30000-servers/
[2] Avanade. (2010, April) Global Survey of Cloud Computing. [Online]. Available: http://www.avanade.com/Documents/Research%20and%20Insights/fy10cloudcomputingexecutivesummaryfinal314006.pdf
[3] K. Olukotun et al., “The case for a single-chip multiprocessor,” SIGPLAN Notices, vol. 31, no. 9, pp. 2–11, 1996.
[4] Intel Corporation. (2010, April) First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem). [Online]. Available: http://www.intel.com/technology/architecture-silicon/next-gen/whitepaper.pdf
[5] M. R. Marty and M. D. Hill, “Virtual hierarchies to support server consolidation,” SIGARCH Computer Architecture News, vol. 35, no. 2, pp. 46–56, 2007.
[6] M. F. Mergen et al., “Virtualization for high-performance computing,” SIGOPS Operating Systems Review, vol. 40, pp. 8–11, 2006.
[7] L. Youseff et al., “Paravirtualization effect on single- and multi-threaded memory-intensive linear algebra software,” Cluster Computing, vol. 12, pp. 101–122, 2009.
[8] Gartner. (2010, April) Gartner Says Worldwide Hosted Virtual Desktop Market to Surpass $65 Billion in 2013. [Online]. Available: http://www.gartner.com/it/page.jsp?id=920814
[9] ——. (2010, April) Gartner Says 20 Percent of Commercial E-Mail Market Will Be Using a SaaS Platform By the End of 2012. [Online]. Available: http://www.gartner.com/it/page.jsp?id=931215
[10] J. Lange et al., “Palacios and Kitten: New High Performance Operating Systems for Scalable Virtualized and Native Supercomputing,” in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, 2010, pp. 1–12.
[11] J. Smith and R. Nair, Virtual Machines: Versatile Platforms for Systems and Processes. Morgan Kaufmann Publishers Inc., 2005.
[12] R. Goldberg, “Survey of Virtual Machine Research,” Computer, vol. 7, no. 6, pp. 34–45, 1974.
[13] G. Amdahl, G. Blaauw, and F. Brooks, “Architecture of IBM System/360,” IBM Journal of Research and Development, vol. 8, no. 2, pp. 87–101, 1964.
[14] U. Drepper, “The Cost of Virtualization,” ACM Queue, vol. 6, no. 1, pp. 28–35, 2008.
[15] Gartner. (2010, April) Market Share: x86 Virtualization Market, Worldwide, 2008. [Online]. Available: http://www.gartner.com/it/page.jsp?id=1211813
[16] I. Kadayif et al., “Optimizing instruction TLB energy using software and hardware techniques,” ACM Transactions on Design Automation of Electronic Systems, vol. 10, no. 2, pp. 229–257, 2005.
[17] C. McCurdy, A. L. Cox, and J. Vetter, “Investigating the TLB Behavior of High-end Scientific Applications on Commodity Microprocessors,” in Proc. IEEE International Symposium on Performance Analysis of Systems and Software, 2008, pp. 95–104.
[18] O. Tickoo et al., “qTLB: Looking inside the Look-aside buffer,” in Proc. The 14th International Conference on High Performance Computing, 2007, pp. 107–118.
[19] A. Bhattacharjee, D. Lustig, and M. Martonosi, “Shared last-level TLBs for chip multiprocessors,” in Proc. The 17th International Symposium on High Performance Computer Architecture, 2011, pp. 359–370.
[20] D. Chisnall, The Definitive Guide to the Xen Hypervisor (Prentice Hall Open Source Software Development Series). Prentice Hall PTR, 2007.
[21] VMware Inc. (2010, April) VMware Virtual Desktop Infrastructure (VDI) datasheet. [Online]. Available: http://www.vmware.com/files/pdf/vdi_datasheet.pdf
[22] I. Krsul et al., “VMPlants: Providing and Managing Virtual Machine Execution Environments for Grid Computing,” in Proc. The 2004 ACM/IEEE conference on Supercomputing, 2004, p. 7.
[23] A. Weiss, “Computing in the clouds,” netWorker, vol. 11, no. 4, pp. 16–25, 2007.
[24] R. Figueiredo, P. Dinda, and J. Fortes, “Guest Editors’ Introduction: Resource Virtualization Renaissance,” Computer, vol. 38, no. 5, pp. 28–31, 2005.
[25] R. J. O. Figueiredo et al., “Archer: A Community Distributed Computing Infrastructure for Computer Architecture Research and Education,” Collaborative Computing: Networking, Applications and Worksharing, vol. 10, no. 2, pp. 181–192, 2009.
[26] SPARC International, Inc, The SPARC Architecture Manual Version 9. PTR Prentice Hall, 1993.
[27] Compaq Computer Corporation, ALPHA Architecture Reference Manual. Compaq Computer Corporation, 2002.
[28] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer’s Manuals. Intel Corporation, 2010.
[29] B. Jacob and T. Mudge, “Virtual memory in contemporary microprocessors,” IEEE Micro, vol. 18, no. 4, pp. 60–75, 1998.
[30] ——, “A look at several memory management units, TLB-refill mechanisms, and page table organizations,” SIGOPS Operating Systems Review, vol. 32, no. 5, pp. 295–306, 1998.
[31] B. Jacob. (2010, April) Virtual Memory Systems and TLB Structures. [Online]. Available: http://www.ece.umd.edu/~blj/papers/CEH-chapter.pdf
[32] C. A. Waldspurger, “Memory resource management in VMware ESX server,” SIGOPS Operating Systems Review, vol. 36, no. SI, pp. 181–194, 2002.
[33] R. A. MacKinnon, “The changing virtual machine environment: Interfaces to real hardware, virtual hardware, and other virtual machines,” IBM Systems Journal, vol. 18, no. 1, pp. 18–46, 1979.
[34] L. H. Seawright and R. A. MacKinnon, “VM/370: a study of multiplicity and usefulness,” IBM Systems Journal, vol. 18, no. 1, pp. 4–17, 1979.
[35] P. Barham et al., “Xen and the art of virtualization,” in Proc. The nineteenth ACM symposium on Operating systems principles, 2003, pp. 164–177.
[36] Advanced Micro Devices. (2010, April) AMD-V Nested Paging. [Online]. Available: http://developer.amd.com/assets/NPT-WP-1%201-final-TM.pdf
[37] G. Neiger et al., “Intel Virtualization Technology: Hardware Support for Efficient Processor Virtualization,” Intel Technology Journal, vol. 10, no. 3, pp. 167–178, 2006.
[38] N. Jerger, D. Vantrease, and M. Lipasti, “An Evaluation of Server Consolidation Workloads for Multi-Core Designs,” in Proc. 10th International Symposium on Workload Characterization, 2007, pp. 47–56.
[39] L. Cherkasova, D. Gupta, and A. Vahdat, “Comparison of the three CPU schedulers in Xen,” SIGMETRICS Performance Evaluation Review, vol. 35, no. 2, pp. 42–51, 2007.
[40] D. Gupta et al., “Enforcing performance isolation across virtual machines in Xen,” in Proc. The ACM/IFIP/USENIX 2006 International Conference on Middleware, 2006, pp. 342–362.
[41] J. R. Santos et al., “Bridging the gap between software and hardware techniques for I/O virtualization,” in Proc. USENIX 2008 Annual Technical Conference, 2008, pp. 29–42.
[42] W. Huang et al., “A case for high performance computing with virtual machines,” in Proc. The 20th annual international conference on Supercomputing, 2006, pp. 125–134.
[43] L. Cherkasova and R. Gardner, “Measuring CPU overhead for I/O processing in the Xen virtual machine monitor,” in Proc. USENIX Annual Technical Conference, 2005, pp. 24–24.
[44] A. Menon et al., “Diagnosing performance overheads in the Xen virtual machine environment,” in Proc. The 1st ACM/USENIX international conference on Virtual execution environments, 2005, pp. 13–23.
[45] S. Thibault and T. Deegan, “Improving performance by embedding HPC applications in lightweight Xen domains,” in Proc. The 2nd workshop on System-level virtualization for high performance computing, ser. HPCVirt ’08, 2008, pp. 9–15.
[46] R. Uhlig et al., “Intel Virtualization Technology,” Computer, vol. 38, no. 5, pp. 48–56, 2005.
[47] D. Abramson et al., “Intel Virtualization Technology for Directed I/O,” Intel Technology Journal, vol. 10, no. 3, pp. 179–192, 2006.
[48] Advanced Micro Devices, AMD Secure Virtual Machine Architecture Reference Manual. Advanced Micro Devices, 2010.
[49] G. B. Kandiraju and A. Sivasubramaniam, “Going the distance for TLB prefetching: an application-driven study,” in Proc. The 29th annual international symposium on Computer architecture, 2002, pp. 195–206.
[50] A. Bhattacharjee and M. Martonosi, “Inter-Core cooperative TLB prefetchers for chip multiprocessors,” in Proc. The 15th international conference on Architectural support for programming languages and operating systems, 2010, pp. 359–370.
[51] ——, “Characterizing the TLB Behavior of Emerging Parallel Workloads on Chip Multiprocessors,” in Proc. International Conference on Parallel Architectures and Compilation Techniques, 2009, pp. 29–40.
[52] V. Chadha et al., “I/O processing in a virtualized platform: a simulation-driven approach,” in Proc. The 3rd international conference on Virtual execution environments, 2007, pp. 116–125.
[53] V. Chadha, “Provisioning wide-area virtual environments through I/O interposition: The redirect-on-write file system and characterization of I/O overheads in a virtualized platform,” Ph.D. dissertation, University of Florida, 2008.
[54] R. Uhlig et al., “SoftSDV: A Presilicon Software Development Environment for the IA-64 Architecture,” Intel Technology Journal, vol. 3, no. 4, pp. 1–14, 1999.
[55] M. Ekman, P. Stenstrom, and F. Dahlgren, “TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors,” in Proc. The 2002 international symposium on Low power electronics and design, 2002, pp. 243–246.
[56] S. Manne et al., “Low Power TLB Design for High Performance Microprocessors,” University of Colorado at Boulder, CO, Tech. Rep. CU-CS-834-97, 1997.
[57] J.-H. Lee et al., “A banked-promotion translation lookaside buffer system,” Journal of Systems Architecture, vol. 47, no. 14-15, pp. 1065–1078, 2002.
[58] A. Ballesil, L. Alarilla, and L. Alarcon, “A Study of Power Trade-offs in Translation Lookaside Buffer Structures,” in Proc. 2006 IEEE Region 10 Conference, 2006, pp. 1–4.
[59] L. T. Clark, B. Choi, and M. Wilkerson, “Reducing translation lookaside buffer active power,” in Proc. The 2003 international symposium on Low power electronics and design, 2003, pp. 10–13.
[60] R. Jeyapaul, S. Marathe, and A. Shrivastava, “Code Transformations for TLB Power Reduction,” in Proc. The 22nd International Conference on VLSI Design, 2009, pp. 413–418.
[61] T. Austin, E. Larson, and D. Ernst, “SimpleScalar: an infrastructure for computer system modeling,” Computer, vol. 35, no. 2, pp. 59–67, 2002.
[62] R. Bhargava et al., “Accelerating two-dimensional page walks for virtualized systems,” in Proc. The 13th international conference on Architectural support for programming languages and operating systems, 2008, pp. 26–35.
[63] G. Loh, S. Subramaniam, and Y. Xie, “Zesto: A cycle-level simulator for highly detailed microarchitecture exploration,” in Proc. IEEE International Symposium on Performance Analysis of Systems and Software, 2009, pp. 53–64.
[64] M. Yourst, “PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator,” in Proc. IEEE International Symposium on Performance Analysis of Systems and Software, 2007, pp. 23–34.
[65] M. Rosenblum et al., “Using the SimOS machine simulator to study complex computer systems,” ACM Transactions on Modeling and Computer Simulation, vol. 7, no. 1, pp. 78–103, 1997.
[66] N. L. Binkert et al., “The M5 Simulator: Modeling Networked Systems,” IEEE Micro, vol. 26, no. 4, pp. 52–60, 2006.
[67] P. S. Magnusson et al., “Simics: A full system simulation platform,” Computer, vol. 35, no. 2, pp. 50–58, 2002.
[68] M. M. K. Martin et al., “Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset,” SIGARCH Computer Architecture News, vol. 33, no. 4, pp. 92–99, 2005.
[69] N. Neelakantam. (2010, April) FeS2: A Full-system Execution-driven Simulator for x86. [Online]. Available: http://fes2.cs.uiuc.edu/
[70] E. Argollo et al., “COTSon: infrastructure for full system simulation,” SIGOPS Operating Systems Review, vol. 43, no. 1, pp. 52–61, 2009.
[71] Advanced Micro Devices Inc, SimNow Simulator Users Manual. Advanced Micro Devices Inc, 2009.
[72] L. Baugh, N. Neelakantam, and C. Zilles, “Using Hardware Memory Protection to Build a High-Performance, Strongly-Atomic Hybrid Transactional Memory,” SIGARCH Computer Architecture News, vol. 36, no. 3, pp. 115–126, 2008.
[73] Virtutech Inc, Simics Reference Manual. Virtutech Inc, 2007.
[74] S. T. Jones, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, “Antfarm: tracking processes in a virtual machine environment,” in Proc. USENIX ’06 Annual Technical Conference, 2006, pp. 1–1.
[75] CPU RightMark. (2010, April) RightMark Memory Analyzer. [Online]. Available: http://cpu.rightmark.org/products/rmma.shtml
[76] V. Makhija et al., “VMmark: A Scalable Benchmark for Virtualized Systems,” VMware Inc, CA, Tech. Rep. VMware-TR-2006-002, September 2006.
[77] D. R. Llanos, “TPCC-UVa: an open-source TPC-C implementation for global performance measurement of computer systems,” SIGMOD Record, vol. 35, no. 4, pp. 6–15, 2006.
[78] A. Tridge. (2010, April) dbench benchmark. [Online]. Available: http://samba.org/ftp/tridge/dbench/
[79] M. Karlsson et al., “Memory System Behavior of Java-Based Middleware,” in Proceedings of the 9th International Symposium on High-Performance Computer Architecture, 2003, pp. 217–228.
[80] Y. Shuf et al., “Characterizing the memory behavior of Java workloads: a structured view and opportunities for optimizations,” in Proc. The 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, 2001, pp. 194–205.
[81] A. Adamson, D. Dagastine, and S. Sarne, “SPECjbb2005 - A Year in the Life of a Benchmark,” in Proc. The 2007 SPEC Benchmark Workshop, 2007.
[82] Standard Performance Evaluation Corporation. (2010, April) 255.vortex SPEC CPU2000 Benchmark Description File. [Online]. Available: http://www.spec.org/cpu2000/CINT2000/255.vortex/docs/255.vortex.html
[83] A. Georges, L. Eeckhout, and K. D. Bosschere, “Comparing Low-Level Behavior of SPEC CPU and Java Workloads,” Advances in Computer Systems Architecture, vol. 3740, pp. 669–679, 2005.
[84] S. Dague, D. Stekloff, and R. Sailer. (2010, April) xm(1) - Linux man page. [Online]. Available: http://linux.die.net/man/1/xm
[85] N. Andersson. (2010, April) The Maui Scheduler. [Online]. Available: http://www.nsc.liu.se/systems/retiredsystems/grendel/maui.html
[86] G. Staples, “Torque resource manager,” in Proc. The 2006 ACM/IEEE conference on Supercomputing, 2006.
[87] J. Warner. (2010, April) top(1) - Linux man page. [Online]. Available: http://linux.die.net/man/1/top
[88] A. Cahalan. (2010, April) pmap(1) - Linux man page. [Online]. Available: http://linux.die.net/man/1/pmap
[89] Silicon Graphics, Inc, MIPS R4000 Microprocessor User’s Manual. PTR Prentice Hall, 1993.
[90] X. Zhang et al., “A hash-TLB approach for MMU virtualization in xen/IA64,” in Proc. IEEE International Symposium on Parallel and Distributed Processing, 2008, pp. 1–8.
[91] Motorola Inc, PowerPC 601 RISC Microprocessor User’s Manual. Motorola Inc, 2002.
[92] J. Liedtke, “Improved Address-Space Switching on Pentium Processors by Transparently Multiplexing User Address Spaces,” German National Research Center for Information Technology, Tech. Rep. 993, 1995.
[93] V. Uhlig et al., “Performance of address-space multiplexing on the Pentium,” University of Karlsruhe, Tech. Rep. 2002-1, 2002.
[94] S. Biemeuller. (2010, April) ASID Management in Xen AMD-V. [Online]. Available: http://xen.xensource.com/xensummit/xensummit_spring_2007.html
[95] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach. Morgan Kaufmann Publishers Inc., 2002.
[96] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley, 1991.
[97] R. Min et al., “Partial tag comparison: a new technology for power-efficient set-associative cache designs,” in Proc. 17th International Conference on VLSI Design, 2004, pp. 183–188.
[98] A. Jaleel, M. Mattina, and B. Jacob, “Last level cache (LLC) performance of data mining workloads on a CMP - a case study of parallel bioinformatics workloads,” in Proc. The Twelfth International Symposium on High-Performance Computer Architecture, 2006, pp. 88–98.
[99] L. Zhao et al., “Towards hybrid last level caches for chip-multiprocessors,” SIGARCH Computer Architecture News, vol. 36, pp. 56–63, 2008.
[100] K. B. Ferreira, P. Bridges, and R. Brightwell, “Characterizing application sensitivity to OS interference using kernel-level noise injection,” in Proc. The 2008 ACM/IEEE conference on Supercomputing, 2008, pp. 19:1–19:12.
[101] R. Gioiosa, S. A. McKee, and M. Valero, “Designing OS for HPC Applications: Scheduling,” in Proc. IEEE International Conference on Cluster Computing, 2010, pp. 78–87.
[102] R. Iyer et al., “Datacenter-on-chip architectures: Tera-scale opportunities and challenges in Intel’s manufacturing environment,” Intel Technology Journal, vol. 11, no. 3, pp. 227–237, 2007.
[103] S. Kim, D. Chandra, and Y. Solihin, “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” in Proc. The 13th International Conference on Parallel Architectures and Compilation Techniques, 2004, pp. 111–122.
[104] R. Iyer et al., “QoS policies and architecture for cache/memory in CMP platforms,” SIGMETRICS Performance Evaluation Review, vol. 35, no. 1, pp. 25–36, 2007.
[105] L. R. Hsu et al., “Communist, utilitarian, and capitalist cache policies on CMPs: caches as a shared resource,” in Proc. The 15th international conference on Parallel architectures and compilation techniques, 2006, pp. 13–22.
[106] J. Chang and G. S. Sohi, “Cooperative cache partitioning for chip multiprocessors,” in Proc. The 21st annual international conference on Supercomputing, 2007, pp. 242–252.
[107] M. K. Qureshi and Y. N. Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” in Proc. The 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, pp. 423–432.
[108] S. Srikantaiah, M. Kandemir, and M. J. Irwin, “Adaptive set pinning: managing shared caches in chip multiprocessors,” in Proc. The 13th international conference on Architectural support for programming languages and operating systems, 2008, pp. 135–144.
[109] N. Rafique, W.-T. Lim, and M. Thottethodi, “Architectural support for operating system-driven CMP cache management,” in Proc. The 15th international conference on Parallel architectures and compilation techniques, 2006, pp. 2–12.
[110] B. M. Beckmann, M. R. Marty, and D. A. Wood, “ASR: Adaptive Selective Replication for CMP Caches,” in Proc. The 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, pp. 443–454.
[111] J. Lee, C. Park, and S. Ha, “Memory access pattern analysis and stream cache design for multimedia applications,” in Proceedings of the 2003 Asia and South Pacific Design Automation Conference, ser. ASP-DAC ’03, 2003, pp. 22–27.
[112] Standard Performance Evaluation Corporation. (2010, April) 301.apsi SPEC CPU2000 Benchmark Description File. [Online]. Available: http://www.spec.org/cpu2000/CFP2000/301.apsi/docs/301.apsi.html
[113] ——. (2010, April) 179.art SPEC CPU2000 Benchmark Description File. [Online]. Available: http://www.spec.org/cpu2000/CFP2000/179.art/docs/179.art.html
[114] ——. (2010, April) 189.lucas SPEC CPU2000 Benchmark Description File. [Online]. Available: http://www.spec.org/cpu2000/CFP2000/189.lucas/docs/189.lucas.html
[115] ——. (2010, April) 171.swim SPEC CPU2000 Benchmark Description File. [Online]. Available: http://www.spec.org/cpu2000/CFP2000/171.swim/docs/171.swim.html
[116] SAS. (2010, April) SAS: Statistical Analysis Software. [Online]. Available: http://www.sas.com/
[117] (2010, April) The University of Florida High-Performance Computing Center. [Online]. Available: http://www.hpc.ufl.edu/index.php?body=about
[118] D. Eadline, “Low Cost/Power HPC,” Linux Magazine, 2010.
[119] SeaMicro. (2011, January) SeaMicro to Demonstrate SM10000. [Online]. Available: http://www.seamicro.com/
[120] P. Stillwell et al., “HiPPAI: High Performance Portable Accelerator Interface for SoCs,” in Proc. International Conference on High Performance Computing, 2009, pp. 109–118.
[121] F. E. Powers, Jr. and G. Alaghband, “Introducing the Hydra parallel programming system,” in Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures, ser. SPAA ’06, 2006, pp. 116–116.
[122] H. Wong et al., “Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor,” in Proceedings of the 17th international conference on Parallel architectures and compilation techniques, ser. PACT ’08, 2008, pp. 52–61.
[123] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded Up Robust Features,” Lecture Notes in Computer Science, vol. 3951, pp. 404–417, 2006.
[124] R. Budruk, D. Anderson, and T. Shanley, PCI Express System Architecture. Addison-Wesley Professional, 2003.
[125] Intel Corporation. (2011, January) Intel Virtualization Technology for Directed I/O. [Online]. Available: ftp://download.intel.com/technology/computing/vptech/Intel%28r%29_VT_for_Direct_IO.pdf
[126] M. Wakin. (2011, January) Standard Test Images. [Online]. Available: http://www.ece.rice.edu/~wakin/images/
BIOGRAPHICAL SKETCH
Girish Venkatasubramanian was born in Coimbatore, India, in 1981. He attended
GRG Matriculation and Higher Secondary School, India, and graduated with the “Best
Outgoing Student” award in 1999. He obtained his Bachelor of Engineering degree (First
Class with Distinction) in Electrical and Electronics Engineering from PSG College of
Technology, India. During this time he received the “Dean’s Letter of Commendation for
Academic Performance” twice.
Girish was accepted to the Department of Electrical and Computer Engineering
at the University of Florida in 2003, from where he graduated with a Master of Science
degree in 2005 (4.0 GPA) and a Doctor of Philosophy degree in 2011 (4.0 GPA).
During his PhD, he received the University of Florida International Center’s “Certificate
of Achievement for Outstanding Academic Performance” and was selected as an
“Outstanding International Student”.
At the University of Florida, Girish joined the Advanced Computing and Information
Systems (ACIS) Lab and conducted research in areas including computer architecture,
operating systems, virtualization and full-system modeling and simulation. To complement
his academic skills, he also completed internships with Intel Corporation and VMware.
After graduation, Girish plans to take up a full-time position at Intel and work in areas
related to virtualization.