Programmable Peripheral Devices
Patrick Crowley
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195
1 Introduction
Many important server applications are I/O bound. For example, large-scale database mining and
decision support applications are limited by the performance of the storage I/O subsystem, and,
likewise, web servers and Internet backbone routers are constrained by the capabilities of the
network I/O system.
For this reason, many proposals have been made in recent years to migrate functionality from
servers onto programmable storage and network devices in order to improve application
performance. This report surveys these proposals and evaluates research areas concerning these
programmable peripheral devices.
Programmable disks can be used to: scale processing power with the size of large scan-
intensive database problems, build scalable, secure, and cost-efficient storage systems, and
implement sophisticated storage optimizations at the disk.
Programmable network interfaces (NIs) can be used to: unburden the host from managing
data transfers in fast networks, scale processing power with the number of links in a network,
and enable complex packet processing, such as the aggressive program-in-a-packet Active
Network proposal, at network speeds.
Both types of programmable peripherals contain all the components found in computer
systems: processor, memory, and communications subsystem. Based on this observation, this
report concludes with a set of common issues, including technical vulnerabilities and areas for
future research.
This report is organized as follows. Section 2 contains background information and a
historical perspective concerning programmable peripherals. Sections 3 and 4 discuss the designs
for and applications of programmable disks and network interfaces, respectively. Examples of
other programmable peripherals are briefly discussed in Section 5. A set of issues common to
both programmable disks and network interfaces is presented in Section 6. The report concludes
with a summary and brief set of research proposals in Section 7.
2 Background
The seasoned reader will note that programmable I/O devices are, in fact, far from a new idea.
Programmable peripherals have been implemented and abandoned for good reasons in the past.
To consider the arguments against programmability in these devices, we first recall a Pitfall and
Fallacy from the storage and network I/O chapters, respectively, of a popular computer
architecture textbook [Hennessy and Patterson 1996].
Pitfall: Moving functions from the CPU to the I/O processor to improve performance.
An I/O processor, in this context, is a direct memory access (DMA) device that can do more than
shuffle data. The authors are recalling the programmable I/O processors found in classic
machines such as the IBM 360 which, in the 1960s, had programmable I/O channels [Amdahl et
al. 1964, Cormier et al. 1983]. One application of this programmability was support for linked-
list traversal at the I/O processor. (Interestingly, one group recently proposed the addition of
execution engines at each level in the CPU cache memory hierarchy to enable the overlap of
computation and communication in linked-list traversals [Yang and Lebeck 2000].) This
relieved the host CPU of the traversal task, leaving it free to do other work. The argument
against this usage was that the advances in host CPU performance would prove far greater than
the performance advances of the I/O controller in the next generation. Thus, applications that
used and benefited from this optimization in generation N, actually saw decreased performance
when running on the machine of generation N+1. Put plainly, the host CPU was by far the most
expensive and powerful compute element in the system; to bet against it was folly.
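The crossover can be illustrated with a toy calculation (all speeds and growth rates below are assumed for illustration; H&P give no such numbers):

```python
# Illustrative sketch of the generation-N vs. N+1 argument. The relative speeds
# and per-generation growth rates are assumptions, not measured data.
def traversal_time(work, speed):
    return work / speed

host_speed, iop_speed = 1.0, 0.8       # relative speeds in generation N (assumed)
host_growth, iop_growth = 1.6, 1.2     # per-generation improvement (assumed)

def speed(base, growth, generation):
    return base * growth ** generation

# An application hard-wired to run its traversal on the I/O processor falls
# further behind the host-CPU version with every generation.
slowdown = [
    traversal_time(100, speed(iop_speed, iop_growth, n)) /
    traversal_time(100, speed(host_speed, host_growth, n))
    for n in range(3)
]
```

Under these assumed rates, the offloaded traversal is 25% slower than the host in generation N and more than twice as slow two generations later, which is precisely the pitfall.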
Fallacy: Adding a processor to the network interface card improves performance.
In elaborating on this fallacy, the authors argue essentially the same point: the advantages of host
CPU speed.
The issues raised by H&P are answered by the state-of-the-art in embedded microprocessors
today. Rather than lagging far behind high-performance desktop CPUs, embedded
microprocessors now deliver integer performance within a factor of two of their desktop counterparts
[Keeton et al. 1998] (note the co-author on this reference). This change in relative processor
performance has been a consequence of Moore’s Law [Moore 1965]; increasingly, and at all
levels of abstraction, communication is a far scarcer resource than computation.
Generally speaking, increasing I/O performance in a computer system is a matter of cost:
improvements can be achieved by spending more money. Improvements in I/O performance are
costly because I/O components are generally standards-based, with many companies offering
competing, compatible products. For example, any new PC interconnect technology must gain
wide acceptance to achieve the economies of scale necessary to be cost-effective. Thus, only the
most urgent and important problems get solved with expensive, customized I/O systems.
Furthermore, I/O subsystems comprise the bulk of the cost of modern computer systems, despite
being built with commodity components [Hill et al. 2000, editors' introduction to Ch. 7]. I/O
systems are relatively costly since they, generally speaking, do not benefit from Moore’s Law as
do semiconductor devices like processors and memories. Thus, many I/O advances aim to either
reduce the cost for a given level of performance, or improve performance for a given level of
cost.
Programmable microprocessors are an increasingly cost-effective solution for providing
sophisticated control in I/O devices. Advances in VLSI technology have made powerful
embedded microprocessors small, powerful and relatively inexpensive. In fact, most peripherals
common to today's computer systems, including disks, graphics accelerators, and network
interface cards, are built around microprocessors. As mentioned previously, this situation has
sparked numerous research efforts attempting to exploit any excess compute power at peripheral
devices, particularly on devices related to storage I/O and network I/O, to speed I/O intensive
applications in a cost-effective manner. This report surveys the research efforts under way for
programmable disks and programmable networks, unifies and gives context for their common
problems, and identifies areas of future research.
3 Programmable Disks
Compared to modern processors, disks are slow as a result of the physical motion needed to
access data. This fact enables disk manufacturers to implement much of the disk control logic in
software/firmware executed on a microprocessor. Software-based control reduces the number of
electronic and ASIC components on the disk, and therefore reduces cost. In this section, we
consider the design of modern disks and survey the approaches taken by researchers to leverage
this programmability in order to increase application performance.
3.1 Basic Operation
Magnetic disk drives store information in the form of magnetic flux patterns. This encoded data
is arranged on sectors within tracks on a platter, as shown in Figure 1. To store information, the
drive receives blocks of digital data through a host interconnect channel, such as SCSI, maps
block addresses to physical sectors, moves the read/write head over the appropriate disk sector,
and encodes the data as flux patterns that are recorded onto the magnetic surface. Information
retrieval is similar, except data is sensed and decoded rather than encoded and written.
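The block-address-to-sector mapping step can be illustrated with the classic cylinder/head/sector arithmetic (the geometry below is hypothetical; real firmware also handles zoned recording and defect sparing):

```python
# Map a logical block address (LBA) to a (cylinder, head, sector) triple for an
# idealized drive geometry. Real controllers use more elaborate mappings.
HEADS = 4                 # read/write heads, one per surface -- assumed geometry
SECTORS_PER_TRACK = 63    # sectors on each track -- assumed geometry

def lba_to_chs(lba):
    cylinder, rem = divmod(lba, HEADS * SECTORS_PER_TRACK)
    head, sector0 = divmod(rem, SECTORS_PER_TRACK)
    return cylinder, head, sector0 + 1    # sectors are conventionally numbered from 1

def chs_to_lba(cylinder, head, sector):
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + (sector - 1)
```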
As noted by [Ruemmler and Wilkes 1994], modern disk drives contain a mechanism, which
includes the recording and positioning components shown in Figure 1, and a disk controller,
which consists of, among other things, a microprocessor, memory, and a host interface, as shown in
Figure 2.
Recording and positioning components. The overall performance of the disk is dominated by
the engineering tradeoffs found in the disk mechanism. Two different but intimately related
aspects contribute to disk performance: media transfer rate and storage density.
The media transfer rate for a fixed storage density is primarily determined by two common
performance measures: spindle rotation speed and seek time. Very fast spindle rotation requires a
powerful motor, which consumes more energy, and high-quality bearings, which are more
expensive. Seek time refers to the time needed to position the head over a particular cylinder.
Figure 1. Mechanical components of a disk drive. Source: [Ruemmler and Wilkes 1994].
This time is limited by the power of the motor that rotates the arm and the stiffness of the arm
itself.
The storage density for a fixed media transfer rate is a consequence of two forms of density:
linear recording density and track density. The former is constrained by the maximum rate of
magnetic phase change that can be recorded and sensed. Track density refers to how closely
tracks may be packed together on the platter and is the primary source of density improvement.
Track density is influenced heavily by the precision provided by the head positioning and media
sensing mechanism. Both linear and track density are influenced by, and in turn influence, the
speed of the encoding process.
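The interaction between rotation speed and recording density can be made concrete with a back-of-the-envelope calculation (the drive parameters here are hypothetical):

```python
# Back-of-the-envelope media transfer rate and rotational latency for a
# hypothetical drive. Media rate = bytes per track x revolutions per second.
RPM = 7200
BYTES_PER_TRACK = 512 * 300      # 300 sectors/track at 512 B each -- assumed

revs_per_second = RPM / 60.0
media_rate_mbps = BYTES_PER_TRACK * revs_per_second / 1e6   # MB/s off the media
avg_rotational_latency_ms = 0.5 * 1000.0 / revs_per_second  # half a revolution

# Doubling linear recording density doubles bytes per track, and therefore the
# media transfer rate, without spinning the platter any faster.
doubled_rate_mbps = (2 * BYTES_PER_TRACK) * revs_per_second / 1e6
```

This is why density improvements raise the transfer rate "for free," while raising spindle speed costs motor power and bearing quality.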
The read-write data channel encodes and decodes the data stream into or from a pattern of
magnetic phase changes. Error correction is built into the encoded data stream (and DSP
techniques can be used to increase data channel speed), and positioning information is recorded
onto the disk surface by the manufacturer to help determine the location of the head.
3.2 Disk Controller
The disk controller governs the operation of the mechanism described above. The controller
receives and interprets SCSI requests, manages the media access mechanism, manages data
transfers, and controls the cache. The heart of the controller is the microprocessor. The current
trend is to reduce cost and improve performance by replacing electronic components with
software/firmware, augmenting the processor with DSP capabilities, and tightly integrating the
interfaces to hardware, which permits direct control.
Figure 2. The structure of a disk controller and integration trends. Source: [Riedel 1999].
3.2.1 Processor
Since disk performance is limited by the media access rate, which is slow relative to
microprocessor speeds, the control processor does not need to be particularly fast. However,
embedded microprocessor price/performance continues to improve, so, increasingly, disk control
logic, which controls spindle rotation and arm actuation, is being moved into software executed
by the control processor. [Adams and Ou 1997] describe their experience in doing so.
Chip-level system integration is also having an impact. Cirrus Logic sells a system-on-a-chip
disk controller, called 3Ci, that integrates: a 66MHz ARM7 32-bit RISC processor core, disk
control logic, a DSP-based integrated read/write channel (PRML), 48 KB SRAM, 128KB ROM,
and a memory controller for off-chip Flash, SRAM and DRAM memory [Cirrus Logic]. The
next generation of this device will include a 200 MHz ARM core with more on-chip memory.
3.2.2 Memory System
There is considerable semiconductor-based buffer storage (between 64KB and 1MB) on disks
today, and future devices will have even more. Originally, buffer memory performed only rate
matching between the media access rate and the host transfer rate. Today, data caching is used
and, in some cases, provides excellent improvements. Read caching can be performed
optimistically since there is no on-disk penalty associated with reading unnecessary data as the
head moves across a platter; it can simply be discarded. Outside of the disk, however, host-based
prefetching may be affected if the host makes cache content assumptions based on its own
reference pattern. Write caching permits the disk to organize data before writing and to
reorganize disk blocks during operation without interrupting the host CPU. For reliability,
however, write caching is generally implemented in non-volatile memory to avoid losing data if
power fails before the cached data can be written. IRAM [Patterson et al. 1997] [Keeton et al.
1997] has been proposed [Patterson and Keeton 1998] as a good integrated processor and
memory system architecture due to its potential low-latency and high-bandwidth characteristics.
Cirrus, by virtue of their 3Ci device, agrees with this call for integration.
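The optimistic read caching described above can be sketched as a track-granularity read-ahead policy (a simplification; real firmware uses segment caches and adaptive prefetch, and the geometry here is assumed):

```python
# Sketch of optimistic read caching: when a block is requested, the controller
# reads the whole track as it passes under the head and caches it; blocks that
# turn out to be unneeded can simply be discarded. Geometry/policy are assumed.
SECTORS_PER_TRACK = 63

class TrackCache:
    def __init__(self):
        self.cached = {}       # lba -> data
        self.media_reads = 0   # how many times the mechanism was used

    def read(self, lba):
        if lba not in self.cached:
            self.media_reads += 1
            track_start = lba - lba % SECTORS_PER_TRACK
            for s in range(track_start, track_start + SECTORS_PER_TRACK):
                self.cached[s] = f"data@{s}"   # rest of the track is free to grab
        return self.cached[lba]
```

A sequential scan of one track then costs a single media access rather than sixty-three.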
3.2.3 Communication
All high-performance disk drives use the small computer system interface (SCSI). The SCSI
standard defines both an interconnection fabric and programming interface. SCSI interconnects
are parallel busses shared by several devices. Historically, bus-based interconnects have been the
standard for connecting hosts and storage devices. Following the trend seen in LANs, however,
high-performance, serial, point-to-point interconnect technologies like Fibre Channel [FCIA
2000] are rapidly replacing SCSI in server systems. Fibre Channel is a serial interconnect
technology that uses fewer wires than SCSI (4 rather than the 25, 50, 68, or 80 used in various
SCSI generations) and, therefore, has a smaller connector, and is considerably faster (125 MBps
vs. 80 MBps for SCSI-2). SCSI, the programming interface, can be implemented on top of Fibre
Channel.
The SCSI interface has proven to be a successful abstraction between hosts and storage
devices. The SCSI interface frees programmers and the host from having to manage the storage
device and, furthermore, permits the storage device to implement optimizations beneath the
interface. RAID [Patterson et al. 1988] is an example of an optimized system that presents itself
to the host as a standard SCSI device.
However, SCSI is a low-level interface, and one recommendation of the network-attached
secure disk drive (NASD) [Gibson et al. 1997] project is to replace it with a higher-level, object-
based interface to permit devices to better manage data, meta-data and security. With an object-
based interface, the device would manage the storage of blocks that belong to a particular object.
Hence, when a request comes for that object, the drive has knowledge of all the blocks that are of
potential interest. Presently, the interface permits no expression of relationships between blocks.
The object-based interface also simplifies security concerns, which are paramount for disks that
can be accessed directly across a network by multiple hosts, by associating capabilities with
host/object pairs.
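The contrast with the block-level interface, and the capability idea, can be sketched as follows (the API names and the HMAC-based capability format are illustrative assumptions, not the NASD wire protocol):

```python
# Sketch of an object-based, capability-protected read path in the spirit of
# NASD. All names and the capability format are hypothetical.
import hmac, hashlib

DRIVE_KEY = b"shared-secret-with-file-manager"   # assumed key distribution

def make_capability(obj_id, rights):
    # Issued asynchronously by the centralized file manager, not by the drive,
    # so the drive can check requests without contacting the manager.
    msg = f"{obj_id}:{rights}".encode()
    return hmac.new(DRIVE_KEY, msg, hashlib.sha256).hexdigest()

# The drive knows every block belonging to an object and can lay them out freely.
OBJECTS = {7: b"all blocks of object 7, laid out as the drive sees fit"}

def read_object(obj_id, offset, length, rights, capability):
    expected = make_capability(obj_id, rights)
    if not hmac.compare_digest(capability, expected) or "read" not in rights:
        raise PermissionError("bad capability")
    return OBJECTS[obj_id][offset:offset + length]
```

A plain SCSI-style `read_block(lba)` call, by contrast, carries no object or rights information for the drive to act on.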
3.3 Control Software
Modern disks are built around microprocessors, and, accordingly, a software control system is
responsible for governing the operation of the device. The control software is not exposed to the
programmer, and it generally resides on a disk-resident ROM or EEPROM. Disk-based operating
systems proposed in the research literature will be discussed in Section 3.4.1.
3.4 Applications of Programmability
We have examined the design of modern disks and the factors that have made them
programmable. In this section, we survey the manner in which this programmability has been
exploited to solve problems. These proposals fall into two categories: storage systems and
distributed disk-centric applications.
3.4.1 Storage Systems (NASD, Virtual Log-based FS)
The aforementioned NASD project describes a cost-effective, scalable storage architecture with
network-attached and secure programmable disks [Gibson et al. 1997]. Disks directly attached to
the network require changes in the programming interface and security model, as mentioned in
Section 3.2.3. The NASD project addresses these issues, and proposes the following four
characteristics: direct data transfer between drive and client, a capability-based access control
system permitting asynchronous oversight by a centralized file manager, cryptographic integrity,
and an object-based interface. The NASD work culminates in the demonstration of a parallel
distributed file system, built on a prototype NASD, that provides file system support to a parallel
data mining application. Application performance in their prototype system scales linearly with
the number of NASDs.
Another proposal implements a virtual log based file system on a programmable disk [Wang
et al. 1999]. Wang’s file system uses a virtual log, that is, a disk-based log with non-contiguous
entries, to achieve good performance for small synchronous writes while retaining all the
benefits of a log based file system, including transactional semantics. The technique involves
migrating some of the low-level file system implementation to the disk and performing small
atomic writes near the location of the disk head when data arrives. The authors note that these
techniques do not necessitate a programmable disk; the technique only requires a file system
with precise knowledge of disk state.
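The core trick, writing a small synchronous update to whichever free sector will pass under the head soonest, can be sketched as follows (a one-track, rotation-only model; the real system must also account for seek and settle time):

```python
# Sketch of virtual-log eager writing: append the log entry at the free sector
# with the least rotational delay from the current head position. Assumed model.
SECTORS_PER_TRACK = 63

def rotational_distance(head_pos, sector):
    # Sectors until `sector` rotates under the head, modulo one revolution.
    return (sector - head_pos) % SECTORS_PER_TRACK

def pick_write_sector(head_pos, free_sectors):
    # Entries land wherever is fastest, so they need not be contiguous --
    # which is exactly what makes the log "virtual".
    return min(free_sectors, key=lambda s: rotational_distance(head_pos, s))
```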
3.4.2 Distributed Disk-bound Applications (Active Disks, IDISK)
A number of researchers [Acharya et al. 1998, Gray 1998, Keeton et al. 1998, Riedel et al. 1998]
have proposed executing application-level code on programmable disks, in particular NASDs or
IDISKs, as a means of scaling processing power with the size of very large data sets in certain
scan-intensive database problems. While database machines, which scaled processing power
with the number of read/write heads on a single disk, failed in the 80s, these researchers contend
that there are now important applications that need to scale processing power with data set size.
Researchers from CMU describe target applications as those that: 1) leverage the parallelism
available in systems with many disks, 2) operate with a small amount of state, processing data as
it "streams" past, and 3) execute few instructions per byte of data [Riedel et al. 1998]. Most of
these applications are scan-intensive database operations used in data-mining where the same
queries are run over all data, producing a result set that requires further processing. This
approach makes use of the processors in all disks, and does not require all data to be sent across a
host I/O bus for processing.
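A computation with these three properties, one pass over streaming data, small state, few instructions per byte, might look like the following (a generic sketch, not code from any of the cited systems):

```python
# Sketch of a scan-intensive "disklet": filter records as they stream off the
# platter, keeping only a small running state. Record format is assumed.
def scan_filter(records, predicate):
    selected, seen = [], 0
    for record in records:       # data is processed as it streams past the head
        seen += 1
        if predicate(record):
            selected.append(record)
    return selected, seen
```

Only the (typically small) result set then crosses the interconnect to the host for the further processing the report describes; the bulk of the data never leaves the disk.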
The Active Disk literature proposes a programming model [Acharya et al. 1998] and an
analytical model [Riedel et al. 1998]. The programming model proposed by the group from UC
Santa Barbara/Univ. of Maryland is simple and calls for stream-based message passing. In this
model, the disk runs DiskOS, a disk-resident OS which handles memory management and stream
communication, and the application developer partitions code between the host and the disks.
This approach is fine for big problems that justify customization, as is the case for certain
problems solved by message-passing multiprocessors, but is far from a comprehensive
programming model. We return to this issue and the larger issue of software models in Section
6.2. The analytical model from CMU is designed to give intuition about the performance of an
active disks system compared to a traditional server [Riedel et al. 1998]. This model, in addition
to most of the arguments from the Active Disk literature, primarily speaks to and argues for
distributing these computations and, therefore, applies to other scalable approaches as well.
Clusters of inexpensive machines are another way of doing this [Arpaci-Dusseau et al. 1998]. In
fact, the arguments for a distributed, serverless file system were laid out in xFS [Anderson et al.
1996], which organized all workstations on a network as peers providing file system services.
The question is whether it is more cost-effective to run the software at each disk, or on a PC that
manages a few disks. The Santa Barbara group compared clusters to active disks for a set of
target applications and found that they were equivalent in terms of performance [Uysal et al.
2000]. However, the active disk solution was 60% less expensive, given late ’99 prices and the
authors’ bargain-hunting skills.
The IDISK project from Berkeley specifically argues for independent disks with considerable
processing power and memory that are capable of autonomous communication; in particular,
they state the case against clusters with respect to IDISKs. They point out four weaknesses in
cluster architectures: 1) the I/O bus bottleneck, 2) system administration challenges, 3)
packaging and cost difficulties, and 4) inefficiency of desktop microprocessors for database
applications. The first three items are clear. The fourth points out that desktop microprocessors
are slightly more powerful than embedded microprocessors when executing database codes, but
are far more costly. We return to this point in discussion of the future of programmability in
Section 6.1. The IDISK work distinguishes itself from Active Disks primarily by arguing for
considerable resources on each disk; the current Active Disks proposals seek out applications
that require minimal computation per byte.
3.5 Summary
In this section, we have surveyed the motivations, designs and applications for programmable
disks. The most compelling motivation for executing application code on these devices is the
need to scale processing power with problem size; data-set sizes for important disk-bound
problems continue to grow rapidly. However, cluster-based systems are already in use for this
purpose, and they must be displaced for Active Disks to become a reality.
Special processor and memory architectures -- other than IRAM -- have not been investigated
for disks, as they have for network interfaces as we will see, because disk performance is already
limited by the mechanical speeds of the spindle and arm. Any microprocessor performance
increase will have only a marginal effect on the overall performance of the disk, unless the
added capability improves the physical operation of the mechanism itself, as when DSP
techniques are used to increase effective density [Smotherman 1989]. The IDISK proposal contends that there is clear benefit,
however, in improving the computational resources afforded application level code executing at
the disk, presuming, as do the Active Disk proponents, that additional processing power can be
added for marginal cost.
4 Programmable Network Interfaces
Network performance is increasing dramatically, outpacing the increase in memory speeds, with
no end anticipated in the near future [Schoinas and Hill 1998]. This fact, coupled with the
limitations in server I/O bus performance described in the previous section, has motivated high-
performance NI designs built around powerful microprocessors that require minimal host CPU
interaction.
Network interface design issues have traditionally been categorized according to network
type: local-area networks (LAN), system-area networks (SAN) and massively parallel processing
networks (MPPs). Since the reliability, bandwidth, and latency characteristics of these network
types are converging, the primary distinction that remains, one that is a large performance factor,
is the location of the NI/host connection. LAN and SAN NIs typically connect to the host I/O
bus. MPP NIs ordinarily attach to the node processor’s memory bus or processor datapath [Dally
et al. 1992]. In this report, we focus on the LAN/SAN type NI. However, as we shall see, the
integration of network processors on these devices raises many of the same issues confronting
MPP NIs.
While no two NIs are identical, in the following discussion of NI operation and design, we
use the Myrinet [Boden and Cohen 1995] host interface, as depicted in Figure 3, as a running
example. The Myrinet system area network (SAN) was a ground-breaking advance in
interconnect technology. It was the product of two academic research projects, namely Caltech’s
Cosmic Cube [Seitz 1985] and USC’s Atomic LAN [Cohen et al. 1992]. The Myrinet host
interface is similar in the important ways to other high-performance interfaces, and the bulk of
the differences lie in the relative sophistication (or lack thereof) of the network processor.
4.1 Basic Operation
A switched network such as Myrinet consists of host network interfaces and switches. Myrinet
switches range in size from 4 to 16 ports. Myrinet is self-configuring and source routed – it uses
the blocking-cut-through (wormhole) routing technique found in MPP systems such as the Intel
Paragon and the Cray T3D. The switches have no hard state or software; they only steer the
source-routed packets. Moore's Law has helped make switch-based networks economical since
switches and crossbars can be implemented on a single chip.
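Source routing is what lets the switches stay stateless: the route travels in the packet itself, and each switch simply consumes the leading route entry to select an output port. A minimal sketch (header layout assumed):

```python
# Sketch of source routing: the header is a list of output ports chosen by the
# sender; each switch pops the head of the route and forwards. Layout assumed.
def switch_forward(packet):
    route, payload = packet
    out_port, remaining = route[0], route[1:]   # switch keeps no routing state
    return out_port, (remaining, payload)

def traverse(packet, num_switches):
    ports = []
    for _ in range(num_switches):
        port, packet = switch_forward(packet)
        ports.append(port)
    return ports, packet
```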
Upon packet arrival, the link, or packet, interface handles framing, error detection, and the
media access protocol. The link interface accepts a frame via the media access protocol, checks
for errors via cyclic redundancy checks (CRCs), and writes the frame into the buffer memory.
Minimally, this buffer is used to cope with asynchrony between the network link and the host
interface. In many high-performance NIs, additional processing of the packet takes place, as
discussed further in Section 4.2.1. Once all processing is complete, the packet is moved via
DMA into the host CPU’s memory.
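The receive path just described, frame in, CRC check, staging buffer, DMA to host memory, can be sketched as follows (a simplification: framing and the media access protocol are elided, and generic CRC-32 stands in for any particular link's CRC):

```python
# Sketch of the NI receive path: verify the CRC, stage the frame in NI buffer
# memory, then "DMA" it into host memory. CRC-32 stands in for the link's CRC.
import zlib

host_memory = []     # stands in for host DRAM reachable by DMA

def make_frame(payload):
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def receive(frame, buffer):
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != crc:
        return False                      # drop corrupted frame
    buffer.append(payload)                # absorbs link/host asynchrony
    host_memory.append(buffer.pop(0))     # DMA transfer into host memory
    return True
```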
Unlike disks, networks are fast compared to modern processors; high-performance NIs are too
fast for host I/O busses. For example, one link in a Myrinet LAN carries 1.28 Gbps (160 MBps)
in each direction, which exceeds the 132 MBps peak bandwidth of the common 32-bit, 33 MHz
PCI I/O bus, a bandwidth shared by all I/O devices. As previously mentioned, networks have been
increasing in speed and bandwidth faster than memory [Schoinas and Hill 1998]. Consequently,
the performance of the microprocessor, memory and operating system that run the interface card
can have tremendous influence on the capabilities of the device for certain applications. The
design of high-performance processors and execution environments has become a research issue,
and, in the case of network processors, a fledgling industry full of start-up companies and
established semiconductor vendors. A major push behind these efforts is the need to meet the
increasing bandwidth and functionality requirements of the Internet. Traditionally, the middle of
the network has been kept simple and fast, with sophistication being implemented at the edges.
However, to meet these demands, functionality is being pushed from servers on the edge of the
network onto internal network nodes. Sophistication is moving into the network in the form of
application data caching, tunneling, content distribution techniques, etc. There remain
proponents on both sides of this issue. However, it seems unlikely that the services being
deployed in the network today can ever be reined back in.
The tremendous momentum behind Internet related technologies has inspired much research
in: network interface design, communications based operating systems research, and large-scale
systems research that implement network services. In this section, we first consider the design of
modern network interfaces, including the design alternatives investigated in the research
literature. Then, we survey various proposals for exploiting the programmability in network
interfaces.
4.2 NI Organization
As was the case with disks, modern NIs have all the components found in computer systems:
processor, memory, and a communications subsystem. Figure 3 depicts how the Myrinet host
interface and most NIs are organized. In this section, we consider each of the major components
individually.
4.2.1 Processor
The most important function of the processor is to manage packet delivery and protocol specific
tasks. In Myrinet, for example, each network has a manager (chosen manually or automatically
by the network) that is responsible for continuously mapping the network by sending messages
to all hosts. This mapping enables source routing. So, in addition to managing host packets, the
processor must also adhere to the control protocol of the network.
The LANai processor, found on the Myrinet NI, is a simple, 32-bit RISC processor clocked at
33 MHz with integrated link and host interfaces. The LANai is a relatively meager processor
compared to processors found on other high-end devices. For example, the 3Com 3CR990
ethernet network interface is built around a 200 MHz ARM9 processor core; this device
aggressively handles security (IPsec) and TCP segmentation and reassembly completely on the
NI [3Com 2000].
Research directions in NI processors can be grouped in two categories: communication
processors and network processors. The first category emphasizes low-cost interrupt and
message handling, and the second focuses on higher-level packet processing.
Communication Processors. The I/O processors described earlier relieved the host CPU of
the details of managing data transfers. Similarly, communication processors are used on
NIs to manage data movement. Rather than using polling or interrupts on the host (or network
processor), NIs use smaller, less powerful communication processors to poll the network for
data. Significant work on programmable communication processors has been done in the context
of message-passing MPPs. For example, the Stanford FLASH multiprocessor project developed
a communication processor called MAGIC to manage the movement of all data between host
CPU, memory and the network [Kuskin et al. 1994]. MAGIC managed all data movement on a
processor node; thus, in addition to unburdening the host CPU, cache coherence and
communication mechanisms were implemented in software.
Figure 3. Myrinet NI structure. Source: [Boden and Cohen 1995].
In the following paragraphs, we
consider: the effectiveness of communications processors, the feasibility of zero-overhead
application message handling on communication processors, and design concerns specific to
communication processors built around general-purpose microprocessors.
Recently, [Scheiman and Schauser 1998] evaluated network performance in an MPP both
with and without a communication processor using the Meiko CS-2 multiprocessor. Results
indicate that implementing application, or user, level message handlers on a communication
processor, despite being slower than the host CPU, improves latency. The authors report that the
improvement is due to (1) the faster response time of the communication processor, and (2) the
task offloading that frees the main processor from polling or handling interrupts. Similar
performance evaluations have been published concerning the CMU Nectar communication
processor [Menzilcioglu and Schlick 1991, Steenkiste 1992].
Much work has been done on the support needed to perform zero-copy application and user
level messaging on high-speed NIs [Chien et al. 1998, Mukherjee 1998, von Eicken and Vogels
1998]. In one case, [Schoinas and Hill 1998] show that it is possible to perform zero-copy
application-level messaging in software on a communications processor. This minimal
messaging attempts to move data directly from the NI into the host data structures in main
memory. (Here, host refers to either the host CPU or the network processor, depending on which
is receiving the data.) The key issue is providing efficient virtual/physical memory address
translation [Chen et al. 1998] on the network interface.
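Minimal messaging hinges on the NI translating the destination's virtual addresses itself, so that payloads land directly in application data structures. A toy sketch (page size and the pre-registered translation table are assumptions; real designs must also handle misses and page pinning):

```python
# Sketch of NI-resident virtual-to-physical translation for zero-copy delivery.
# The page size and the pre-registered translation table are assumptions.
PAGE_SIZE = 4096
ni_tlb = {0x10: 0x7A, 0x11: 0x03}   # virtual page -> physical frame (registered)

def translate(vaddr):
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    try:
        return ni_tlb[vpage] * PAGE_SIZE + offset
    except KeyError:
        # A miss forces a round trip to the host, losing the fast path.
        raise LookupError("translation miss")

def deliver(physical_memory, vaddr, payload):
    # Data moves straight from the NI into the host data structure -- no
    # intermediate copy through a kernel buffer.
    paddr = translate(vaddr)
    physical_memory[paddr:paddr + len(payload)] = payload
```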
Finally, some recent work describes the support needed to implement low level operations
[Cranor et al. 1999] involved with network-specific data transfer on microprocessor-based
communication processors. Specifically, they use multiple thread contexts to limit the overhead
involved with servicing DMA completion interrupts. This helps reduce overall message latency
when using a general-purpose embedded processor. This technique improves communication
processor performance by replacing costly polling with low-overhead interrupts. In cases where
additional packet processing requirements are low, this approach can remove the need for
separate communication and network processors.
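The multiple-thread-context idea can be illustrated with ordinary threads: one context drains DMA-completion events while the main context keeps posting work, so completions never stall the critical path. This is a minimal sketch in the spirit of [Cranor et al. 1999]; the event queue and descriptor recycling are invented for illustration:

```python
import queue
import threading

completions = queue.Queue()   # DMA-completion "interrupts" delivered as events
recycled = []                 # descriptors returned to the free pool by the handler

def completion_handler():
    """Dedicated context: services completions so the main context never blocks."""
    while True:
        desc = completions.get()
        if desc is None:      # shutdown sentinel
            break
        recycled.append(desc)

handler = threading.Thread(target=completion_handler)
handler.start()
for desc in range(4):         # main context posts transfers and moves on
    completions.put(desc)
completions.put(None)
handler.join()
print(recycled)               # prints [0, 1, 2, 3]
```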
Network Processor. The distinction between a communication processor and a network
processor is far from settled. A general description would be that communication processors
handle low-level data-link protocol details (e.g., ethernet or Myrinet specifics) and message
handling, while network processors perform high-level, network and transport layer processing
(e.g., IP and TCP/UDP processing). The tasks carried out by the communication processor are
the tasks traditionally performed by NIs. The network processor implements functionality found
in host device drivers and applications. This distinction is relatively new, but as the processors
found on NIs increase in power, the need for a communication processor to unburden the
network processor will grow for the very reasons discussed above.
Industry is producing numerous network processors, most of which employ chip-
multiprocessor or fine-grained multithreaded processor architectures, to provide high-
performance on the NI [Crowley et al. 2000]. For example, the recently announced Prism
network processor from Sitera [Sitera 2000] is a 4 processor chip-multiprocessor with hardware
support for packet classification and quality of service.
Our work, performed here at UW, made the following contributions to network processor
research: 1) identified a set of network processor workloads, 2) showed that chip-multiprocessors
(CMP) [Nayfeh et al. 1996] and simultaneous multithreaded architectures (SMT) [Tullsen et al.
1995] can exploit packet-level parallelism, while aggressive superscalar and fine-grained
multithreaded architectures cannot, 3) showed that packet classification can be performed
economically in software on network processors, and 4) showed that SMT adapts better to the
variability in real multiprogrammed workloads [Crowley et al. 2000, a paper describing parts 3
and 4 is currently under review]. A problem with this work is that it only reports throughput. The
results give no intuition about how these processor designs impact latency, an oversight warned
against by [Hennessy and Patterson 1996]. It is likely that latency considerations are slight, given
the wide-area applications considered, but the subject should have been addressed.
4.2.2 Memory System
Memory serves as the bridge between the processor and the network. Network interfaces
typically use high-speed SRAM to buffer packets. The bandwidth and latency characteristics of
the memory system figure prominently in the amount of processing that can be performed on
each packet at network speeds. Surprisingly, there has been little reported in the research
literature on memory systems for high-performance network interfaces.
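A back-of-the-envelope calculation shows how tight this per-packet budget is; the link rate, minimum packet size, and clock frequency below are illustrative, and framing overheads (preamble, inter-frame gap) are ignored:

```python
def per_packet_budget(link_bps, pkt_bytes, clock_hz):
    """Worst-case time and processor cycles available per packet."""
    seconds = (pkt_bytes * 8) / link_bps   # time for one packet on the wire
    cycles = seconds * clock_hz
    return seconds, cycles

# 1 Gb/s link, 64-byte minimum packets, 200 MHz embedded processor
t, c = per_packet_budget(1e9, 64, 200e6)
print(f"{t * 1e9:.0f} ns per packet, ~{c:.0f} cycles")   # prints 512 ns per packet, ~102 cycles
```

At these rates, every memory access in the packet-handling path matters, which is why SRAM buffering and memory latency dominate the design.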
The general question of how to architect the memory system is open, although there is
significant discussion of this in the trade news. For example, the Prism network processor from
Sitera uses optional SRAM and is the first network processor to integrate a RAMBUS memory
controller [Crisp 1997].
Recent proposals have appeared in the trade news for integrated DRAM memories on network
processors. The idea of using IRAM to buffer packets is somewhat obvious, but unexplored.
However, latency is a very important consideration, and standard approaches of integrating
processors and memory may not help [Cuppu et al. 1999].
One related proposal uses a standard CPU cache memory to implement single-cycle IP route
lookups [Chiueh and Pradhan 1999]. Follow-on work by the same authors adds cache
modifications that increase the effectiveness of this technique [Chiueh and Pradhan 2000]. This
work is rooted in finding longest-matching prefixes with the dynamic prefix trie ("trie" as in retrieval)
data structure [Doeringer et al. 1996]. The technique uses the cache as a hardware assist for
performing fast matches between addresses and next hop values; in essence, IP addresses are
treated as virtual memory addresses.
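The underlying longest-matching-prefix search can be sketched with a plain binary trie; this illustrates the generic data structure, not the authors' cache-based implementation, and the routing entries are invented:

```python
class PrefixTrie:
    """Binary trie for IP longest-matching-prefix lookup (generic sketch)."""
    def __init__(self):
        self.root = {}

    def insert(self, prefix, length, next_hop):
        node = self.root
        for i in range(length):
            bit = (prefix >> (31 - i)) & 1     # walk from the most significant bit
            node = node.setdefault(bit, {})
        node['hop'] = next_hop

    def lookup(self, addr):
        node, best = self.root, None
        for i in range(32):
            if 'hop' in node:
                best = node['hop']             # remember the longest match so far
            bit = (addr >> (31 - i)) & 1
            if bit not in node:
                return best
            node = node[bit]
        if 'hop' in node:
            best = node['hop']
        return best

def ip(s):
    a, b, c, d = (int(x) for x in s.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

t = PrefixTrie()
t.insert(ip('10.0.0.0'), 8, 'A')
t.insert(ip('10.1.0.0'), 16, 'B')
print(t.lookup(ip('10.1.2.3')))   # prints B -- the longer prefix wins
```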
4.2.3 Communication
Network interfaces connect to the network medium on one side and to the host interface on the other.
With an eye toward large-scale routers, several companies are developing very fast back-end
interconnects that permit high-bandwidth, low-latency transfers between network interfaces. The
proposals from the common switch interface (CSIX) consortium [CSIX 2000] and IX
architecture forum [LevelOne 1999] are seeking to standardize these interfaces. Not
surprisingly, research on interconnects for message-passing multiprocessors has inspired these
efforts. For example, Avici Systems, a start-up founded by Bill Dally of Stanford/MIT,
essentially uses the J-Machine, and in particular its interconnect, to build a terabit router [Avici
2000].
4.3 Control Software
Control on the Myrinet interface is the responsibility of the Myrinet control program (MCP) that
is loaded into device memory on boot-up. The MCP implements network specific control
processing, such as network mapping, and handles DMA requests both within the interface and
into host memory. The research community has embraced Myrinet due in large part to its open
interfaces and open, modifiable MCP [Bhoedjang et al. 1998].
Key challenges in the design of control software include supporting low-overhead user-level
messages, since trapping into the OS for permission checks is too slow, and minimizing the
number of copy operations [Eicken et al. 1995]. A good survey of efficient techniques for user-level
messaging is provided by [Bhoedjang et al. 1998]. Communication in general is latency-sensitive;
in many cases latency matters just as much as throughput. SPINE describes a safe,
extensible system for executing application-specific code on programmable NIs [Fiuczynski et
al. 1998].
4.4 Applications
We have considered the motivations for and designs of programmable NIs. In this section, we
survey the applications and techniques that have been proposed for exploiting this
programmability.
4.4.1 Fast LANs/System Area Networks
As mentioned previously, Myrinet employs host controllers that were built around
microprocessors. Since the introduction of Myrinet, manufacturers have produced solutions for
fast LANs employing the same technique. For example, Asanté’s GigaNIX gigabit ethernet NI is
built around two 32-bit embedded RISC processors [Asanté 2000]. Programmability is generally
preferred in these devices because the network-specific control (e.g., network mapping, flow
control) is more easily and economically implemented in software on cost-effective embedded
microprocessors.
4.4.2 Computing at Network Speeds
Emerging network applications and services require a fast path that does not involve the latency
penalty associated with crossing the I/O bus to get to the host CPU. Examples of such services
include: IPsec, routing, server load-balancing, and quality-of-service (QoS). The additional
latency to get to the host CPU makes these services infeasible at network speeds. Hence, a
network processor is included on the network interface to execute these applications.
These applications are indicative of the general trend of pushing more computation and
sophistication into the network, as discussed in Section 4.2.1. Other examples of this trend
include web caching, network-address translation (NAT), firewalls, and virtual private networks
(VPNs). This trend has the potential to radically increase the computational resources required at
each link in the network. The execution of many applications at network speeds requires a
significant amount of processing power that, furthermore, scales with the number of network
connections in a computer system. This trend has particular significance for services running at
the backbone of large internetworks.
A special class of machines, traditionally called routers, service many network links
simultaneously. It is particularly necessary to execute network services on network interfaces in
these devices so that processing power, and hence overall service performance, can scale with
the number of links. A number of researchers have proposed the use of high-performance
programmable network interfaces connected via a fast interconnect to implement large-scale
routing systems [Peterson et al. 1999, Walton et al. 1998]. This proposal closely matches what is
actually taking place in industry.
4.4.3 Active Networks
Active networks is a new approach in network design that provides a customizable infrastructure
to support the rapid evolution of new transport and application services by enabling users to
upload code into programmable network nodes. This is the most aggressive example of
computing at network speeds: each packet can contain a unique program. The last few years have
seen considerable coverage of active network research. Directions of inquiry have included
designs for: software platforms and programming models [Wetherall et al. 1999] [Hicks et al.
1999], active network node architectures [Decasper et al. 1999] [Nygren et al. 1999], and
operating systems for active nodes [Merugu et al. 2000] including emphases on QoS [Alexander
et al. 2000] and security [Campbell et al. 2000]. Two recent articles comment on the results thus
far [Smith et al. 1999] and lessons learned [Wetherall 1999]. This proposal poses big challenges
in safety, performance, and management.
4.5 Summary
The preceding section surveyed the motivations, designs and applications for programmable
network interfaces. More so than for disks, innovative designs and research proposals for
programmable NIs are being investigated to meet the growing performance and functionality
requirements of next-generation networks.
5 Other Examples
There are other examples of peripheral devices that are now programmable, including graphics
display adapters and printers. Graphics display adapters have for many years implemented
graphics pipelines and other display primitives at the device. Considerable work has been done
on graphics- and media-specific programmable architectures [Basoglu et al. 1999, Rixner et al.
1998]. The relatively high bandwidth required for graphics on consumer PCs led Intel to devise
the accelerated graphics port (AGP) [Intel 2000]. The AGP bus uses the same I/O “switch” as the
processor and main memory in order to “fatten and shorten the pipe” between the processor on
the graphics card and main memory. Intel had noticed that graphics cards were beginning to ship
with significant amounts of memory, forcing the primarily graphics-based multimedia processors
to manage memory in a fashion similar to the host CPU. AGP also helps Intel’s MMX
extensions speed graphics processing in ways that were infeasible across the standard
peripheral I/O bus.
PostScript printers have been programmable I/O devices from the beginning [Tennenhouse
and Wetherall 1996]. A PostScript document is, in fact, a program generated by an application,
then sent to the printer and interpreted by the printer's control microprocessor.
6 Common Issues
In this section, we consider a set of issues common to both programmable disks and
programmable NIs.
6.1 The Future of Programmability
Is programmability here to stay? There are at least two reasons why these devices may cease to
be programmable: 1) ASICs become more cost-effective at implementing the required
functionality, or 2) vastly faster host CPUs connected to passive peripherals via fast, switched
I/O networks make relatively slow embedded processors a performance liability.
The first issue presumes that these devices do not need the flexibility offered by software, and
is, to a large extent, answered by the state of the industry today. The integration of a
microprocessor core with device specific hardware and interfaces seems to be the preferred
solution for the time being. However, if application specific hardware design were to
unexpectedly become fast and cost-effective, this could change. A general framework for
intelligent I/O devices is gaining industry support, and will likely help keep these devices
programmable [I2O Special Interest Group 1997].
The second issue is a more serious challenge. If the performance gap between desktop CPUs
and embedded processors begins to widen, and desktop machines adopt a fast switched I/O
interconnect between the CPU and peripherals [Mukherjee and Hill 1997], then clustering
passive, low-end disks with powerful CPUs may be more cost-effective than a system of high-
end active peripherals. [Keeton et al. 1998] do not expect this to happen. They cite cost, power,
and price differences between desktop and embedded microprocessors ranging between
5X and 20X that translate into SPECint 95 performance differences of only 1.5X to 2X. Desktop
CPU markets can afford to pay heavily for marginal improvements in performance on SPECint
95. However, these improvements are not justified or beneficial in embedded systems. [Uysal et
al. 2000] show that performance is comparable between clusters and active disk systems on the
workloads that inspired active disks; the only difference is cost. These researchers report active
disk system costs at less than half of the cluster system cost. Regardless, since people the world
over are currently programming clusters to solve real problems, and active disks do not exist yet,
this issue remains a serious challenge.
In any case, the trend of system-level integration, which leads to systems-on-a-chip (SOCs), is
not likely to stop any time soon, particularly since communication, and wires, will remain the
expensive resource as memory and compute resources become nearly infinite. This trend favors
any solution with a high computation-to-communication ratio.
6.2 Research Directions
In this section, we propose a number of research issues that face programmable peripherals. As
distributed systems go, the environmental conditions for systems of network peripherals are
pleasant: a high-performance and reliable interconnect, reasonable processing and memory
resources, and a single administrative domain. Given this environment, the items listed here can
be considered properties necessary or desirable in systems comprised of programmable
peripherals.
6.2.1 Comprehensive programming model
Programming model concerns include safe extensions, reasonable host/peripheral interaction,
reasonable host/host interaction, and reasonable support for scalable software. A number of
proposals have recommended programming models for individual device functions. One active
disk proposal [Acharya et al. 1998] describes a programming model for partitioning certain
database applications between hosts and disks, but the model does not address interactions with
or support for other types of applications. Furthermore, the programmer manages all
communication. This recalls the programming models for message-passing multiprocessors,
which are hard to program since all applications require heavy customization. For network
interfaces, pattern-based languages [Begel et al. 1999, Engler and Kaashoek 1996] and object-
oriented systems [Morris et al. 1999] for packet-classification, filtering and routing have been
proposed. These are heavily used, but no proposals have been made to integrate these functions
into a comprehensive programming model for network interfaces.
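The flavor of such packet-classification systems can be sketched as an ordered list of (predicate, action) rules where the first match wins; the rules and field names here are invented, and real systems like BPF+ compile such rules rather than interpret them:

```python
# Each rule pairs a predicate over packet fields with an action; first match wins.
RULES = [
    (lambda p: p['proto'] == 'tcp' and p['dport'] == 80, 'web_queue'),
    (lambda p: p['proto'] == 'udp' and p['dport'] == 53, 'dns_queue'),
    (lambda p: True, 'default_queue'),                     # catch-all rule
]

def classify(pkt):
    """Return the action of the first rule whose predicate matches the packet."""
    for predicate, action in RULES:
        if predicate(pkt):
            return action

print(classify({'proto': 'tcp', 'dport': 80}))   # prints web_queue
print(classify({'proto': 'icmp', 'dport': 0}))   # prints default_queue
```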
The software system also needs to provide protection from untrusted, malicious, and faulty
code. The SPINE operating system [Fiuczynski et al. 1998] advocates the use of safely
extensible operating systems to govern the operation of these programmable peripherals. This
notion is descended from the work developed in user-extensible operating systems such as SPIN
[Bershad et al. 1995] and Exokernel [Engler et al. 1995].
6.2.2 Platform independence
Traditional disks and network interfaces are integrated into server and workstation operating
systems through device drivers. As application code migrates onto these peripherals, however,
application code compatibility becomes an issue. Across-the-board device compatibility is
necessary to keep the programmable peripheral market a commodity market, and therefore price
competitive with passive peripherals. This can be achieved via object-based interfaces, such as
CORBA and COM, or through the use of platform independent binary code executed on virtual
machines such as the Java VM [Sirer et al. 1999]. Platform independence is tightly integrated
with the overall design of the execution environment.
6.2.3 Support for unbalanced performance
Applications commonly exhibit "hot spots" in which certain portions of data require a relatively
greater amount of work. This issue has not been raised with the full-scan database workloads
initially considered for Active Disks; however, it will be a concern for more general-purpose
applications of programmable disks. Similarly, as components fail, it is likely that older devices
will be replaced with newer ones with greater performance. This introduces an imbalance in the
ideal partitioning of work onto devices. The general need is for execution environment load
balancing support for applications and devices with varying performance characteristics.
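A minimal load-balancing policy along these lines is to repartition work in proportion to each device's measured throughput; the proportional policy and the rates below are assumptions for illustration:

```python
def partition(total_items, device_rates):
    """Split total_items across devices in proportion to their throughput."""
    total_rate = sum(device_rates)
    shares = [total_items * r // total_rate for r in device_rates]
    shares[0] += total_items - sum(shares)   # hand the rounding remainder to one device
    return shares

# Two original devices plus one replacement that is twice as fast
print(partition(1000, [1, 1, 2]))   # prints [250, 250, 500]
```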
6.2.4 Support for multiprogrammed workloads
The execution environment must also support multiple tasks simultaneously. The initial Active
Disk proposals have avoided this completely, limiting their studies to single applications. SPINE
has started to address this for NIs. In addition to balancing between applications, in devices such
as disks and network interfaces there are elements of real-time constraints that must be managed.
A disk controller, for example, must be able to schedule disk arm movements along with client
requests and buffering tasks, simultaneously. It is unclear that the resource sharing techniques
implemented in standard operating systems will work well under these conditions. With network
interfaces, certain applications, such as guaranteeing a particular quality of service, may require a
tighter integration between the mechanisms allocating processor resources and the mechanisms
allocating network resources.
6.2.5 Support for centralized control
As noted in [Sirer et al. 1999], centralized control makes some difficult problems much easier.
These problems include managing software versions, security, auditing, and performance
analysis. Centralized control that does not limit performance is a valuable property in distributed
systems; an execution environment for programmable devices should provide it.
[Keeton et al. 1998] contend that IDISKS will do away with much of the administrative costs
associated with clusters by assuming that IDISKS will incur the support and maintenance costs
of disks rather than cluster nodes. This may be the case since, for example, diagnostic checks
may be simpler in integrated devices as compared to desktop-like systems where many
components may fail and need to be tested separately. In general, taking steps to ease the
administrative problem in distributed systems is a valuable line of research with enormous
potential impact.
7 Conclusion
This report has surveyed the motivations, designs and applications for programmable disks and
network interfaces. These programmable peripherals have all the basic components found in
computer systems: a microprocessor, memory, and a communications subsystem. Given today’s
technology trends, embedded processors, and, hence, programmability, are likely to have a
permanent place in these devices. In order to improve I/O performance, a number of proposals
have been made for migrating data- and application-specific functions from servers onto these
devices. Initial implementations have provided functionality and support for specific tasks, such
as a decision support database environment for disks and language and programming systems for
packet classification on network interfaces. Based on these findings, a remaining challenge for
this general approach is to provide a software model that incorporates a comprehensive
programming model and the right set of libraries and OS services, including security and
resource management, needed in these peripheral environments.
This report concludes with some specific areas of future research. These directions focus on
programmable network interfaces, and, in particular, network processor design, which is the
author’s current field of research.
1. Memory system design for network processors. This study compares the performance of
modern and proposed memory technologies and cache hierarchies for a selection of network
interface organizations.
2. Analytical performance model for network processors. This model incorporates the
parallelism found in network workloads, the parallelism exploited by modern network
processor architectures, and includes memory system parameters. The purpose is to guide
network interface provisioning and give intuition concerning the relative importance of
processor and memory improvements.
3. Thread scheduling on SMT for network processor workloads. Initial results have suggested
that more sophisticated thread scheduling policies on SMT may be beneficial for network
processor workloads. This study examines ideal resource allocation for these workloads and
investigates scheduling policies on SMT that approximate the ideal.
References
[3Com 2000] 3Com Corp. The EtherLink® 10/100 PCI NIC with 3XP Processor. http://www.3com.com/technology/tech_net/tech_briefs/500907.html, 2000.
[Acharya et al. 1998] A. Acharya, M. Uysal, and J. Saltz. Active Disks: programming model, algorithms and evaluation. Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 81-91. San Jose, November 1998.
[Adams and Ou 1997] L. Adams and M. Ou. Processor Integration in a Disk Controller. IEEE Micro vol. 14, no. 4, July 1997.
[Alexander et al. 2000] D.S. Alexander, W.A. Arbaugh, A.D. Keromytis, S. Muir, and J.M. Smith. Secure Quality of Service Handling: SQoSH. IEEE Communications vol. 38, no. 4, pp. 106-112, April 2000.
[Amdahl et al. 1964] G.M. Amdahl, G.A. Blaauw, and F.P. Brooks, Jr. Architecture of the IBM System/360. IBM Journal of Research and Development, April 1964.
[Anderson et al. 1996] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson, D.S. Roselli, and R.Y. Wang. Serverless Network File Systems. ACM Trans. on Computer Systems vol. 14, no. 1, pp. 41-79, Feb. 1996.
[Arpaci-Dusseau et al. 1998] R.H. Arpaci-Dusseau, A.C. Arpaci-Dusseau, D.E. Culler, J.M. Hellerstein, and D.A. Patterson. The Architectural Costs of Streaming I/O: A Comparison of Workstations, Clusters, and SMPs. Proceedings of the HPCA. Las Vegas, 1998.
[Asanté 2000] Asanté Technologies. GigaNIX Gigabit Ethernet Adapter. http://www.asante.com/new/2000/GigaNIX.html, 2000.
[Avici 2000] Avici Systems. The Avici Terabit Switch Router. http://www.avici.com, 2000.
[Basoglu et al. 1999] C. Basoglu, R. Gove, K. Kojima, and J. O'Donnell. Single-Chip Processor for Media Applications: The MAP1000TM. International Journal of Imaging Systems and Technology, 1999.
[Begel et al. 1999] A. Begel, S. McCanne, and S.L. Graham. BPF+: Exploiting Global Data-Flow Optimization in a Generalized Packet Filter Architecture. Proceedings of the ACM Communication Architectures, Protocols, and Applications (SIGCOMM ’99), 1999.
[Bershad et al. 1995] B.N. Bershad, S. Savage, P. Pardyak, E.G. Sirer, M.E. Fiuczynski, D. Becker, S. Eggers, and C. Chambers. Extensibility, Safety and Performance in the SPIN Operating System. Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (SOSP), December 1995.
[Bhoedjang et al. 1998] R.A.F. Bhoedjang, T. Ruhl, and H.E. Bal. User-level network interface protocols. Computer vol. 31, no. 11, pp. 53-60, Nov. 1998.
[Boden and Cohen 1995] N. Boden and D. Cohen. Myrinet -- A Gigabit-per-Second Local-Area Network. IEEE Micro, 15(1):29-36, 1995.
[Campbell et al. 2000] R.H. Campbell, L. Zhaoyu, M.D. Mickunas, P. Naldurg, and Y. Seung. Seraphim: dynamic interoperable security architecture for active networks. Proceedings of the IEEE 3rd Conf. on Open Arch. and Network Programming, pp. 55-64, 2000.
[Chen et al. 1998] Y. Chen, C. Dubnicki, S. Damianakis, A. Bilas, and K. Li. UTLB: A Mechanism for Address Translation on Network Interfaces. Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1998.
[Chien et al. 1998] A.A. Chien, M.D. Hill, and S.S. Mukherjee. Design Challenges for High-Performance Network Interfaces. IEEE Micro:42-44, 1998.
[Chiueh and Pradhan 1999] T.-C. Chiueh and P. Pradhan. High-Performance IP Routing Table Lookup Using CPU Caching. Proceedings of INFOCOM ’99, pp. 1421-1428, 1999.
[Chiueh and Pradhan 2000] T.-C. Chiueh and P. Pradhan. Cache Memory Design for Network Processors. Proceedings of the 6th Int’l Symp. on High-Performance Computer Architecture, January 2000.
[Cirrus Logic 2000] Cirrus Logic, Inc. New Open-Processor Platform Enables Cost-Effective System-on-a-Chip Solutions for Hard Disk Drives. http://www.cirrus.com/3ci, 2000.
[Cohen et al. 1992] D. Cohen, G. Finn, R. Felderman, and A. DeSchon. The ATOMIC LAN. Proceedings of the IEEE Workshop on the Arch. and Impl. of High Performance Communication Subsystems, 1992.
[Cormier et al. 1983] R.L. Cormier, R.J. Dugan, and R.R. Guyette. System/370 Extended Architecture: The Channel Subsystem. IBM J. Res. Develop., 27(3):206-218, 1983.
[Cranor et al. 1999] C.D. Cranor, R. Gopalakrishnan, and P.Z. Onufryk. Architectural Considerations for CPU and Network Interface Integration. Proceedings of the Hot Interconnects. Stanford, CA, 1999.
[Crisp 1997] R. Crisp. Direct Rambus Technology: The New Main Memory Standard. IEEE Micro vol. 17, no. 6, pp. 18-28, March 1997.
[Crowley et al. 2000] P. Crowley, M.E. Fiuczynski, J.-L. Baer, and B.N. Bershad. Characterizing Processor Architectures for Programmable Network Interfaces. Proceedings of the International Conference on Supercomputing, pp. 54-65. Santa Fe, N.M., May 8-11, 2000.
[CSIX 2000] CSIX. CSIX: The Common Switch Interface Consortium. http://www.csix.org/, 2000.
[Cuppu et al. 1999] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A Performance Comparison of Contemporary DRAM Architectures. Proceedings of the 26th Int’l Symp. on Computer Architecture, pp. 222-233, 1999.
[Dally et al. 1992] W.J. Dally, J.A.S. Fiske, J.S. Keen, R.A. Lethin, M.D. Noakes, P.R. Nuth, R.E. Davison, and G.A. Fyler. The Message-Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms. IEEE Micro:23-39, 1992.
[Decasper et al. 1999] D.S. Decasper, B. Plattner, G.M. Parulkar, C. Sumi, J.D. DeHart, and T. Wolf. A Scalable High-Performance Active Network Node. IEEE Network vol. 13, no. 1, pp. 8-19, Jan.-Feb. 1999.
[Doeringer et al. 1996] W. Doeringer, G. Karjoth, and M. Nassehi. Routing on Longest-Matching Prefixes. IEEE/ACM Trans. on Networking vol. 4, no. 1, pp. 86-97, Feb. 1996.
[Eicken et al. 1995] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. Proceedings of the 15th ACM Symp. on Operating Systems Principles, pp. 40-53, 1995.
[Engler and Kaashoek 1996] D.R. Engler and M.F. Kaashoek. DPF: Fast, Flexible Message Demultiplexing using Dynamic Code Generation. Proceedings of the ACM Communication Architectures, Protocols, and Applications (SIGCOMM ’96), 1996.
[Engler et al. 1995] D.R. Engler, M.F. Kaashoek, and J. O’Toole. Exokernel: an operating system architecture for application-level resource management. Proceedings of the 15th ACM Symp. on Operating Systems Principles, pp. 251-266, 1995.
[FCIA 2000] FCIA. Fibre Channel Technology Overview. The Fibre Channel Industry Association, http://www.fibrechannel.org/, 2000.
[Fiuczynski et al. 1998] M.E. Fiuczynski, R.P. Martin, T. Owa, and B.N. Bershad. SPINE: An Operating System for Intelligent Network Adapters. Proceedings of the Eighth ACM SIGOPS European Workshop, pp. 7-12. Sintra, Portugal, September 1998.
[Gibson et al. 1997] G. Gibson, D. Nagle, K. Amiri, F.W. Chang, E. Feinberg, H. Gobioff, C. Lee, B. Ozceri, E. Riedel, D. Rochberg, and J. Zelenka. File Server Scaling with Network-Attached Secure Disks. Proceedings of the SIGMETRICS, June 1997.
[Gray 1998] J. Gray. Put Everything in the Disk Controller. ’98 NASD workshop, http://research.microsoft.com/~gray/talks/Gray_NASD_Talk.ppt, 1998.
[Hennessy and Patterson 1996] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach, Second Edition. Morgan Kaufmann Publishers, 1996.
[Hicks et al. 1999] M. Hicks, J.T. Moore, D.S. Alexander, C.A. Gunter, and S.M. Nettles. PLANet: An Active Internetwork. Proceedings of the INFOCOM, pp. 1124-1133, 1999.
[Hill et al. 2000] M.D. Hill, N.P. Jouppi, and G.S. Sohi. Readings in Computer Architecture. First ed. Morgan Kaufmann, 2000.
[I2O Special Interest Group 1997] I2O Special Interest Group. Intelligent I/O (I2O) Architecture Specification v1.5. Available from www.i2osig.org, March 1997.
[Intel 2000] Intel Corp. Accelerated Graphics Port Technology. http://www.intel.com/technology/agp/index.htm, 2000.
[Keeton et al. 1997] K. Keeton, R. Arpaci-Dusseau, and D.A. Patterson. IRAM and SmartSIMM: Overcoming the I/O Bus Bottleneck. Proceedings of the Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997.
[Keeton et al. 1998] K. Keeton, D.A. Patterson, and J.M. Hellerstein. A Case for Intelligent Disks (IDISKS). SIGMOD Record vol. 27, no. 3, Nov. 1998.
[Kuskin et al. 1994] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH Multiprocessor. Proceedings of the 21st Int. Symp. on Computer Architecture, pp. 302-313, April 1994.
[LevelOne 1999] LevelOne, an Intel Company. IX Architecture Whitepaper. 1999.
[Menzilcioglu and Schlick 1991] O. Menzilcioglu and S. Schlick. Nectar CAB: a high-speed network processor. Proceedings of the 11th Int’l Conf. on Distributed Computing Systems, pp. 508-515, 1991.
[Merugu et al. 2000] S. Merugu, S. Bhattacharjee, E. Zegura, and K. Calvert. Bowman: A Node OS for ActiveNetworks. Proceedings of the INFOCOM 2000, pp. 1127-1136, 2000.
[Moore 1965] G.E. Moore. Cramming more components onto integrated circuits. Electronics , pp. 114-117, April1965.
[Morris et al. 1999] R. Morris, E. Kohler, J. Jannotti, and M.F. Kaashoek. The Click Modular Router.Proceedings of the 17th ACM Symp. on Operating Systems Principles, Dec. 1999.
[Mukherjee 1998] S.S. Mukherjee. Design and Evaluation of Network Interfaces for System AreaNetworks. Computer Science, pp. 189. University of Wisconsin, Madison, 1998.
[Mukherjee and Hill 1997]S.S. Mukherjee and M.D. Hill. A Case for Making Network Interfaces Less Peripheral.Proceedings of the Hot Interconnects V. Stanford, August 1997.
[Nayfeh et al. 1996] B.A. Nayfeh, L. Hammond, and K. Olukotun. Evaluation of Design Alternatives for aMultiprocessor Microprocessor. Proceedings of the 23rd International Symposium on ComputerArchitecture, pp. 67-77, May 1996.
[Nygren et al. 1999] E.L. Nygren, S.J. Garland, and M.F. Kaashoek. PAN: A High-Performance ActiveNetwork Node Supporting Multiple Mobile Code Systems. Proceedings of the 2nd Conf. on OpenArchitectures and Network Programming, pp. 78-89, 1999.
[Patterson et al. 1997] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas,and K. Yelick. A Case for Intelligent RAM: IRAM. IEEE Micro:34-44, 1997.
[Patterson and Keeton 1998] D. Patterson and K. Keeton. Hardware Technology Trends and DatabaseOpportunities. , Slides from SIGMOD’98 Keynote Address, 1998.
[Patterson et al. 1988] D.A. Patterson, G. Gibson, and R.H. Katz. A case for redundant arrays of inexpensivedisks (RAID). Proceedings of the ACM SIGMOD Conference. Chicago, IL, June 1988.
[Peterson et al. 1999] L. Peterson, S. Karlin, and K. Li. OS Support for General-Purpose Routers. Proceedingsof the HotOS Workshop, March 1999.
[Riedel 1999] E. Riedel. Active Disks - Remote Execution for Network-Attached Storage. Doctoral dissertation, Carnegie Mellon University, Tech. Report CMU-CS-99-177, Pittsburgh, PA, Nov. 1999.
[Riedel et al. 1998] E. Riedel, G. Gibson, and C. Faloutsos. Active Storage for Large-Scale Data Mining and Multimedia. Proceedings of VLDB, Aug. 1998.
[Rixner et al. 1998] S. Rixner, W.J. Dally, U.J. Kapasi, B. Khailany, A. López-Lagunas, P.R. Mattson, and J.D. Owens. A Bandwidth-Efficient Architecture for Media Processing. Proceedings of the 31st Int’l Symp. on Microarchitecture, pp. 3-13, Nov. 1998.
[Ruemmler and Wilkes 1994] C. Ruemmler and J. Wilkes. An introduction to disk drive modeling. IEEE Computer, vol. 27, no. 3, pp. 17-28, 1994.
[Scheiman and Schauser 1998] C.J. Scheiman and K.E. Schauser. Evaluating the Benefits of Communication Coprocessors. Journal of Parallel and Distributed Computing, 57(2):236-256, 1998.
[Schoinas and Hill 1998] I. Schoinas and M.D. Hill. Address Translation Mechanisms in Network Interfaces. Proceedings of the 4th Int’l Symp. on High Performance Computer Architecture, 1998.
[Seitz 1985] C.L. Seitz. The cosmic cube. Communications of the ACM, vol. 28, no. 1, pp. 22-33, 1985.
[Sirer et al. 1999] E.G. Sirer, R. Grimm, A.J. Gregory, and B.N. Bershad. Design and implementation of a distributed virtual machine for networked computers. Proceedings of the 17th ACM Symp. on Operating Systems Principles, pp. 202-216, Dec. 1999.
[Sitera 2000] Sitera Corp. The PRISM IQ2000 Network Processor Family. http://www.sitera.com, 2000.
[Smith et al. 1999] J.M. Smith, K.L. Calvert, S.L. Murphy, H.K. Orman, and L.L. Peterson. Activating Networks: A Progress Report. IEEE Computer Magazine, vol. 32, no. 4, pp. 32-41, April 1999.
[Smotherman 1989] M. Smotherman. A Sequencing-Based Taxonomy of I/O Systems and Review of Historical Machines. Computer Architecture News, 17(5):5-15, 1989.
[Steenkiste 1992] P. Steenkiste. Analysis of the Nectar Communication Processor. Proceedings of the IEEE Workshop on the Arch. and Impl. of High Perf. Comm. Subsystems, pp. 1-3, 1992.
[Tennenhouse and Wetherall 1996] D.L. Tennenhouse and D.H. Wetherall. Towards an Active Network Architecture. ACM Computer Communications Review, 26(2):5-18, 1996.
[Tullsen et al. 1995] D. Tullsen, S. Eggers, and H. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 392-403, Santa Margherita Ligure, Italy, June 1995.
[Uysal et al. 2000] M. Uysal, A. Acharya, and J. Saltz. Evaluation of Active Disks for Decision Support Databases. Proceedings of the 6th Int’l Symp. on High-Performance Computer Architecture, pp. 337-348, 2000.
[von Eicken and Vogels 1998] T. von Eicken and W. Vogels. Evolution of the Virtual Interface Architecture. IEEE Micro, pp. 61-68, November 1998.
[Walton et al. 1998] S. Walton, A. Hutton, and J. Touch. Efficient High-Speed Data Paths for IP Forwarding using Host Based Routers. Proceedings of the Ninth IEEE Workshop on Local and Metropolitan Area Networks, May 1998.
[Wang et al. 1999] R.Y. Wang, T.E. Anderson, and D.A. Patterson. Virtual Log Based File Systems for a Programmable Disk. Proceedings of the Third USENIX Operating System Design and Implementation Conference, New Orleans, LA, February 1999.
[Wetherall 1999] D. Wetherall. Active network vision and reality: lessons from a capsule-based system. Proceedings of the 17th ACM Symp. on Operating Systems Principles, pp. 64-79, Dec. 1999.
[Wetherall et al. 1999] D. Wetherall, J. Guttag, and D. Tennenhouse. ANTS: Network Services Without the Red Tape. IEEE Computer Magazine, vol. 32, no. 4, April 1999.
[Yang and Lebeck 2000] C.-L. Yang and A.R. Lebeck. Push vs. Pull: Data Movement for Linked Data Structures. Proceedings of the International Conference on Supercomputing, pp. 176-186, Santa Fe, NM, May 8-11, 2000.