WatsonBruce Ikadega sample 1

Introduction to DirectPath subsystems

PROPRIETARY and CONFIDENTIAL. NDA REQUIRED. 8/14/2001 10:54 AM Copyright Ikadega, Inc. All rights reserved.

Introduction to DirectPath subsystems (\\docs\techpubs\internal_docs\subsystem_intro.doc)

This document contains overview information on the DirectPath™ subsystems. The information comes from the Ikadega online documentation; the online version, not this document, is the one that will be kept current. This document will be updated from time to time from the online documents.

Information is currently available for a set of subsystems. More will follow over time.

Note: DirectPath is an evolving and changing system. This document describes the future vision for each subsystem – how it is expected to look at some future point (such as when the product first ships). Many parts of the design as described in this document have not yet been implemented.

Note: Underlined terms are defined in the Ikadega glossary.

Document contents

Internet delivery subsystem
  Component life cycles

The TV delivery and MPEG platform subsystems
  How hospitality systems work
  How ad insertion works
  The jukebox model
  The interactive model (hospitality only)

The volume and file access subsystems
  File system service layers
  Typical uses of the file system
  The UNIX file system
  The file access subsystem
    Access to smaller, named files
    Block search services for the volume access subsystem
  The volume access subsystem
    Aggregation
    Striping
  The hardware layer
  Checkpoints
    A simple example
    Checkpointing states and transitions
      Checkpointing states
      State transitions
  Replication
    Replication and checkpointing

Content transfer engine subsystem
  Inside the CTE



This document describes the subsystems in the groupings shown in the Introduction to DirectPath:

[Figure: DirectPath subsystem groupings. Group labels: Application subsystems, Core services, Platforms. Subsystem labels: Internet delivery, TV (MPEG) delivery, Manager, File access, Volume access, IP messaging, ITM, Traffic & array control, Content transfer engine (CTE), Content transfer engine extension (CTEX), Controller event engine (CEE), Open source environment (OSE), MPEG platform (MPP).]

See the introductory document for high-level descriptions of the subsystems. This document contains more detailed descriptions.

Internet delivery subsystem

The Internet delivery subsystem drives the process of sending content to Internet users. This picture shows its important components:

[Figure: the Internet delivery subsystem, containing HTTP CTDs and XCTDs, FTP CTDs and XCTDs, RTP CTDs and XCTDs, and others.]

The subsystem contains a large number of content transfer daemons (CTDs) – generally a separate one for each end-user session. (End users can have multiple sessions at the same time.) The CTDs all run the same code, but each has its own event queue and a small amount of private memory in external RAM.

A CTD’s primary function is data transfer. It has a limited set of commands and functions, but this allows it to perform them very efficiently. Most of the traffic it handles is bound for clients outside the DirectPath system. Its main task is to receive data from the fabric, then place this data into outgoing message frames for the Internet. The daemon is optimized for data transfer and does only minimal processing of the data.

Most CTDs have corresponding extended content transfer daemons (XCTDs). XCTDs handle the non-routine processing that the CTDs do not do – the more complex error and exception processing. CTDs and XCTDs communicate with each other via either ITM or IP messaging.

The CTDs run in an FPGA on an IP access node. XCTDs run in a supplemental processor, which is either on an IP access node or a supplemental processor node. The content transfer engine (CTE) is the logical platform for the CTDs. For the XCTDs, the platform is the content transfer engine extension (CTEX).

[Figure: the CTD runs in the FPGA, on the CTE platform; the XCTD runs in the supplemental processor, on the CTEX platform.]

In some applications there is one XCTD per CTD, but in other applications an XCTD might oversee several CTDs. For example, in a streaming video application, one XCTD might work with the following CTDs, each of which processes a different type of information:

[Figure: one XCTD working with an RTSP CTD, an HTML CTD, and an RTP CTD. Captions: for selecting titles to download; for negotiating transfer parameters (speed, format, etc.); for downloading selected titles.]

There are several categories of CTDs and XCTDs – one category for each Internet service supported by the system (HTTP, FTP, RTP, etc.). Event engines in the content transfer engine (CTE) alert CTDs when there are events for them to process. This is the flow of control for a single CTD/XCTD pair:

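[Figure: flow of control. An event engine in the CTE sends a dispatch signal to the CTD in the Internet delivery subsystem. The links between the CTD and XCTD are labeled "Exceptions and complex tasks" and "Commands and replies".]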



This is the overall environment in which a single CTD and XCTD pair operates to handle one user session:

[Figure: a CTD and XCTD together with the Web server daemons, the OSE OS, a file access session, a volume access session, traffic/array control, and the client (e.g., end user browser). Legend: component of the Internet delivery subsystem; Ikadega-developed DirectPath object not in the Internet delivery subsystem; third-party component not developed by Ikadega; external resource.]

The Web server daemons run in the DirectPath system’s open source environment (OSE). They must have a very specific server configuration to run effectively with the rest of the system.

The Web delivery client is an Internet application such as a Web browser or FTP program. In certain cases, there may also be one or more external resources for the subsystem to deal with. One example of this is a credit card validation/approval system in an e-commerce application. Depending on how complex the processing is, the interface to an external resource could be handled by either the CTD (if only simple processing is needed, such as reading cookies) or the XCTD (for more complicated processing).

In the initial versions of the system, the XCTD communicates directly with the Web server daemons. In future versions, this communication may instead go through the OSE’s operating system.

Component life cycles

The system creates a fixed-size pool of CTDs at boot time and allocates one to each new end-user session. The number of possible CTDs generally remains fixed. If the system exhausts the CTD supply, it cannot create new end-user sessions until a CTD is de-allocated. This protects the system against denial-of-service (DoS) attacks. By limiting the number of possible sessions, the system can continue running if it receives numerous session requests, though it may be temporarily unable to allow new sessions. (System operators can set the size of the CTD pool in the Web-based configuration utility.)

Since XCTDs run on a supplemental processor, which has an operating system and a richer execution environment, there is not a fixed set of XCTDs. The system creates new ones as needed.
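The pooling behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the DirectPath implementation: the class name CtdPool, its methods, and the pool size of 2 are all hypothetical.

```python
class CtdPool:
    """Fixed-size pool of CTD slots, created once at boot (hypothetical model)."""

    def __init__(self, size):
        # All CTDs exist from boot time; none are created later.
        self._free = list(range(size))
        self._in_use = {}  # session id -> CTD number

    def allocate(self, session_id):
        """Return a CTD number for a new session, or None if the pool is exhausted."""
        if not self._free:
            return None  # new sessions are refused until a CTD is de-allocated
        ctd = self._free.pop()
        self._in_use[session_id] = ctd
        return ctd

    def release(self, session_id):
        """De-allocate the CTD that served this session."""
        self._free.append(self._in_use.pop(session_id))


pool = CtdPool(size=2)
first = pool.allocate("session-a")
second = pool.allocate("session-b")
refused = pool.allocate("session-c")   # pool exhausted: the request is refused
pool.release("session-a")
retry = pool.allocate("session-c")     # succeeds once a CTD has been freed
```

Because the pool never grows, a flood of session requests can only exhaust the free list; it cannot consume more resources than were reserved at boot.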



This is the life cycle of a CTD/XCTD pair supporting a typical end user HTTP session (assuming a 1:1 relationship between CTDs and XCTDs):

1. When the system receives a request for a new user session, it allocates a CTD from the pool and an XCTD (creating a new XCTD if necessary). The system does handshaking with the client to determine the session type (HTTP, FTP, RTP, etc.). It configures the CTD/XCTD pair and initializes a context accordingly.

2. As described above, either the XCTD or CTD might take part in authorizing and validating the end user session.

3. Through an event engine in the content transfer engine (CTE), the CTD receives a client request to transfer a file. To the CTD, this is simply a command it is not programmed to process, so it passes it to the XCTD.

4. The XCTD receives the file transfer request and attempts to validate the transfer – checking to see if the requested file exists, if the end user is authorized to receive it, and so on. If it successfully validates the request, the XCTD generates a handle for the requested file and passes it to the CTD. Then it tells the CTD to transfer the file.

5. The CTD begins the process of requesting data and preparing it to go out to the Internet, using various services from other subsystems. At this stage, the XCTD only becomes involved if there is an error or an exception, or if processing is needed that the CTD does not know how to do. During the file transfer, the XCTD knows what file is being downloaded but does not know any details about the download, such as how much data has been sent so far. The CTD knows these details but does not know what file it is processing.

6. The CTD notifies the XCTD when the file transfer is done.

7. The previous steps repeat for each subsequent file download requested by the client.

8. When the client sends a request to end the session, the CTD passes it to the XCTD (again since the CTD is not programmed to process the request). The XCTD de-allocates itself and the CTD.
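The division of labor in steps 3 through 8 can be sketched as follows. This is a simplified model under stated assumptions: the class names, the command strings "GET" and "END", the in-memory catalog, and the 4-byte frame size are hypothetical and stand in for mechanisms this document does not specify.

```python
class Xctd:
    """Handles the requests the CTD is not programmed to process."""

    def __init__(self, catalog, authorized_users):
        self.catalog = catalog            # file name -> content (assumed store)
        self.authorized = authorized_users
        self.current_file = None          # the XCTD knows which file is in transfer
        self._handles = {}                # opaque handle -> content

    def handle(self, ctd, user, command, arg):
        if command == "GET":
            # Step 4: validate the request, then hand the CTD a handle.
            if arg not in self.catalog or user not in self.authorized:
                return "rejected"
            handle = len(self._handles) + 1
            self._handles[handle] = self.catalog[arg]
            self.current_file = arg
            ctd.transfer(handle, self._handles[handle])
            return "transferred"
        if command == "END":
            # Step 8: the real XCTD would de-allocate itself and the CTD here.
            return "session closed"
        return "unsupported"

    def transfer_done(self, handle):
        # Step 6: the CTD reports completion.
        self._handles.pop(handle, None)
        self.current_file = None


class Ctd:
    """Moves data; forwards anything else to its XCTD."""

    def __init__(self, xctd):
        self.xctd = xctd
        self.bytes_sent = 0   # the CTD knows progress but sees only a handle

    def on_event(self, user, command, arg=None):
        # Step 3: client commands are outside the CTD's small command set.
        return self.xctd.handle(self, user, command, arg)

    def transfer(self, handle, data, frame_size=4):
        # Step 5: place the data into outgoing frames.
        for start in range(0, len(data), frame_size):
            frame = data[start:start + frame_size]
            self.bytes_sent += len(frame)
        self.xctd.transfer_done(handle)


xctd = Xctd({"index.html": b"0123456789"}, {"alice"})
ctd = Ctd(xctd)
ok = ctd.on_event("alice", "GET", "index.html")
missing = ctd.on_event("alice", "GET", "nope.html")
denied = ctd.on_event("bob", "GET", "index.html")
closed = ctd.on_event("alice", "END")
```

Note that the CTD never sees the file name and the XCTD never sees the byte count, which mirrors the split of knowledge described in step 5.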



The TV delivery and MPEG platform subsystems

The DirectPath TV delivery and MPEG platform subsystems are closely tied together. This document describes them both. These two subsystems deliver digital video to support two DirectPath applications:

• Hospitality – A hospitality system provides in-room, on-demand video content to local end users. (Future hospitality systems may also support in-room Web browsing, as described later in this document.)

• Ad insertion – In this application, a customer such as a cable TV provider uses a DirectPath system to insert their own advertisements or other content into a video signal sent to cable subscribers.

Media server is Ikadega’s name for a DirectPath system used in either of these applications.

How hospitality systems work

In a typical hospitality application, one or more DirectPath media servers deliver digital movies to hotel guests. The content can also include things like short advertising videos for other nearby businesses.

The following picture shows the devices involved in hospitality delivery. The only Ikadega-supplied component is the media server. The customer supplies and manages the rest.

[Figure: the media server (DirectPath system), the end user agent, the facility cable plant, and the TV set in the end user's room.]



To watch a movie, the user interacts with the customer’s end user agent system rather than with DirectPath. This is the normal sequence of events when an end user wants to use hospitality services:

1. The end user, on an in-room TV, makes a request to watch a movie or other program (via an input device such as a remote control).

2. The customer’s end user agent receives this request and queries the media server for information on the available content.

3. The media server sends the end user agent data on all the selections available, including the title, running time, description, rating, etc. for each content file.

4. The end user agent takes this information to display menus and help the end user make a selection. The agent also makes any necessary billing arrangements.

5. When the guest makes a selection, the end user agent directs the media server to begin playback of the requested program to a specific media server port. The end user agent tunes the user’s TV to the correct channel to receive the program. This channel change is invisible to the end user – it does not change the channel number displayed on the TV.

6. The media server plays the selection as requested. It sends the signal directly to the end user’s TV through the building’s cable plant. The user may pause or halt playback at any time. The only involvement the end user agent has during this phase is to pass any pause/restart/stop commands to the media server.

7. The media server notifies the end user agent when playback is done.

Communication from the agent goes through an end user agent proxy. This is an application that runs in the open source environment. It handles communication between the end user agent and the DirectPath controller (DPC). The DPC takes action as appropriate, which often affects the TV access node.

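[Figure: an external system, the proxy task (in the OSE), the DPC, and the TV access node. The proxy task, DPC, and TV access node are inside the media server.]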



How ad insertion works

Ad insertion allows a local cable company to substitute its own commercials (usually for local businesses) for those in the input broadcast (which are often made for a national audience). This picture shows the major components involved:

[Figure: the cable TV head end, the A/B switch, and the subscribers, together with the ad scheduler, content loader, ad source, and media server (DirectPath system). Labeled flows: signal, "Go" command, new ads.]

Numerous channels of content arrive at the cable TV head end. If there is no ad insertion happening for a particular channel, that channel’s signal passes unchanged through the A/B switch and on to the subscribers watching that channel. However, when the head end receives notification that a commercial is about to start, it signals the ad scheduler system.

The ad scheduler has a list of the commercials stored on the media server. It decides whether to replace the national ad with one of these commercials, and then it chooses the commercial to run. The scheduler sends a command to the media server to play that ad on the specified channel. The ad scheduler also uses a proxy task to communicate with the media server.

The media server immediately begins to play the commercial. The A/B switch replaces the signal coming from the head end with the media server’s output signal. The subscribers watching that channel see the ad being played by the media server.

From time to time, the content loader receives new digital ads. It passes them on to the media server for storage, and it also notifies the ad scheduler, so the scheduler has a current list of which commercials are stored in the media server. The ad scheduler and content loader can run on the same machine or different machines.
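The interaction among these components can be modeled in a short sketch. All names here are hypothetical, the ad selection policy (always the first stored ad) is a placeholder, and signals are represented as plain strings; the sketch shows only who tells whom what.

```python
class MediaServer:
    def __init__(self):
        self.stored_ads = set()
        self.output = {}            # channel -> ad currently playing

    def store(self, ad):
        self.stored_ads.add(ad)

    def play(self, ad, channel):
        if ad not in self.stored_ads:
            raise KeyError(ad)
        self.output[channel] = ad

    def finish(self, channel):
        self.output.pop(channel, None)


class AdScheduler:
    def __init__(self, media_server):
        self.media_server = media_server
        self.available = []         # list kept current by the content loader

    def ad_loaded(self, ad):
        self.available.append(ad)

    def commercial_starting(self, channel):
        """Return True if a local ad replaces the national one."""
        if not self.available:
            return False            # keep the national ad
        self.media_server.play(self.available[0], channel)  # placeholder policy
        return True


def load_ad(ad, media_server, scheduler):
    """Content loader: store the ad, then tell the scheduler about it."""
    media_server.store(ad)
    scheduler.ad_loaded(ad)


def ab_switch(channel, head_end_signal, media_server):
    """Pass the head-end signal through unless the media server is playing an ad."""
    return media_server.output.get(channel, head_end_signal)


server = MediaServer()
scheduler = AdScheduler(server)
before = scheduler.commercial_starting(5)          # nothing stored yet
load_ad("local-pizza", server, scheduler)
replaced = scheduler.commercial_starting(5)
during = ab_switch(5, "national-feed", server)
other = ab_switch(6, "national-feed", server)      # untouched channel
server.finish(5)
after = ab_switch(5, "national-feed", server)
```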



The jukebox model

The early versions of the media server are designed around a jukebox model, in which it plays the selection it’s told to by an outside system (the end user agent or ad scheduler). A later section of this document describes the interactive model, to be implemented sometime in the future. Whether it’s playing movies or inserting ads, the DirectPath system has hardware and software in its TV access nodes to support video playback:

[Figure: a TV access node with sub-node 0, sub-node 1, and so on. Sub-node labels: microprocessor, MPEG drivers, OS-9, DAVID, application, MPEG decoder (and related components), TV signal. A node-fabric interface carries control data and the MPEG image stream. Legend: Ikadega component; third-party component.]

There are eight sub-nodes on a TV access node, each of which produces one video signal. Notice that one node-fabric interface (NFIF) handles fabric communication for all of them. Most of the data arriving at the NFIF is the video data, which it passes directly to the appropriate MPEG decoder rather than to the microprocessor. (This is similar in philosophy to how DirectPath storage nodes pass data directly to access nodes without going through the DirectPath controller.) The data going from the microprocessor to the node-fabric interface includes requests for more content data from the storage nodes.

The Ikadega application running in the microprocessor would work on tasks such as closed captioning, providing visuals to accompany audio-only content, and superimposing text or graphics over the video for weather warnings, logos and other images. The MPEG decoder’s “related components” from the previous drawing include logic to support internal MPEG transport, superimposing, and audio-video mixing.

Subsystem information: the Ikadega microprocessor application and end user agent proxy are part of the TV delivery subsystem, while the other sub-node components are in the MPEG platform subsystem.



The interactive model (hospitality only)

In future versions of hospitality systems, the user will interact with a Web browser running in the media server. This provides a more appealing and functional selection system than that provided by the original end user agent, which tends to be character-oriented. The design might look like this:

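[Figure: a TV access node in the interactive model, with sub-node 0, sub-node 1, and so on. Sub-node labels: microprocessor, drivers, OS-9, DAVID, browser, applet (created by VAR), MPEG decoder (and related components), TV signal. A node-fabric interface carries control data and the MPEG image stream. Legend: Ikadega component; third-party component.]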

These are the major differences in the interactive model:

• End users will be able to go on the Internet from their rooms.

• End users who would rather watch a movie than go on the Internet will select content via an applet running in a browser in the microprocessor. The customer or VAR will probably create this applet.

• Since the end user will use the browser to make content selections, the end user agent has a reduced role – it simply passes keystrokes between the end user and the browser.

• There is no interactive model for ad insertion.

Subsystem information: the applet and end user agent proxy are in the TV delivery subsystem, while the other sub-node components are in the MPEG platform subsystem.



The volume and file access subsystems

The volume access subsystem and file access subsystem are the DirectPath file system. These subsystems support the reading and writing of data on storage node hard drives.

The file system can accommodate a wide range of uses. In some applications, such as hospitality systems that primarily play back movies to locally connected TV sets, the file system holds a relatively small number (in the range of hundreds) of very large files. These files do not change very frequently, and owners load new files relatively infrequently (say on a daily or weekly basis). Other customers, however, will use DirectPath to host and deliver Web sites. These customers need a file system that can handle large numbers (in the tens of thousands) of small files that change relatively frequently. Between these two extremes are customers like an online music service, who must deliver one set of small files (say the Web pages where users select songs to download) and another set of fairly large ones (the actual MP3 song files). The DirectPath file system has flexibility to accommodate these varying uses in one design.

The file system consists of two DirectPath subsystems:

• Volume access subsystem – In DirectPath, a volume is a logically continuous set of disk sectors. The volume access subsystem is unaware that some volumes contain multiple files.

• File access subsystem – A DirectPath file is a named portion of a volume. File accesses go through the volume access subsystem.

Notice from this drawing that all disk accesses go through the volume access subsystem, either directly or through the file access subsystem:

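[Figure: a client task accessing a file goes through the file access subsystem and then the volume access subsystem; a client task accessing a volume goes to the volume access subsystem directly.]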



File system service layers

You can think of the DirectPath file system as a collection of services divided into the following layers:

• Applications – Request and work with the data.

• File subsystem – Identifies and manages named data files. Implemented here: block search services for the volume layer. Used here: checkpoints.

• Volume subsystem – Locates and places the data on disk. Implemented here: aggregation, replication, striping, checkpoints. Used here: block search services from the file layer.

• Hardware layer – Reads and writes the data.

The drawing also includes the UNIX file system.

The remaining introductory pages describe these components, from the higher-level directory layer and UNIX file system to the low-level hardware layer.

Typical uses of the file system

DirectPath can actually support multiple file systems running concurrently. The Ikadega-supplied file system can run together with the UNIX file system. It can also exist in the same machine as an optional customer-defined file system.

Below are some examples of how customers could use the Ikadega-supplied file systems:

• Local large content delivery, where the system delivers very large files to nearby users – for example, movies to hotel guests. In this scheme, there usually is only one company providing the content. Since the data for a content file is not likely to change, the content rarely if ever goes through different versions. What does change over time is the set of movies available – new ones are added and older ones might be removed. The volume access subsystem provides the services for this type of use. In this type of system, the file access subsystem exists but is essentially empty – it just passes I/O requests to the volume access subsystem with little or no processing.



• Internet delivery, where the system hosts numerous Web sites containing various file types, from small files to movies. The content on these sites comes from a number of content providers, and from time to time the system owner may need to find out who created a certain file. While some of these files may be as large as the movies described above, there are probably also a number of small files. The system must be able to locate and process all of these files. It must also be able to deal with them being replaced frequently. In applications like this, the system relies on the services of the volume and file access subsystems. The file access subsystem processes these files.

• UNIX file access, described in the next section.

Most DirectPath systems have a mixture of these file types.

The UNIX file system

A complete UNIX file system may exist in DirectPath to support legacy applications and other programs that need UNIX services. One example of this is the system event logging, which could be implemented by using UNIX logging services. Also, the Manager subsystem, designed to be as UNIX-like as possible, uses UNIX services.

The UNIX file system can perform reads and writes on its own private disk (a disk invisible to the other file systems), or it can use the volume access subsystem for disk access, or it can do both. The private disk is currently used in system booting. When it uses the volume access subsystem for disk access, the UNIX file system has its own volume, which it thinks is an entire disk. It isn’t aware that the volume access subsystem is even there.

Possible future directions: In media server applications that mainly deliver digital video, it may be possible to use the UNIX file system as the only file service, without using the volume or file access subsystem. (The file access subsystem doesn’t do much in these applications anyway.) It’s also possible, though, that the file access subsystem might take over all the functions of the UNIX file system in future versions of the system.

The file access subsystem

The file access subsystem is the highest layer in the file system hierarchy. The nature of its processing depends on the type of data being processed. The subsystem is mostly transparent when processing large content files such as digital movies or music – the volume access subsystem does most of the work on these files. The file access subsystem becomes important when the system works with numerous, small content files. For example, if a customer uses a DirectPath system for hosting Web sites, each site will have a number of relatively small files, and the file access subsystem would process the individual files in the site (HTML, GIF, etc.).

One key feature of the DirectPath file system is that virtually all of the content transfer from disk happens in the volume access subsystem rather than the file access subsystem. This gives the system a speed advantage over traditional file servers.



Access to smaller, named files

The volume access subsystem sees large blocks of data with no internal structure – for example, digital movie files that are delivered to users from beginning to end. The file access subsystem gives the system access to many smaller named files, such as the files that make up a Web site.

[Figure: where the volume access subsystem sees one large volume, the file access subsystem might see a number of smaller named files (f1, f2, f3, f4, f5, ... fn).]

Block search services for the volume access subsystem

To find files, the file system has several different directories:

• Inode directory – an inode is a system data structure that describes a file. An application references a file by giving the file system an inode number.

• URL directory – this is a table that maps URLs to inodes, in effect providing URL “names” for the inodes.

• Traditional file system directory – another inode mapping table, but one that mimics the hierarchical tree structure of subdirectories and files commonly used in PCs and UNIX machines. These directories also point to (and “name”) inodes.
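The relationship among the three directories can be illustrated with ordinary Python dictionaries. This is a sketch only: the Inode fields, the example URL, the paths, and the inode numbers are hypothetical, and the real directories are on-disk structures this document does not describe in detail.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    number: int
    volume: str    # volume holding the file (assumed field)
    offset: int    # starting byte offset within the volume (assumed field)
    length: int


# Inode directory: inode number -> inode.
inode_directory = {
    7: Inode(7, "web-volume", 0, 2048),
    8: Inode(8, "web-volume", 2048, 512),
}

# URL directory: a flat table that "names" inodes by URL.
url_directory = {"http://example.com/index.html": 7}

# Traditional directory: a tree of subdirectories whose leaves are inode numbers.
tree_directory = {"sites": {"example": {"index.html": 7, "logo.gif": 8}}}


def lookup_url(url):
    return inode_directory[url_directory[url]]


def lookup_path(path):
    node = tree_directory
    for part in path.strip("/").split("/"):
        node = node[part]           # descend one directory level
    return inode_directory[node]    # the leaf is an inode number


by_url = lookup_url("http://example.com/index.html")
by_path = lookup_path("/sites/example/index.html")
logo = lookup_path("/sites/example/logo.gif")
```

Both lookups end at the same inode, which is the point of the design: URLs and path names are just two naming schemes layered over one inode table.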

There are three basic methods for reading content, depending on the nature of the files involved:

• Locate method – the client task wants to read from a certain offset into a volume, which it has a handle to. This method is for large files. Here is a typical sequence of events:

[Figure: numbered message flow among the client, the file system session, the volume system session, and the storage node.]

1: Open request (client sends either a file name or URL).
2: Open reply (returns a handle to the file, if found).
3: Locate request.
4: Locate reply (returns a map of the file’s block segments).
5: Volume read request.
6: Sector read request.
7: Data transfer to client buffer (an RDMA transfer).
8 & 9: Request replies.
10: Close request.
11: Close reply.

Steps 5 through 9 repeat until the client has received the entire file.

• Whole file method – for quick access to files small enough to be fully retrieved in one read operation (such as Web site files). One benefit of this method is that there are no file open or close operations.

[Figure: numbered message flow among the client, the file system session, and the storage node.]

1: Read file request (client sends either a file name or URL).
2: File system session passes read request along.
3: Sector read request.
4: Data transfer to client buffer (an RDMA transfer).
5, 6, 7: Request replies.

• Traditional method (with Ikadega enhancements) – this method supports file reads as done on a UNIX system. The method also supports traditional file operations such as renaming, setting permissions, etc., and it supports DAFS.

[Figure: numbered message flow among the client, the file system session, and the storage node.]

1: File open request.
2: Open request reply.
3: File read request.
4: Read request passed along.
5: Sector read request.
6: Data transfer to client buffer (an RDMA transfer).
7, 8, 9: Request replies.
10: File close request.
11: File close reply.



The “Ikadega enhancements” mentioned above include the direct RDMA content transfer from the storage node to the client. Traditional file systems would send the content to the client through the volume and file system sessions.

For added flexibility, clients may shift between the locate and traditional methods with the same file handle.
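As an illustration, the locate method's message sequence (open, locate, repeated volume reads, close) might look like this from the client's side. The classes and method names are hypothetical stand-ins for the sessions described above, segments are modeled as (start, length) pairs, and a bytearray append stands in for the RDMA transfer into the client buffer.

```python
import itertools


class StorageNode:
    def __init__(self, sectors):
        self.sectors = sectors   # raw disk contents as bytes

    def read(self, start, length, client_buffer):
        # Data goes straight to the client buffer, not back through the sessions.
        client_buffer.extend(self.sectors[start:start + length])


class VolumeSession:
    def __init__(self, node):
        self.node = node

    def read(self, segment, client_buffer):
        start, length = segment
        self.node.read(start, length, client_buffer)   # sector read request


class FileSession:
    def __init__(self, directory):
        self.directory = directory   # name or URL -> list of (start, length)
        self.open_handles = {}
        self._next_handle = itertools.count(1)

    def open(self, name):
        if name not in self.directory:
            return None
        handle = next(self._next_handle)
        self.open_handles[handle] = name
        return handle

    def locate(self, handle):
        return self.directory[self.open_handles[handle]]

    def close(self, handle):
        del self.open_handles[handle]


def read_with_locate(file_session, volume_session, name):
    handle = file_session.open(name)               # steps 1 and 2
    if handle is None:
        return None
    buffer = bytearray()
    for segment in file_session.locate(handle):    # steps 3 and 4
        volume_session.read(segment, buffer)       # steps 5 through 9, repeated
    file_session.close(handle)                     # steps 10 and 11
    return bytes(buffer)


node = StorageNode(b"AAAABBBBCCCCDDDD")
files = FileSession({"movie": [(0, 4), (8, 4)]})
volume = VolumeSession(node)
result = read_with_locate(files, volume, "movie")
not_found = read_with_locate(files, volume, "missing")
```

After the locate reply, the file system session is out of the data path; the client talks to the volume session until it closes the file.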

Note: These drawings assume that the directories are fully cached in memory and that files are stored contiguously on disk.

The volume access subsystem

The volume access subsystem supports the file access subsystem and UNIX file system. Volumes are logical collections of sectors, often organized into categories of content stored on the system, such as the top N most popular titles and the other less-popular files. Most volumes contain large content files sized from the hundreds of megabytes to gigabytes and beyond. Volumes generally have fewer attributes than files – they do not have items such as access permission data, modification and access dates, checkpoint information, etc.

The volume access subsystem is where you first start to see disk organization. A disk has one or more disk slices, each of which contains partitions. Partitions cannot cross disk slice boundaries. There also is a partition descriptor for each partition in a slice.

[Figure: disks divided by disk slice boundaries into disk slices, each containing partitions and their partition descriptors.]

Disk slices help support the work of offline utility applications. One example of such an application is a program that pre-loads content before disks are shipped. The system formats disk slices like conventional operating system partitions.

Note to readers who are familiar with the system’s traffic shaping components: You can think of the volume access subsystem as a part of the components responsible for storage array control and fabric traffic.



Aggregation

One method for splitting volumes into partitions is aggregation. This is simply breaking the content into partitions, which can reside on different disks or storage nodes.

[Figure: an original volume of 100 gigabytes split into a 40 GB partition on disk 2 and a 60 GB partition on disk 8.]

With checkpointing or replication, the aggregation can have different boundaries for each version:

[Figure: two versions of the volume with different aggregation boundaries: 40 GB + 60 GB, and 40 GB + 30 GB + 30 GB.]
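To make the mapping concrete, here is a minimal sketch of how an offset into an aggregated volume could be translated to a disk and an offset within that disk's partition. The function name and the list-of-tuples layout are assumptions for illustration; the disk names in the second layout are hypothetical, and 1 GB is taken as 10^9 bytes.

```python
GB = 10**9

# The 100 GB volume from the example: 40 GB on disk 2, 60 GB on disk 8.
version_1 = [("Disk 2", 40 * GB), ("Disk 8", 60 * GB)]

# Another version of the same volume with different boundaries
# (disk names here are hypothetical).
version_2 = [("Disk A", 40 * GB), ("Disk B", 30 * GB), ("Disk C", 30 * GB)]


def locate(partitions, volume_offset):
    """Map a volume offset to (disk, offset within that disk's partition)."""
    if volume_offset < 0:
        raise ValueError("offset must not be negative")
    start = 0
    for disk, size in partitions:
        if volume_offset < start + size:
            return disk, volume_offset - start
        start += size
    raise ValueError("offset is beyond the end of the volume")


first_byte = locate(version_1, 0)
boundary = locate(version_1, 40 * GB)
same_offset_v2 = locate(version_2, 75 * GB)
```

The same volume offset can land on different disks in different versions, which is why each version needs its own partition map.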



Striping

Striping is another way the system splits content files into partitions. It is a disk storage technique that helps to protect against lost content by splitting a content file into equal-length blocks called stripes. The system stores these stripes on N different storage nodes (here N = 4), along with an additional stripe described below:

[Figure: the original volume contents, stripes 0 through 7, spread across storage nodes 1 through 4. Storage node 5 holds the parity stripes 0^1^2^3 and 4^5^6^7 (^ = exclusive OR).]

Each byte in the parity stripe (at the bottom of the drawing) is the result of an exclusive OR logic operation on the bytes in the corresponding stripes. For example, the first byte of the parity stripe is the result of an exclusive OR performed on the first bytes of stripes 0 through 3. If the system can’t read one of the stripes (say if there is a disk or storage node error), it can re-create the lost data by performing an exclusive OR on the corresponding bytes of the parity stripe and the remaining stripes.
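The parity calculation and the recovery of a lost stripe can be shown with a short worked example. The helper names, the sample data, and the 4-byte stripe size are arbitrary choices for illustration; real stripes would be far larger.

```python
from functools import reduce


def xor_bytes(blocks):
    """Byte-wise exclusive OR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def split_stripes(data, stripe_size):
    """Cut the content into equal-length stripes."""
    return [data[i:i + stripe_size] for i in range(0, len(data), stripe_size)]


# 16 bytes of content become stripes 0 through 3, one per storage node.
stripes = split_stripes(b"DIRECTPATHSTRIPE", 4)

# The parity stripe would be stored on a fifth storage node.
parity = xor_bytes(stripes)

# Suppose the node holding stripe 2 fails.
lost = 2
survivors = [s for i, s in enumerate(stripes) if i != lost]

# XOR of the surviving stripes and the parity stripe yields the lost stripe.
rebuilt = xor_bytes(survivors + [parity])
```

This works because XOR is its own inverse: combining the parity with every stripe except one cancels those stripes out and leaves the missing one. It recovers from the loss of a single stripe per parity group, not more.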

The hardware layer

This layer contains the hard disks and their controlling hardware and software, all located on storage nodes. The layer doesn’t know the meaning of the data it reads and writes. It just responds to specific commands. Most of the work it does is read operations, but it does write to disk as well, to load new content or make copies of volumes.

Every storage node has multiple sub-nodes (two of them at present), each of which controls one ATA-type hard disk drive. The sub-nodes have custom disk-controlling hardware as well as interfaces to the fabric. Since the whole DirectPath system is designed to keep the disk drives as busy as possible, the system makes heavy demands on the drives, and there is very little room for malfunctions or disk errors. Field service people can replace disks “on the fly” (while the rest of the system continues to run and deliver content) to remove a faulty disk, install a drive with more capacity, or insert a disk pre-filled with new content.



To assist the disk activity scheduler, DirectPath maintains a set of performance history data on each disk. This data reflects the actual performance of each individual drive (rather than the specifications for the drive type).

Checkpoints

Checkpoints allow the system to keep multiple file versions on disk, mostly to ensure read consistency: each end user gets files from the same file set. This is useful in many applications, especially with frequently updated files such as Web files. If a site is popular, a number of people might be using it when one of its periodic file updates arrives. If a site changes frequently, at some point there will be users with files open from several revisions ago, especially users with slow Internet connections. With checkpointing, the new and older versions of the site co-exist while the system loads new files to the disk drives.


The files for a new checkpoint become available to users when the file system commits them. A commit operation updates the partition descriptors for the volume. New users aren’t able to use the new checkpoint until all of the new files are committed successfully. Users that had site files open at the start of the update continue to see only files from the checkpoint that was most current when they started their sessions. If the system halts during the loading stage (before it can commit a new checkpoint), the checkpoint and its new content are lost. The system retains the previously committed checkpoints, though.

The DirectPath customer can specify how many checkpoints to keep for each volume. The system generally re-uses the storage space of expired versions. This can be a rapid process – on some systems that change content very quickly, the resources for a replaced checkpoint may be re-used in as little as 4 minutes.

The checkpoint feature is implemented in the volume access subsystem, but DirectPath only uses it on the smaller named files processed by the file access subsystem. At any given moment, a checkpointed volume has files from 0 to n checkpoints available, and it may also have a new checkpoint in progress.

A simple example

To understand checkpointing, take an example volume that only has five files. (The example is small and unrealistic, but it demonstrates the basics of how checkpointing works.) Suppose there is a DirectPath system that hosts Web sites, and it receives five files for the initial version of a site. The files are a.html, b.html, c.html, d.gif, and e.gif.



When the new files arrive, if they are to go in a new volume, the application software allocates a certain amount of space for the volume. At this point there is officially nothing in the volume – none of the blocks is committed, though the file system may have started loading the files to disk.

[Figure: data being stored to disk]

At this stage, the files for the example Web site are present on their way to disk, but they are not yet available to users.


When the files are stored successfully, the file system can commit the checkpoint. After it does this, new users open the files from this first checkpoint:

a.html (1..latest)
b.html (1..latest)
c.html (1..latest)
d.gif (1..latest)
e.gif (1..latest)
...

Most recently committed: 1
Oldest retained checkpoint: 1

The (m..n) notation indicates the checkpoints each file belongs to. (The system does not store this information with the files, however – it maintains it in memory only.) Oldest retained checkpoint and Most recently committed are two variables the system uses to keep track of a volume’s checkpoints.



Now, suppose sometime later there is a change to the a.html file, where a completely new version of the file replaces the first version. The system stores the new version in the first available free space, and then commits it:

a.html (1) [Invalidated]
b.html (1..latest)
c.html (1..latest)
d.gif (1..latest)
e.gif (1..latest)
a.html (2..latest) [New]
...

Most recently committed: 2
Oldest retained: 1

At this point, the volume contains files from checkpoints 1 and 2. The commit invalidates the first version of a.html, which means that the file is still there but is no longer in the latest checkpoint. However, the file is still valid for sessions using checkpoint 1, if any. If certain conditions are met later (see below), the file system could eventually re-allocate the storage used by this first version of a.html. The users are not aware of the checkpointing or the different file versions.



Suppose now that there are two file changes for the next checkpoint. The site owner changes the b.html file, which had been the only file referencing d.gif. The new b.html no longer uses the graphic file. In addition to invalidating the old b.html, the system invalidates d.gif: the file does not apply to the new checkpoint or the ones that follow it (unless one of the HTML files is changed to refer to it again). d.gif and the original b.html are still valid for users of checkpoints 1 and 2. Here’s what the volume looks like after the commit:

a.html (1)
b.html (1..2)
c.html (1..latest)
d.gif (1..2)
e.gif (1..latest)
a.html (2..latest)
b.html (3..latest)
...

Most recently committed: 3
Oldest retained: 1
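To make the (m..n) notation concrete, the sketch below lists which file versions a session sees at a given checkpoint, using the volume contents after checkpoint 3. The tuple layout and the visible_at function are hypothetical, and None is used here to stand for "latest"; the system itself keeps this information in memory in a form this document does not describe.

```python
# (name, first checkpoint, last checkpoint); None stands for "latest".
files = [
    ("a.html", 1, 1),
    ("b.html", 1, 2),
    ("c.html", 1, None),
    ("d.gif", 1, 2),
    ("e.gif", 1, None),
    ("a.html", 2, None),
    ("b.html", 3, None),
]


def visible_at(files, checkpoint):
    """Return (name, first checkpoint) for each version a session using
    `checkpoint` can open."""
    return sorted(
        (name, first)
        for name, first, last in files
        if first <= checkpoint and (last is None or checkpoint <= last)
    )


session_on_1 = visible_at(files, 1)
session_on_3 = visible_at(files, 3)
```

A session on checkpoint 1 still sees the original five files, while a session on checkpoint 3 sees the newer a.html and b.html and no d.gif.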

What eventually happens to the previous version of b.html, d.gif, and other invalidated files depends on the customer’s file allocation policies. The system’s operators probably want to keep files from at least some of the previous checkpoints, in which case these files would remain there unchanged. However, to keep disk clutter down, most customers also want to limit the number of checkpoints remaining on disk. So if a customer chooses a checkpoint limit, it affects what the file system does when there is a new checkpoint. If the following conditions are both true, then the file system could mark a file reclaimable (discarded and available for re-use by new files):

• There are currently no sessions using the checkpoint in question, and

• The file’s checkpoint number is no newer than the new checkpoint’s number minus the checkpoint limit (for example, with a limit of 5 and committing checkpoint 8, a file from checkpoint 3, 2, or 1 qualifies)

The file system marks a file as reclaimable only if both conditions are true for it. If the file system marks a file as reclaimable, its blocks no longer contain valid data, though the system might not re-use them right away.
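The two-part test can be written as a short function. This is a sketch only: the function and parameter names are hypothetical, and the cutoff is treated as inclusive so that it agrees with the worked example (limit of 5, committing checkpoint 8, files from checkpoints 3, 2, and 1 qualify).

```python
def is_reclaimable(file_checkpoint, new_checkpoint, checkpoint_limit, active_sessions):
    """Return True only if both reclaim conditions hold.

    active_sessions is the number of sessions still using the file's
    checkpoint (a hypothetical input; the source does not say how the
    system counts them).
    """
    no_users = active_sessions == 0
    # Inclusive cutoff, chosen to match the worked example (an assumption).
    too_old = file_checkpoint <= new_checkpoint - checkpoint_limit
    return no_users and too_old
```

With a limit of 5 and checkpoint 8 being committed, a file from checkpoint 3 with no active sessions is reclaimable, a file from checkpoint 4 is not, and a file from checkpoint 1 that still has sessions is not.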



Checkpointing states and transitions

This document describes some of the data the DirectPath file system uses to process checkpoints. It also shows how these variables change in response to various checkpoint events.

Note: The information in this document applies only to customer environments with relatively small numbers of content providers. This document does not apply to environments where there are numerous content providers.

To support checkpointing, the file system (through the volume access subsystem and file access subsystem) maintains the following variables:

• For each volume:

o OldestRetained – this is the number of the earliest checkpoint the system must still honor (retain the files for). Recall from About checkpoints that some end users could still be using files from one or more previous checkpoints.

o LatestCommitted – the number of the most recently committed checkpoint for a volume.

o OldestBeingUsed – the oldest checkpoint that still has active user sessions.

• For each file:

o Modified – this is the number of the checkpoint containing the latest version of a particular file.

o Invalidated – the checkpoint when the file was removed from the latest checkpoint, though it still may be in use by active user sessions.

Note: The DirectPath file system currently tracks these two variables for each file. It could in the future track them by block instead.

Checkpointing states

The following table shows the different states a file can be in. Note, though, that the DirectPath file system (DPFS) does not store these file states anywhere. The state of each file is implied by the values of the above variables. The reason for this is system performance and consistency: if a checkpoint that affects, say, 200 files aborts before committing, the system would be slowed down by first marking and then un-marking all 200 files. Also, if the system halts in the middle of a 200-file update, the file system would not be able to correctly tell which files are members of each checkpoint when it comes back up.



There are abbreviations in the tables: F for file and V for volume. For example, F.Modified means the Modified variable for a file, and V.LatestCommitted is the LatestCommitted value for a volume.

Implied state | Values causing that state | Comment

Free | F.Invalidated <= V.OldestRetained | When the DirectPath file system wants to create new files for a checkpoint, it first allocates free space for them.

Newly allocated | F.Modified > V.LatestCommitted; F.Invalidated == NULL | This is the status of a new file being created for a future (not yet committed) checkpoint. If the file system commits the checkpoint, the file’s state becomes In-use retained. If the system instead aborts the checkpoint, it makes the file’s resources free again for re-use.

In-use retained | F.Modified <= V.LatestCommitted; F.Invalidated == NULL | This is a normal file state – it means that the file is part of the volume’s most recently committed checkpoint.

Invalidation pending | F.Invalidated > V.LatestCommitted | In this state, the file system is in the process of creating a checkpoint that, when committed, will invalidate the file. If the commit operation completes, the status changes to Retained. If the system does not commit the new checkpoint, it eventually returns the file to the In-use retained state (sometime before it commits the next checkpoint).

Retained | F.Invalidated > V.OldestRetained; F.Invalidated <= V.LatestCommitted | A file in this state has been invalidated. It is no longer in the volume’s latest checkpoint, but it is still part of an older checkpoint that has current users. When all of this checkpoint’s users end their sessions, the system could delete the file (put it in the Free state, in a sense) and re-use its resources, depending on its checkpoint retention policy.
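Because the states are implied rather than stored, they can be computed on demand from the variables. The sketch below does this for the five rows of the table. The class and function names are hypothetical, None stands in for NULL, and the handling of a file with neither variable set is an assumption (the table does not list that combination).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Volume:
    oldest_retained: int    # V.OldestRetained
    latest_committed: int   # V.LatestCommitted


@dataclass
class File:
    modified: Optional[int] = None      # F.Modified (None stands for NULL)
    invalidated: Optional[int] = None   # F.Invalidated (None stands for NULL)


def implied_state(f, v):
    """Derive a file's implied state from the file and volume variables."""
    if f.invalidated is not None:
        if f.invalidated <= v.oldest_retained:
            return "Free"
        if f.invalidated > v.latest_committed:
            return "Invalidation pending"
        return "Retained"
    if f.modified is None:
        # Assumption: the table does not cover this case; treat it as free space.
        return "Free"
    if f.modified > v.latest_committed:
        return "Newly allocated"
    return "In-use retained"


# The example volume after checkpoint 3 is committed.
v = Volume(oldest_retained=1, latest_committed=3)
old_a = File(modified=1, invalidated=2)   # first version of a.html
new_b = File(modified=3)                  # b.html from checkpoint 3
```

In this example the first version of a.html comes out as Retained and the new b.html as In-use retained. Nothing is written to any file when the volume variables change, which is the performance point made above.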

State transitions

This drawing shows the checkpoint states a typical file goes through during its life span:

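[Figure: state diagram with the states Newly allocated, In-use retained, Newly invalidated, Retained, and Free. Transitions are labeled "commit". A decision, "Too old to retain?" (is F.Invalidated < the smaller of V.OldestRetained or V.OldestBeingUsed?), leads to Free on Yes and to Retained on No.]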



The following table shows how the system updates variables and changes implied states for individual files in response to miscellaneous checkpoint events.


From state | To state | Triggering event | Variables modified

Free | Newly allocated | File allocation | F.Modified := V.LatestCommitted + 1

Newly allocated | Free | File delete | F.Invalidated := V.LatestCommitted + 1

Newly allocated | Free | Checkpoint abort | F.Modified := NULL

Newly allocated | In-use retained | Commit | Increment V.LatestCommitted

In-use retained | Retained | Commit, when V.OldestRetained is still < F.Invalidated | Increment V.LatestCommitted

In-use retained | Free | Commit, when V.OldestRetained is now >= F.Invalidated |

Retained | Free | Commit, when V.OldestRetained is now >= F.Invalidated |

Invalidation pending | In-use retained | Checkpoint abort | F.Invalidated := NULL

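The variable updates in the table can be traced with a small model. Everything here is a hypothetical sketch: the class and function names are invented, LatestCommitted is assumed to start at 0 so that the first checkpoint is number 1, and the abort logic simply applies the two "Checkpoint abort" rows to files touched by the uncommitted checkpoint.

```python
class Volume:
    def __init__(self):
        self.latest_committed = 0   # V.LatestCommitted (starting value assumed)
        self.files = []


class File:
    def __init__(self, name):
        self.name = name
        self.modified = None      # F.Modified (None stands for NULL)
        self.invalidated = None   # F.Invalidated (None stands for NULL)


def allocate(volume, name):
    """File allocation: F.Modified := V.LatestCommitted + 1."""
    f = File(name)
    f.modified = volume.latest_committed + 1
    volume.files.append(f)
    return f


def delete(volume, f):
    """File delete: F.Invalidated := V.LatestCommitted + 1."""
    f.invalidated = volume.latest_committed + 1


def commit(volume):
    """Commit: increment V.LatestCommitted; no per-file variable changes."""
    volume.latest_committed += 1


def abort(volume):
    """Checkpoint abort: reset variables set for the uncommitted checkpoint."""
    pending = volume.latest_committed + 1
    for f in volume.files:
        if f.modified == pending:
            f.modified = None
        if f.invalidated == pending:
            f.invalidated = None


v = Volume()
a1 = allocate(v, "a.html")
commit(v)                      # checkpoint 1
a2 = allocate(v, "a.html")     # replacement version
delete(v, a1)                  # old version leaves the latest checkpoint
commit(v)                      # checkpoint 2
b = allocate(v, "b.html")
abort(v)                       # the would-be checkpoint 3 is abandoned
```

After this sequence the first a.html has Modified 1 and Invalidated 2, the second has Modified 2, and the aborted b.html has its Modified value cleared while LatestCommitted stays at 2.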

Replication

Replication is a disk storage method for protecting against lost content by making complete copies of volumes on different storage nodes. It is also useful when one copy of a content file is not enough to meet the peak demand for the file, and it improves the system’s peak throughput. System managers can use replication to place copies of popular content near the physical outer edges of disk drives, where data is read faster (since more bytes pass under the read/write head in the same amount of time than near the middle). A replication strategy is the way a customer chooses to replicate volumes.

Replication of content is always a relatively low-priority activity, not important enough to interfere with sending out content. As a result, there is lagged replication – the system usually finishes the copy operation somewhat after finishing the storage of the original content.



Replication and checkpointing

On systems using checkpointing, lagged replication follows behind the commitment of each checkpoint. For example, when the system originally allocates space for a volume, the replicated copy doesn’t exist yet:

[Figure: original volume and copy]

As the original volume grows and changes, the copy might lag behind like this:

Original volume | Copy

CPs included: 1 |

CPs: 1, 2 | CPs: 1

CPs: 1, 2, 3 | CPs: 1, 2

At this point, if it needed data from checkpoints 1 or 2, the file system could get it from either the original or the copy (assuming that the file in question has been invalidated in either of these checkpoints).



Content transfer engine subsystem

A content transfer engine is a platform that supports a number of content transfer daemons (CTDs). In the initial implementation, it works with CTDs in the Internet delivery subsystem. Together the two subsystems send data to Internet users.

A supplemental processor initializes and oversees the CTE. Sometimes these two components are on the same access node:

[Figure: an IP access node containing the content transfer engine and the supplemental processor, with connections to the fabric and the Internet]

Here the supplemental processor is on a separate supplemental processor node:

[Figure: the supplemental processor on a supplemental processor node and the content transfer engine on an IP access node, with connections to the fabric and the Internet]

The CTE is implemented as a set of fixed-function engines and re-programmable micro-engines, all on a field-programmable gate array (FPGA). The engine’s major functions are buffer management and two communication interfaces going to the fabric and the Internet.



Inside the CTE

This drawing shows the CTE’s internal structure:

[Figure: Content Transfer Engine. A node-fabric interface engine (toward the fabric), an event engine, and an external network interface engine (toward the external network), with event queues (EventQ), buffer and memory control, and external RAM.]

The CTE has multiple event engines, though perhaps only a few. Each event engine has its own queue. The event engines receive notices about events to be processed by particular CTDs. An event engine dispatches an event by waking up the correct CTD to process it:

[Figure: event engines n-1 and n in the content transfer engine subsystem, each with its own event queue (EventQ), sending dispatch signals to CTDs in the Internet delivery subsystem]

The CTE assigns an arriving event to an event engine in this way:

• If the CTE currently has no other pending events for the particular CTD, the event goes to a randomly chosen event engine.

• However, if the event queues already contain at least one event for that CTD, then all events for that CTD must go through the same event engine, until they are all dispatched.
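The two assignment rules above can be modeled as follows. This is a hypothetical software sketch for illustration only (the real CTE is implemented on an FPGA); the EventRouter class, its method names, and the per-CTD counter are all assumptions.

```python
import random
from collections import deque


class EventRouter:
    """Hypothetical model of how an arriving event is given to an event engine."""

    def __init__(self, engine_count, rng=None):
        self.queues = [deque() for _ in range(engine_count)]  # one queue per engine
        self.pending = {}   # CTD id -> [engine index, undispatched event count]
        self.rng = rng if rng is not None else random.Random()

    def assign(self, ctd, event):
        if ctd in self.pending:
            # Events are already queued for this CTD: stay on the same engine.
            engine = self.pending[ctd][0]
            self.pending[ctd][1] += 1
        else:
            # No pending events for this CTD: choose an engine at random.
            engine = self.rng.randrange(len(self.queues))
            self.pending[ctd] = [engine, 1]
        self.queues[engine].append((ctd, event))
        return engine

    def dispatch(self, engine):
        """Take the next event off one engine's queue, as if waking its CTD."""
        ctd, event = self.queues[engine].popleft()
        self.pending[ctd][1] -= 1
        if self.pending[ctd][1] == 0:
            del self.pending[ctd]   # the CTD may land on any engine next time
        return ctd, event


router = EventRouter(engine_count=4, rng=random.Random(0))
first = router.assign("ctd-A", "event 1")
second = router.assign("ctd-A", "event 2")   # must follow the first event
```

Keeping all pending events for one CTD on a single engine means they leave that engine's queue in arrival order, which is presumably the point of the second rule.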

One of the critical issues for the content transfer engine is memory management. The CTE makes a minimum of memory transfers as it receives data from storage nodes and sends it to the external network. It does this by initially storing data from the access nodes into external RAM buffers, then sending it to the Internet from those same buffers.

A supplemental processor runs a component called the content transfer engine extension (CTEX). CTEX runs extended content transfer daemons (XCTDs), which extend the content processing functions beyond the CTE’s limited processing scope. For example, XCTDs have more flexible access to shared data than CTDs do.
