
  • Network and Grid Support for Multimedia

    Distribution and Processing

    Petr Holub


    A thesis submitted for the degree of Doctor of Philosophy at

    The Faculty of Informatics, Masaryk University Brno

    May 2005
    Brno, Czech Republic

  • Except where otherwise indicated, this thesis is my own original work.

    Petr Holub
    Brno, May 2005

  • ACKNOWLEDGMENTS

    I would like to express my gratitude to prof. Luděk Matyska and dr. Eva Hladká for supporting me in my work and motivating me. I would also like to thank the fellows at the Laboratory of Advanced Networking Technologies: Lukáš Hejtmánek, Tomáš Rebok, Miloš Liška, Jiří Denemark, and others, for creating a great team that I truly appreciate working with. Furthermore, I'd like to thank my parents and my grandparents, especially my grandfather Miloslav, who spent a huge amount of time with me and was an excellent professor. He taught me how to love languages and mathematics, and how closely these fields are interrelated. And last but not least, I'd like to thank my wife Aleška, who has helped me immensely in recent years, soothing and encouraging me in the moments when I was feeling really down.

    P. H.

  • Abstract

    In this thesis, we focus our work on two classes of multimedia data distribution and processing problems: synchronous or interactive, which require as low latency as possible, and asynchronous or non-interactive, where latency is not so restrictive.

    In Part I, we study a scalable and user-empowered infrastructure for synchronous data distribution and processing. We propose a concept of a generalized user-empowered modular reflector called the Active Element (AE). It supports both running as an element of a user-empowered distribution network suitable for larger groups and distributing the AE itself over tightly-coupled computer clusters. While the networks of AEs aim at scalability with respect to the number of connected clients, the distributed AE is designed to be scalable with respect to the bandwidth of an individual data stream. We have also demonstrated medium-bandwidth pilot applications suitable for AE networks as well as high-bandwidth applications for distributed AEs.

    For AE networks, we have analyzed a number of distribution models suitable for synchronous data distribution, ranging from simple 2D full-mesh models through multiple spanning trees. All the models were evaluated both in terms of scalability and in terms of robustness of operation with respect to AE failure and network disintegration.

    The most serious problem of the distributed AE, where data is split over multiple equivalent AE units running in parallel, is packet reordering. We have designed and evaluated the Fast Circulating Token (FCT) protocol, which provides limited synchronization among the egress sending modules of parallel paths in a distributed AE. While even a distributed AE with no explicit sending synchronization provides only limited reordering, we have shown both theoretically and experimentally that FCT improves output packet reordering.

    Part II presents our approach to distributed asynchronous multimedia processing. We have designed an efficient model for distributing asynchronous processing that is capable of very complex processing in real time or faster, depending on the degree of parallelism involved. The model is based on creating jobs of uniform size for parallel computing nodes without shared memory, as available in Grid environments. It uses a distributed storage infrastructure as transient storage for source and possibly also target data. We have analyzed scheduling in such an environment and found that our problem with uniform jobs and non-uniform processors belongs to the PO class. When the distributed storage is connected to the computing infrastructure via a complete graph, the problem of scheduling tasks to storage depots belongs to the same class, and thus the scheduling as a whole is a PO-class problem.

    In order to evaluate these models experimentally, a prototype implementation called the Distributed Encoding Environment (DEE) has been built on top of the Internet Backplane Protocol distributed storage infrastructure. The prototype confirms the expected behavior and performance. DEE has come to be used routinely by its pilot applications, most notably the processing of lecture recordings, which provides multi-terabyte archives of video material for educational purposes.

  • Contents

    Contents . . . iv
    List of Figures . . . viii
    List of Definitions and Theorems . . . x
    List of Abbreviations . . . xi

    1 Introduction 1

    I Synchronous Distributed Processing 3

    2 Objectives of Synchronous Distributed Processing 4
    2.1 Distribution of Processing Load . . . 5
    2.2 Distribution of Network Load . . . 6
    2.3 Fault Tolerance . . . 6

    3 State of the Art 7
    3.1 Multicast . . . 7
    3.2 User-Empowered Modular Reflector . . . 8

    3.2.1 Architecture of the Reflector . . . 8
    3.2.2 Usage of the Reflector . . . 10

    3.3 Resilient Overlay Networks . . . 10
    3.4 Multimedia Processing and Distribution Systems Based on Overlay Networks . . . 11
    3.4.1 Virtual Room Videoconferencing System (VRVS) . . . 11
    3.4.2 Access Grid . . . 11
    3.4.3 H.323 Videoconferences . . . 12

    3.5 Use of Clusters as Distributed Routers . . . 13
    3.6 Use of Clusters as Distributed Servers . . . 13
    3.7 Peer-to-Peer Networks . . . 14
    3.8 OptIPuter . . . 14

    4 Networks of Active Elements 15
    4.1 Synchronous Multimedia Distribution Networks . . . 15
    4.2 Active Element with Network Management Capabilities . . . 16

    4.2.1 Organization of AE Networks . . . 17
    4.2.2 Re-balancing and Fail-Over Operations . . . 17

    4.3 Distribution Models . . . 19
    4.3.1 2D Full Mesh . . . 19
    4.3.2 3D Layered-Mesh Network . . . 22
    4.3.3 3D Layered Mesh of AEs with Intermediate AEs . . . 24
    4.3.4 Multicast Schemes . . . 27

    4.4 Content Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


    5 Distributed Active Element 29
    5.1 Architecture . . . 30
    5.2 Operation in Static Environment . . . 32

    5.2.1 Ingress Distribution . . . 33
    5.2.2 Egress Synchronization . . . 34

    5.3 Operation in Dynamic Environment . . . 38
    5.3.1 Setup of a New Ring . . . 38
    5.3.2 Addition of a New Node . . . 38
    5.3.3 Failure Detection and Recovery . . . 38
    5.3.4 Removal of Existing Node . . . 39
    5.3.5 Communication between Distributed AE and Load Balancers . . . 39

    5.4 Prototype Implementation . . . 39
    5.4.1 Experimental Setup . . . 40
    5.4.2 Performance Evaluation . . . 41
    5.4.3 Packet Loss and Reordering Evaluation . . . 44

    6 Pilot Applications for Synchronous Processing 48
    6.1 DV over IP . . . 48
    6.2 HDV over IP . . . 49
    6.3 Uncompressed HD . . . 51

    II Asynchronous Distributed Processing 53

    7 Introduction to Asynchronous Distributed Processing 54
    7.1 Objectives of Asynchronous Distributed Processing . . . 54
    7.2 State of the Art . . . 55

    7.2.1 Grid and Distributed Storage Infrastructure . . . 55
    7.2.2 Video Processing Tools . . . 55
    7.2.3 Automated Distributed Video Processing . . . 55

    8 Distributed Encoding Environment 56
    8.1 Model of Distributed Processing . . . 56

    8.1.1 Conventions Used . . . 57
    8.2 Scheduling algorithms . . . 59

    8.2.1 Use Cases and Scenarios . . . 59
    8.2.2 Components of the Model . . . 60
    8.2.3 Processor scheduling . . . 64
    8.2.4 Storage scheduling problem, 1–to–1 model . . . 65
    8.2.5 Storage scheduling problem, 1–to–n model . . . 66

    8.3 Prototype Implementation . . . 67
    8.3.1 Technical Background . . . 67
    8.3.2 Architecture . . . 68
    8.3.3 Access to IBP Infrastructure . . . 70
    8.3.4 Scheduling Model . . . 71
    8.3.5 Distributed Encoding Environment . . . 72
    8.3.6 Performance Evaluation . . . 73

    9 Pilot Applications for Asynchronous Processing 77


    10 Conclusions 79
    10.1 Summary and Discussion . . . 79

    10.1.1 Synchronous processing . . . 79
    10.1.2 Asynchronous processing . . . 81

    10.2 Future Work . . . 81
    10.2.1 Synchronous processing . . . 81
    10.2.2 Distributed Encoding Environment . . . 82

    A Detailed Measurement Results 83

    Bibliography 94

  • List of Figures

    3.1 User-Empowered Modular Reflector Architecture . . . . . . . . . . . . . . . 9

    4.1 Architecture of the Active Element. . . . 16
    4.2 2D full mesh. . . . 19
    4.3 Flow analysis in full 2D mesh of AEs. . . . 21
    4.4 Behavior of 2D full mesh for DV clients. . . . 22
    4.5 3D layered mesh. . . . 23
    4.6 Number of AEs needed for 3D mesh with intermediate hops. . . . 26
    4.7 Recovery time with backup SPT (solid line) and without it (dashed line), simulated using a cnet-based network simulator. . . . 28

    5.1 Model infrastructure for implementing the distributed AE. . . . 31
    5.2 Model of the ideal distributed AE with ideal aggregation unit. . . . 33
    5.3 Sample load balancing packet distribution for distributed AE. . . . 34
    5.4 Fast Circulating Token algorithm. . . . 36
    5.5 Alternative load balancing packet distribution. . . . 37
    5.6 Distributed AE testbed setup. . . . 40
    5.7 Forwarding performance of distributed AE without explicit synchronization for number of paths 1 through 6. . . . 42
    5.8 Forwarding performance of distributed AE with synchronization using FCT for number of paths 2 through 6. . . . 43

    6.1 DV over IP based stereoscopic transmission. . . . 49
    6.2 MPEG-2 Transport Stream packet format according to IEC 61883. . . . 50
    6.3 Latency limits for collaborative environments. . . . 52

    8.1 Workflow in the Distributed Encoding Environment model of processing distribution. . . . 57

    8.2 Model of target infrastructure. . . . 58
    8.3 PS Algorithm: Greedy algorithm for processor scheduling . . . 64
    8.4 1-DS Algorithm: 1–to–1 task scheduling . . . 66
    8.5 n-TS Algorithm: 1–to–n task scheduling . . . 67
    8.6 Distributed Encoding Environment architecture and components. . . . 69
    8.7 Simplified job scheduling algorithm with multiple storage depots per processor used for downloading (i. e. N–to–1 data transfer) and neglecting the uploading overhead. . . . 71

    8.8 Example DEE workflow for transcoding from video in DV to RealMedia format. . . . 72

    8.9 Acceleration of DEE performance with respect to degree of parallelism. . . . 75

    9.1 Scheme of the lecture recording and processing workflow. . . . . . . . . . . 78


    9.2 Interface to video lecture archives and example recording played from the streaming server. . . . 78

    A.1 Execution profile of DEE using shared infrastructure. . . . 94
    A.2 Execution profile of DEE with no remuxing using dedicated infrastructure. . . . 95
    A.3 Execution profile of DEE with remuxing using dedicated infrastructure. . . . 96

  • List of Definitions and Theorems

    Definition 4.1 Simple distribution models . . . 18
    Definition 4.2 2D full-mesh network . . . 19
    Definition 4.3 Evenly populated AE network . . . 20
    Theorem 4.1 . . . 20
    Theorem 4.2 . . . 20
    Definition 4.4 . . . 22
    Definition 4.5 3D layered-mesh network . . . 22
    Theorem 4.3 . . . 23
    Theorem 4.4 . . . 23
    Theorem 4.5 . . . 24
    Definition 4.6 q-nary distribution tree . . . 24
    Definition 4.7 3D layered mesh with intermediate AEs . . . 24
    Definition 4.8 Intermediate AE . . . 25
    Theorem 4.6 . . . 25
    Theorem 4.7 . . . 26
    Definition 5.1 Ideal network . . . 32
    Definition 5.2 Ideal multimedia traffic . . . 32
    Definition 5.3 Ideal aggregating unit . . . 32
    Definition 5.4 Ideal distributed AE . . . 33
    Definition 5.5 Ideal distribution unit . . . 33
    Theorem 5.1 Maximum reordering with no explicit synchronization . . . 34
    Definition 5.6 Non-preemptive data packet sending . . . 35
    Definition 5.7 Token handling priority . . . 35
    Theorem 5.2 Maximum reordering with FCT synchronization . . . 35
    Definition 5.8 Reordering graph . . . 44
    Theorem 5.3 . . . 44
    Theorem 5.4 . . . 45
    Definition 8.1 Data transcoding . . . 58
    Definition 8.2 Data prefetch . . . 58
    Definition 8.3 Completion Time Estimate . . . 60
    Definition 8.4 Network Traffic Prediction Service . . . 60
    Theorem 8.1 . . . 65
    Theorem 8.2 . . . 65
    Theorem 8.3 . . . 66
    Theorem 8.4 . . . 67

  • List of Abbreviations

    AAA Authorization, Authentication, Accounting
    ACL Access Control List
    AE Active Element
    AFS Andrew File System
    AG AccessGrid
    API Application Programming Interface
    AVI Audio Video Interleave; an envelope video/audio format
    CARP Common Address Redundancy Protocol
    CERN Centre Européen pour la Recherche Nucléaire
    CPU Central Processing Unit
    CTE Completion Time Estimate
    DEE Distributed Encoding Environment
    DV Digital Video
    DVTS Digital Video Transport System
    EGEE Enabling Grids for E-sciencE; European Grid project
    FCT Fast Circulating Token protocol
    FIFO First In First Out
    FPGA Field Programmable Gate Array
    Gbps Gigabit(s) per second; Gb.s−1
    GE Gigabit Ethernet
    GM the low-level message-passing system for Myrinet networks
    GSI Grid Security Infrastructure
    HD High-Definition (video)
    HDV SONY format for compressing HD video
    HTTP HyperText Transfer Protocol
    IBP Internet Backplane Protocol
    IP Internet Protocol
    ITU International Telecommunication Union
    ITU-T ITU Telecommunication Standardization Sector
    LAN Local Area Network
    MBone Multicast Backbone
    Mbps Megabit(s) per second; Mb.s−1
    MCU Multi-point Connection Unit
    MOV QuickTime envelope video/audio format
    MPEG Moving Picture Experts Group
    MPEG-2 MPEG video compression format
    MPEG-4 MPEG video compression format
    MSB Multi-Session Bridge
    MTU Maximum Transmission Unit
    NFS Network File System protocol
    NIS Network Information Service
    NM Network Management
    NTPS Network Traffic Prediction Service
    NTSC National Television System Committee; a TV system
    OWD One-Way Delay
    P2P Peer-to-Peer
    PAL Phase Alternating Line; a TV system
    PBS Portable Batch System
    PBSPro Portable Batch System – commercial Professional version
    PVM Parallel Virtual Machine
    QoS Quality of Service
    RAP Reflector Administration Protocol
    RFC Request For Comments
    RM RealMedia video/audio format
    RON Resilient Overlay Network
    RTCP Real-Time Transport Control Protocol
    RTP Real-Time Transport Protocol
    RTPv2 Real-Time Transport Protocol version 2
    RTT Round-Trip Time
    SDI Serial Digital Interface
    SMPTE Society of Motion Picture and Television Engineers
    TCP Transmission Control Protocol
    TTL Time To Live
    UDP User Datagram Protocol
    URI Uniform Resource Identifier
    URL Uniform Resource Locator
    VRRP Virtual Router Redundancy Protocol
    VRVS Virtual Room Videoconferencing System
    VLC VideoLAN Client
    XML eXtensible Markup Language
    3D Three-dimensional

  • Chapter 1

    Introduction

    The current academic Internet environment has enabled fast transfers of huge amounts of data, making high-quality multimedia and collaborative applications a reality. Both collaborative and multimedia applications involve processing of specific data with special requirements on distribution and delivery. However, the processing itself often needs to become distributed, as the required vast amount of network traffic and processing capacity can easily overload any existing commodity centralized solution. Another reason for creating a distributed solution is improvement in terms of robustness and fault tolerance.

    The problem of distributed multimedia processing can be divided into two classes of problems: synchronous (on-line or interactive) processing and asynchronous (off-line or non-interactive) processing. Though these two classes might ultimately converge, they seem to have their own distinct problems and goals. Synchronous data processing aims at processing high data volumes with as low latency as possible, and thus the amount of processing is limited by the latency requirements. Asynchronous processing has no such strict demands on latency and can involve more complex processing, too; however, the overall speed and scalability of the processing is of utmost importance. Another problem of non-interactive asynchronous data processing and distribution is the availability of transient storage of sufficient size and speed that can be accessed by acquisition tools, processing tools, and possibly also by client tools and/or distribution servers for later replay.

    Both the synchronous and asynchronous environments that we have targeted in this work need to be experimentally evaluated on pilot client applications. We focus our attention mainly on high-quality video transmission, which generates high volumes of data at high rates (e. g., uncompressed high-definition (HD) video consumes as much as approximately 1.5 Gbps) and can be used in both synchronous and asynchronous modes.
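    The 1.5 Gbps figure for uncompressed HD can be checked with a rough back-of-envelope computation. The sampling parameters below (1080-line video at 30 frames per second with 10-bit 4:2:2 chroma subsampling) are illustrative assumptions; the active-video rate they yield is about 1.24 Gbps, and the serial transport (SMPTE 292M) carries 1.485 Gbps including blanking:

```python
# Back-of-envelope estimate of uncompressed HD video bandwidth.
# Assumed parameters: 1920x1080 pixels, 30 frames/s, 4:2:2 chroma
# subsampling at 10 bits per sample => 20 bits per pixel on average.
width, height = 1920, 1080
fps = 30
bits_per_pixel = 20  # 10-bit luma + 10-bit chroma (4:2:2 average)

rate_bps = width * height * fps * bits_per_pixel
print(f"{rate_bps / 1e9:.2f} Gbps")  # ~1.24 Gbps of active video
```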

    This work is organized into two parts: in Part I, we discuss problems and proposed solutions for synchronous distribution environments, while Part II deals with asynchronous infrastructure. The organization of each part is outlined in its introductory chapter, and the results and future work for both parts are summarized in the concluding Chapter 10.

    Claims

    The author of this thesis claims the following contributions to the state of the art of network and Grid support for multimedia processing and distribution:

    • Proposal of the Active Element (AE) architecture, designed as a user-empowered distributed network element for multimedia distribution and processing.


    • Design and analysis of several robust models for synchronous data distribution, scalable with respect to the number of connected clients.

    • Proposal of a distributed active element capable of processing streams with bandwidth higher than the capacity of each individual AE.

    • Load-sharing distributed AE without explicit synchronization is analyzed and shown to provide bounded reordering.

    • Load-sharing distributed AE with synchronized sending using the Fast Circulating Token (FCT) protocol is designed, analyzed, and shown to provide bounded reordering superior to sending without explicit synchronization.

    • Behavior of the load-sharing distributed AE, both with FCT and without it, has been experimentally studied using a prototype implementation based on state-of-the-art computing clusters with low-latency interconnects.

    • Distributed encoding environment workflow whose parallel phase scales linearly with respect to the number of nodes involved in processing, until a task-specific maximum number of nodes is reached.

    • Analysis of task scheduling for the distributed encoding environment with respect to distributed processors and distributed data storage. The scheduling algorithm belongs to the PO class.

    • A number of pilot applications using prototype implementations of both the synchronous and asynchronous distribution and processing models.

  • Part I

    Synchronous Distributed Processing

  • Chapter 2

    Objectives of Synchronous Distributed Processing

    The problem of synchronous (on-line or interactive) processing lies in providing an environment with as low latency of processing and distribution as possible. For example, human perception of sound registers latency as low as 100 ms in general communication, while the latency requirements of visual perception are not that strict. However, for a truly natural collaborative environment, the video must be rather precisely synchronized with the audio (so-called “lip synchronization”), so that the video latency is of the same importance. Furthermore, when haptic interaction (force feedback) is incorporated, the latency and especially jitter requirements become tighter by several orders of magnitude1.

    Typical applications for synchronous distributed processing are high-quality videoconferencing and remote collaborative applications. These might include, for instance, transmission of stereoscopic (3D) video to provide natural perception of the communicating partner, transmission of 3D model visualizations, transmission and processing of uncompressed video and audio to minimize latency, and many others.

    A distributed environment for processing high volumes of data at high rates requires distribution of

    • processing load—to be able to process amounts of data that are impossible to process via any commodity computer today,

    • network load—to avoid bottlenecks formed by the networking interface of a single processing computer.

    Network load distribution has two aspects, which are handled separately in this work: first, a distribution that boosts scalability with respect to the number of clients served by the infrastructure; and second, a distributed scheme that allows processing of high-bandwidth data streams beyond the capacity that can be handled by any single processing node.

    Distribution also allows forming overlay networks that can be used to provide fault-tolerant behavior, as shown later in this chapter. We also seek a more general framework for distribution of processing and transportation of multimedia data, and the possibilities it can bring.

    Our work deals with employing commodity PC clusters interconnected with low-latency interconnects (so-called tightly coupled clusters) to perform distributed processing of multimedia data and distribute it to clients. Higher-level “clusters” of the tightly coupled clusters, or separate computers interconnected via network links with higher latency (and thus possibly distributed across a wide area), will be used to create overlay networks. Such clusters are often available as part of a Grid high-performance computing environment. On the transport level, a standard IP network with common commodity equipment interconnecting processing nodes with clients is expected.

    1Commonly understood threshold for natural perception of haptic interaction is 5,000 Hz, and thus a timing precision, or jitter, of much less than 200 µs is needed.

    2.1 Distribution of Processing Load

    The distribution of the processing load is efficient only when computationally intensive processing is required. Additionally, the needed network capacity must be substantially lower than the capacity of commodity PC internal interconnects and buses. The following applications are possible examples suitable for distribution of processing load only [32]:

    • Multimedia stream transcoding—Multimedia transcoding is conversion from a source to a target format. For example, video transcoding often involves multiple discrete cosine transforms, complicated matrix transforms, and other computationally intensive operations. Although it is possible to distribute the compression/decompression algorithm for some formats, it is often more effective to distribute processing on either a per-frame2 or per-client basis. This capability is useful, for example, when some client application needs multimedia data in a format other than the rest of the collaborating group.

    • Video de-interlacing—Another example is when the video image is captured and transmitted in interlaced mode3 and client applications need progressive video to display the image correctly. The so-called de-interlacing process is computationally very demanding if high-quality output is requested, as it requires decompression of two successive fields, blending odd and even lines together, and finally compression of the resulting image into the target format. The distribution can be on a per-frame basis. It is next to impossible to do this in real time using current commodity PCs without additional hardware support, especially for HD video. It makes sense to perform de-interlacing centrally, as almost no client application displays its output on an interlaced display, and the video producers are unable to transform it on their own because of excessive computational demands. This transformation also does not change the bandwidth needed, and thus if the original stream was processable by the network subsystem of a commodity PC, the resulting stream remains processable as well.

    • Multimedia stream composition—Stream composition means either merging several streams together (for sound) or arranging several down-sized (down-sampled) images into one frame [38]. This involves decompression and down-sizing of many streams in parallel, and thus one computer may not be sufficient. The decompression and down-sizing phase can be efficiently performed on many nodes in parallel. This capability is useful, for example, if the processing power at client sites is insufficient to decode and play a large number of simultaneous streams in parallel.
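    The line-blending step mentioned in the de-interlacing example above can be sketched as follows. This is a minimal illustration operating on a raw frame represented as nested lists of pixel values; field extraction and the compression/decompression stages are omitted:

```python
def deinterlace_blend(frame):
    """Blend each line with its neighbor from the other field,
    averaging odd and even lines to remove interlacing combs.
    `frame` is a list of rows; each row is a list of pixel values."""
    blended = []
    for i, row in enumerate(frame):
        # Pick the adjacent line, which belongs to the other field.
        other = frame[i + 1] if i + 1 < len(frame) else frame[i - 1]
        blended.append([(a + b) // 2 for a, b in zip(row, other)])
    return blended

# A tiny 4-line "frame": the even field holds 10s, the odd field 30s.
frame = [[10, 10], [30, 30], [10, 10], [30, 30]]
print(deinterlace_blend(frame))  # every output line becomes [20, 20]
```

Because each output line depends only on one frame, the whole operation parallelizes naturally on a per-frame basis, as the text describes.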

    The distribution of the processing load is not very hard from a theoretical point of view when the modular user-empowered reflector is used—distribution over processors can be handled efficiently, e. g. using message passing like MPI over Myrinet. Being thus more a question of development and implementation, it is not the focus of this work.

    2For low-latency video transmission, video compression algorithms producing independent frames or even independent blocks are used to minimize the latency needed to compress and decompress the image. This approach also limits the impact of data loss in the network.

    3Interlaced video means the odd lines are displayed first, followed by the even lines (or vice versa), to allow doubling of the display frequency in order to achieve smooth perception of movement in the image. One video frame is split into two fields, one containing the odd lines and the other the even lines. This is the common approach employed in most video hardware devices such as cameras and TV sets. Progressive video displays the whole image at once, and it is the way computers typically display their output.


    2.2 Distribution of Network Load

    The distribution of the processing load only is not sufficient for many multimedia applications. Commodity PCs are usually equipped with network interfaces ranging from 100 Mbps Fast Ethernet to 2× 1 Gbps Gigabit Ethernet, all of them usually in full-duplex mode. For the Myrinet low-latency interconnect, the available capacity is 2 Gbps in full-duplex mode. However, when a PC is used to handle 1 Gbps traffic, one processor usually gets saturated by servicing the network interface card and the second processor is required to compute the actual data transformations. Furthermore, the internal architecture of the most commonly used IA32 PC architecture with PCI buses is easy to saturate when working with multi-gigabit data flows. When significantly more network bandwidth needs to be processed, the network load has to be distributed over multiple hosts.

    A switch interconnecting a cluster almost always has higher switching capacity than any of the computers connected to it, and thus the switch is not the network bottleneck for serial processing by any single node. Again, the minimal target scenario is to create a distributed environment with higher maximum processable throughput than is possible with any single computer in the cluster.

    When several machines are used for sending data that are part of a single stream, it is necessary to design and implement some synchronized sending architecture to avoid large packet reordering, which has a negative impact on most applications. In Chapter 5 we show that even without explicit synchronization the packet reordering has an upper bound, and we also propose and evaluate a protocol to reduce the reordering further.

    Examples of possible scenarios that can utilize distribution of both processing and network load are shown below:

    • Processing many streams with medium bandwidth requirements – When standard-definition DV video and audio are used by 10 collaborating clients, each of them sends one stream and receives the streams from all other clients, resulting in 30 Mbps sent and 270 Mbps received per client—this can be handled by one PC without serious problems. The total bandwidth required at the processing site, however, is 300 Mbps for receiving and 2.7 Gbps for sending, which is clearly beyond what a single commodity PC can process.

    • Processing one high-bandwidth stream – A high-bandwidth stream can be, for example, a single uncompressed HD stream (1.5 Gbps) or a high-resolution visualization with virtually unlimited bandwidth utilization. It might be needed, e. g., to distribute such streams to multiple clients or to down-sample such a stream to make it accessible to clients with lower-bandwidth connectivity.
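The bandwidth figures in the first scenario follow from simple counting: with n clients each sending one stream, the processing site receives n streams and must send each stream to the n − 1 other clients. A small illustrative helper (function name is ours) reproduces the numbers quoted above:

```python
def reflector_bandwidth(clients, stream_mbps):
    """Aggregate bandwidth at a unicast processing site serving `clients`
    senders, each of which also receives every other client's stream."""
    per_client_send = stream_mbps
    per_client_recv = (clients - 1) * stream_mbps
    site_recv = clients * stream_mbps                  # all incoming streams
    site_send = clients * (clients - 1) * stream_mbps  # each stream to n-1 peers
    return per_client_send, per_client_recv, site_recv, site_send

# 10 collaborating DV clients at 30 Mbps each:
print(reflector_bandwidth(10, 30))  # (30, 270, 300, 2700), i.e. 2.7 Gbps outbound
```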

    2.3 Fault Tolerance

    Failures of links in wide area networks are likely to occur quite frequently, as shown in [2], even when the network as a whole is rather stable. Failures can also happen inside the distributed processing environment, e. g. when a cluster of processing nodes crashes or becomes unavailable for any other reason. Therefore, an environment designed to support synchronous communication of clients distributed across wide area networks should attempt to mitigate the clients' perception of these problems.

    Especially when it comes to multimedia distribution using native multicast, the distribution scheme is over-optimized from the fault-tolerance point of view. Therefore, as a part of the network load distribution models scalable with respect to the number of connected clients, we study different distribution schemes with varying ratios of scalability and robustness.

  Chapter 3

    State of the Art

    There is a number of systems available for synchronous distribution of multimedia data in a non-distributed fashion, and some of them even have limited processing capabilities. Most of these systems are also designed without the user-empowered paradigm in mind. Our approach differs in distributing the processing either over a network of elements inside the network or over a tightly coupled cluster of computers, to allow distribution of both a high number of streams (clients) and high-bandwidth streams while also allowing more demanding processing. Therefore, our work is related not only to multimedia processing, but also to projects utilizing PC clusters as distributed routers and servers. We also regard robustness as an important cornerstone of larger distributed systems and propose distribution models crafted with both scalability and robustness in mind.

    In this chapter we give a brief overview of related systems. We give a short overview of the user-empowered modular reflector, which is a predecessor of our active elements described in Chapters 4 and 5. We emphasize systems that create overlay networks to increase robustness instead of relying on the robustness of underlying networks. We also briefly describe the status quo of peer-to-peer network architectures, which we consider important due to their fault-tolerance support. We conclude this chapter with a description of the OptIPuter, which we see as an interesting holistic vision of a distributed collaborative platform for the future that comprises all the levels from the optical network to end-user applications.

    3.1 Multicast

    The multicast scheme has been designed for unidirectional data transmissions to reach any subset of nodes in the network while sending the data over any link at most once [66]. It involves no additional data processing except for simple data replication where appropriate, and thus any user-specific data handling (processing, QoS, etc.) is impossible.

    Multicast is the “natural” solution for synchronous data distribution, as it involves data multiplication directly in the network so that the same data are transferred at most once over any particular link. It has actually been associated with multimedia transmission from the beginning, as no lines of sufficient capacity were available for multimedia distribution around the year 1985, when the first prototypes of multicast in IP networks appeared. While this approach implies large (“infinite”) scalability, it imposes non-trivial requirements on the network, as all the network nodes must support it in a consistent way.

    The MBONE (Multicast backBONE) network was created by Steve Deering at the beginning of multicast history as an overlay over the underlying unicast network. The multicast networks were connected using tunnels created by mrouted daemons [82]. Each mrouted connects to other mrouted daemons and creates tunnels that deliver the multicast traffic. The tunnel is used to encapsulate multicast packets into datagrams, which are in turn


    sent through the unicast network. Routing among mrouted daemons is performed using the Distance-Vector Multicast Routing Protocol (DVMRP).

    Nowadays, more advanced protocols like Protocol Independent Multicast [18] (in either the so-called Sparse Mode [21] or Dense Mode [1]) are deployed, which are not supported by the original mrouted software. The prevailing multicast routing protocol in the current Internet is PIM–SM. These protocols improve the behavior of the original multicast protocols to some extent; however, the basic problems of multicast discussed above (e. g. per-client QoS handling) remain the same. Therefore, multimedia distribution either uses multicast simulation, as shown in the subsequent section (the idea of “virtual multicast” is also proposed in [32]), or uses a hybrid approach where multicast-enabled clients are served via multicast and the other clients via unicast distribution.

    Despite continuous effort, only a small fraction of places on the Internet have reliable native multicast connectivity, and users are left at the mercy of administrators regarding their multicast connectivity, as multicast is not user-empowered. Another problem with multicast is its implementation directly on routers inside the network: this is great from the efficiency point of view, but a disaster if these routers lack strict separation of processes in their internal operating system. If any problem with router stability occurs due to the multicast implementation—and there is a good chance of this happening, as multicast is indeed very complicated compared to unicast and poses much more load on the router, since the data must also be replicated—the router administrator simply cuts off multicast, because he or she cannot afford to threaten unicast routing stability. Furthermore, all the practically used multicast protocols have other disadvantages as well: it is nearly impossible to take care of quality-of-service requirements for the whole multicast group, it is very difficult to provide a secured environment without a shared key, and there is no easy support for accounting.

    3.2 User-Empowered Modular Reflector

    The problems of multicast may be overcome by multicast connectivity simulation or virtualization, where active nodes take the role of reflectors [32] that replicate all traffic passing through them in a controlled way. Another important property of the reflectors is that they can be used for arbitrary content processing in many different modes—be it on a per-stream, per-group, or per-client basis.

    3.2.1 Architecture of the Reflector

    The design of a reflector must be flexible enough to allow implementation of the required features while leaving space for easy extensions with new features. This leads to a design that is very similar to the active router architecture [37], modified to work entirely within user space. Users without administrator privileges are thus able to run the reflector on any machine they have access to. The reflector architecture is shown in Fig. 3.1.

    Data Processing Architecture

    The data routing and processing part of the reflector comprises network listeners, shared memory, a packet classifier, a processor scheduler, a number of processors, and a packet scheduler/sender.

    The network listeners are each bound to one UDP port. When a packet arrives, the listener places it into the shared memory and adds a reference to a to-be-processed queue. The packet classifier then reads the packets from that queue and determines the path of the data through the processor modules. It also checks with the routing AAA module whether the packet is allowed or not (in the latter case it simply drops the packet and creates an event that may be logged). Zero-copy processing is used in all simple processors (packet filters), minimizing processing overhead (and thus packet delay). E. g. for simple data multiplication, the data are only referenced multiple times in the packet scheduler/sender queue before they are actually sent. Only the more complex modules may require processing that is impossible without the use of packet copies.

    FIGURE 3.1: User-Empowered Modular Reflector Architecture

    The session management module follows the processors and fills the distribution list with the target addresses. The filling step can be omitted if the data passed through a special processor that filled the distribution list structure and marked the data attribute appropriately (this allows client-specific processing). A processor can also poll the session management module to obtain an up-to-date list of clients for a specified session. The session management module also takes care of adding new clients to the session as well as removing inactive (stale) ones. When a new client sends packets for the first time, the session management module adds the client to the distribution list (data from forbidden clients have already been dropped by the packet classifier). This mechanism is designed to work with the MBone Tools suite, but it can easily be extended with other ways of working with the session management module and of adding or removing items to/from the distribution lists. Information about the last activity of a client is also maintained by the session module and is used for periodically pruning stale clients. Even when the distribution list is not filled by the session management module, packets must pass through it to allow addition of new clients and removal of stale ones.
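The session bookkeeping just described—add a client on its first packet, track last activity, prune stale entries—can be sketched as follows. This is a minimal illustrative model with hypothetical names, not the reflector's actual code; the real module also honors the AAA decisions and per-session structures described above.

```python
import time

class SessionManager:
    """Minimal sketch of session management: clients join on their first
    packet and are pruned after a period of inactivity."""

    def __init__(self, timeout=60.0):
        self.timeout = timeout
        self.last_seen = {}  # (address, port) -> time of last activity

    def on_packet(self, client, now=None):
        """Record client activity; first packet implicitly adds the client."""
        self.last_seen[client] = time.time() if now is None else now

    def distribution_list(self, now=None):
        """Prune stale clients, then return the current distribution targets."""
        now = time.time() if now is None else now
        stale = [c for c, t in self.last_seen.items() if now - t > self.timeout]
        for c in stale:
            del self.last_seen[c]
        return list(self.last_seen)
```

A client that stops sending simply ages out of the distribution list after `timeout` seconds, with no explicit leave signaling required—matching the behavior of the MBone Tools, which have no real signaling protocol.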

    When the packet targets have been determined by the router processor, a reference to the packet is put into the to-be-sent queue. The packet scheduler/sender then picks up packets from that queue, schedules them for transmission, and finally sends them to the network. Per-client packet scheduling can also be used, e. g., for client-specific traffic shaping.


    The processor scheduler is not only responsible for scheduling the processors, but it also takes care of the start-up and (possibly forced) shutdown of processors, which can be controlled via the administrative interface of the reflector. While scheduling, it checks resource limits with the routing AAA module and provides back some statistics for accounting purposes.

    Administrative Part

    Communication with the reflector from the administrative point of view is provided using messaging interfaces, the management module, and the administrative AAA module of the reflector. Commands for the management module are written in a specific message language.

    The messaging interface is a generic entity, which can be instantiated as, e. g., RPC, SOAP over HTTP, a plain HTTP interface with SSL/TLS or GSI support, or a simple TCP connection bound to the loop-back interface of the machine running the reflector. Each of these interfaces unwraps the message if necessary and passes it to the management module. The message language for communication with the management module is called the Reflector Administration Protocol (RAP) and is described in [19]. More information on the administrative part of the reflector can be found in [32] and [31].

    3.2.2 Usage of the Reflector

    The basic function of the reflector is retransmission of received data to one or more listeners. This can easily be extended to support other useful functions. The reflector replicates all the traffic coming through a specified port to all the clients connected to that port. MBone Tools based clients do not need to interact in advance—they just connect to the reflector to automatically receive all the traffic sent to the reflector, and all the client traffic is automatically distributed by the reflector as well. The reflector security policy (per port or per client) may change this behavior and forbid some clients from listening or sending data.

    3.3 Resilient Overlay Networks

    The Resilient Overlay Network (RON) approach [2] aims to build a rather general overlay network on top of an IP-based network to improve the speed of recovery from network outages and to improve routing between hosts in different autonomous systems using more complicated metrics than simple hop counts. The whole system is based on the assumption that while very simple metrics and robust routing information distribution are required in the Internet, which comprises hundreds of millions of nodes (as exemplified by BGP–4), overlay networks of limited size (ranging from 2 to 50 RON forwarders) can use much more sophisticated mechanisms.

    RON evaluates three basic metrics in order to choose the optimum path: (a) latency, (b) packet loss, (c) estimated maximum TCP throughput. The topology information is disseminated using a link-state algorithm. RON is capable of failure detection and recovery in 18 s on average, and the routing “detour” usually includes no more than one additional RON forwarder. RON also allows application integration and expressive policies to choose the optimum path for a specific application (e. g. one application might need as low a transmission latency as possible while another needs the maximum TCP throughput available). RON attempts to perform per-flow routing to avoid sending data from one application over multiple parallel links, getting rid of the packet reordering problem.
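The per-application path policy can be illustrated with a deliberately simplified sketch: each candidate path carries its measured metrics, and the application states which single metric it optimizes. All names here (path ids, metric keys) are our own illustration—RON's actual probing, summarization, and policy language are considerably richer.

```python
def pick_path(paths, preference):
    """Choose a path by a per-application preference, RON-style.

    `paths` maps a path id to its measured metrics."""
    if preference == "latency":      # lower is better
        return min(paths, key=lambda p: paths[p]["latency_ms"])
    if preference == "loss":         # lower is better
        return min(paths, key=lambda p: paths[p]["loss"])
    if preference == "throughput":   # higher is better
        return max(paths, key=lambda p: paths[p]["tcp_mbps"])
    raise ValueError("unknown preference")

# Example: the direct Internet path vs. a detour via one RON forwarder.
paths = {
    "direct":     {"latency_ms": 40, "loss": 0.05, "tcp_mbps": 20},
    "via_fwdr_A": {"latency_ms": 55, "loss": 0.01, "tcp_mbps": 45},
}
```

A latency-sensitive videoconferencing flow would keep the direct path here, while a bulk-transfer flow would take the detour—the kind of divergent, application-specific choice RON makes possible.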

    Other similar approaches based on general overlay networks to improve performanceor to implement features not available in the underlying network include Detour [61] andX-Bone [64].


    3.4 Multimedia Processing and Distribution Systems Based on Overlay Networks

    The currently most advanced high-speed networks are moving in the direction of being very fast while providing only very simple (basic) services. Furthermore, lessons learnt from problems with implementing advanced functionality in the network layer show that it might be more appropriate to create overlay networks built on top of “dumb and fast” networks to provide the advanced features needed. Overlay networks built on IP networks use just the basic unicast routing and transport functionality of the underlying network, while the other functions are implemented in a way orthogonal to the IP infrastructure. When some problem with an overlay network occurs, the underlying IP network is never influenced and continues to work fine for other traffic and other clients.

    3.4.1 Virtual Room Videoconferencing System (VRVS)

    The Virtual Room Videoconferencing System (VRVS) [95] is one of the most popular systems based on the idea of overlay networks built on top of unicast networks, with UDP/RTP packet reflectors taking care of distributing packets to all conference participants. It was originally developed at CERN and Caltech for communication among physicists studying the behavior of high-energy particles, who are spread all over the world and need to discuss their scientific problems. On the client side, VRVS primarily uses the MBone tools. Additional tools like chat or VNC [58] are also available. VRVS uses a web-based portal as the videoconferencing front-end. Administration of the reflectors is done by a small closed group of people called VRVS administrators. To get a reflector local to some user group, the group needs to provide a Linux-based computer with remote access (typically via SSH) in advance; the VRVS administrators install all the necessary software there and also take care of supporting it afterwards. When using VRVS, the system automatically selects the most appropriate reflector based on the location of the users—usually the geographically closest one is selected—and the users are given no option to change it. This may not avoid transmitting lots of data through costly lines.

    From the distributed multimedia processing point of view, the unique feature of the VRVS reflectors is a gateway functionality that allows VRVS to interconnect the world of MBone tools with the world of H.323 videoconferencing systems. Theoretically, the only thing that needs to be handled when interconnecting the MBone Tools and H.323 is the creation/removal of H.323 signaling, as the MBone Tools use no real signaling protocols and the basic video and audio transmission formats remain the same (typically H.261 for video and µ-law/a-law for audio). However, as VRVS is a closed-source system provided as a service, it is hard to verify what the real processing architecture is and whether the H.323/MBone Tools conversion processing is really distributed, or whether the reflectors work as relays only and the processing is performed in a centralized manner.

    3.4.2 Access Grid

    The Access Grid (AG) system [9, 10, 23] has been built to enable communication of researchers collaborating in Grid environments. The AG communication environment uses the MBone Tools videoconferencing software and basically relies on multicast support in the network. Integration of AG with DV video transmissions based on the DVTS tools (Sec. 6.1) has been tested [29], relying again on multicast or even layered multicast [28, 54]. However, because of the known problems with multicast deployment, it uses one system for bridging multicast sessions between a multicast-enabled network and a network with only unicast connectivity (or a network which is locally multicast-enabled but has no multicast peering with other networks), and another system to run videoconferencing sessions on a unicast-only network. The first scenario uses the QuickBridge software on some site which


    has multicast connectivity to the AG network; clients in the unicast network then point their client tools directly to the QuickBridge server. The second scenario is based on software called the Multi-Session Bridge, which is run on servers at Argonne National Laboratory (ANL) that also run the basic AG infrastructure. A client on a unicast network can join AG using the vtc client, which is provided within the AG software suite.

    QuickBridge [16] was developed at the University of Southern California and has been further enhanced by the AG team. It is a simple reflector which uses IP address/subnet based authentication. A database of AG rooms and corresponding multicast addresses and ports is maintained for QuickBridge, so its administrator just specifies which room should be bridged.

    The Multi-Session Bridge1 (MSB) was created at the Fermi National Accelerator Laboratory [59]. It consists of a server called msb, a client called vtc, and a web server web-vtc. It uses its tunneling capabilities to create an overlay network which can bridge both unicast and multicast videoconferences that use RTP v2 streams (like the MBone Tools). Again, the authentication is based on IP address restrictions, and it also features simple plain-text client–server authentication. Furthermore, bridging can be restricted to a specified direction only. The MSB is known to have scalability problems, as it is a centralized solution, and we can confirm these problems based on our own experience.

    From the distributed processing point of view there is no processing on these reflectorsexcept for the distribution of the streams to all clients.

    3.4.3 H.323 Videoconferences

    H.323-based videoconferences use the H.323 signaling protocol as an all-encompassing envelope for underlying protocols for channel negotiation, setup, and teardown, and for video, audio, and data transmission. The H.323 videoconferencing protocol was designed to directly support point-to-point videoconferences only. Multi-point videoconferencing capabilities are provided by hardware or software devices called Multipoint Control Units (MCUs). With only two exceptions, these devices are usually either expensive hardware boxes with professional features like power-supply redundancy, designed for maximum reliability, or limited versions built into high-end videoconferencing stations. As for those two exceptions: the first is the VRVS system described above and the second is an open-source implementation called OpenMCU, described later in this section. Both of these have rather limited capabilities compared to full-featured MCUs.

    A typical MCU can provide several videoconferencing modes, like scaling down and merging several video streams together to fit onto one screen, or video switching, where the video follows the loudest audio source. Audio is combined in such a way that all the participants can hear one another. However, it is a common problem that one participant produces some strange sounds and nobody except the MCU administrator is able to cut him or her off from the videoconference. Even the MCU administrator might face serious problems, since only a few H.323 implementations feature reasonable user identification (although it is a feature defined in the H.323 standard).

    There is a single open-source software implementation of an H.323 MCU, called OpenMCU. It has been created as a part of the OpenH323 Project [86]. The OpenH323 project also provides other open-source tools needed for an H.323 tool chain—e. g. a gatekeeper called OpenGK, an OpenPhone client, and an H.323 answering machine called OpenAM. The OpenMCU implementation is written using the H.323 library which is the basic component created by the OpenH323 project and which is also used by many other projects (e. g., the software H.323 client GnomeMeeting). OpenMCU is known to work on the FreeBSD and Linux operating systems. It supports the G.711, GSM, MS–GSM, and LPC–10 audio codecs and the H.261 video codec. It also supports multiple parallel videoconferences using a “room” concept.

    1Early versions of the VRVS system were created on the basis of the MSB.


    As for the processing, the audio streams are combined on the OpenMCU and thus all participants can be heard, but the video can only be seen from at most four users that are actively talking. The H.261 streams are down-sampled to 1/4 of the original size and the resulting images are placed into a 2×2 grid.
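The video composition OpenMCU performs—quarter-size streams tiled into a 2×2 grid—is the same operation as the stream composition discussed in Section 2.2, and can be sketched as follows. This is an illustrative fragment with our own function names, operating on decoded images represented as lists of pixel rows; the real MCU of course works on decoded H.261 frames.

```python
def downsample_half(img):
    """Halve an image in both dimensions by simple decimation
    (every second row and column), yielding 1/4 of the original area."""
    return [row[::2] for row in img[::2]]

def compose_2x2(tl, tr, bl, br):
    """Tile four equally sized quarter images into one composite frame."""
    top = [a + b for a, b in zip(tl, tr)]       # concatenate rows side by side
    bottom = [a + b for a, b in zip(bl, br)]
    return top + bottom                          # stack the two halves
```

The composite frame has the same dimensions as each original stream, so the outgoing per-client bandwidth stays roughly that of a single stream regardless of how many participants are shown.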

    3.5 Use of Clusters as Distributed Routers

    There is a known effort to use computer clusters with a low-latency interconnecting infrastructure as high-performance and scalable routers. Probably the most advanced and theoretically founded achievement is the Suez project [12, 13]2, based on commodity PC clusters with a Myrinet interconnection.

    The system works as follows: each cluster node has an internal interface to a Myrinet switch for internal communication within the cluster, and optionally one or more external interfaces. Both internal and external interfaces rely on specific capabilities of the Myrinet interface cards and drivers (e. g. PeerDMA transfer). From the external point of view, the routing is performed among the external interfaces.

    Suez uses a routing-table search algorithm that exploits the CPU cache for fast lookup by treating IP addresses directly as virtual addresses. To scale the number of real-time connections supportable at a given link speed, Suez implements a fixed-granularity fluid fair queuing (FGFFQ, also called DFQ, standing for Discretized Fair Queuing) algorithm [11] that eliminates the per-packet overhead associated with FIFO scheduling and the per-connection overhead of real-time scheduling based on the conventional weighted fair queuing algorithms.

    Another project which distributes processing load over active network elements is the Active Network Node [17], which relies on specialized hardware. The Software DSM project [27] attempts to build an efficient distributed memory for closely coupled clusters in order to use them as active routers. There is yet another similar project called the Cluster-based Active Network Router [42]. However, none of the above-mentioned projects addresses finer than per-address network load distribution, and thus there is no need for solving packet reordering issues there.

    3.6 Use of Clusters as Distributed Servers

    A number of servers utilizing computer clusters as distributed servers are available. Most distributed servers are prototyped as web servers [3, 7, 57], for simplicity reasons and also because a rather standard and straightforward performance evaluation is available for them. For example, Carrera and Bianchini recently demonstrated a cluster-based web server called PRESS [8], concentrating on demonstrating the advantages of user-level communication: low processor overhead, remote memory accesses, and zero-copy transfers. For the prototype implementation, they used the Virtual Interface Architecture user-level standard for intra-cluster communication. The server is designed as a locality-conscious server [57] in order to utilize caching of the served data. After evaluating performance, they found that user-level communication is more than 50% more efficient than kernel-level communication, and they achieved close to linear scalability up to 8 cluster nodes, which is the maximum they used for the evaluation.

    2For some reason, the only publicly available article detailing Suez principles and internals is available when unpacking the Suez distribution available at http://www.ecsl.cs.sunysb.edu/suez.tar. Further information has been obtained via private communication with the authors.


    3.7 Peer-to-Peer Networks

    The peer-to-peer (P2P) networks have gained enormous popularity, both positive and negative, for file sharing and distribution. These systems provide very robust functionality for neighbor discovery and failure tolerance. However, it seems to be hard to find a good compromise between scalability or efficiency and robustness of the whole system. There are several possible architectures employed in P2P networks that achieve different ratios of scalability and robustness. A good overview of P2P architectures is given in [65].

    Pure or decentralized systems. In this model, there is no central authority and all nodes are equal; thus there is no single point of failure. However, due to the lack of hierarchical structure, this class of systems has problems with scalability—e. g. in file sharing networks, this results in flooding the P2P network with search queries.

    Centralized systems. This class of systems is at the opposite extreme of the P2P spectrum. There is a central authority, which is used for directory and discovery services, making searches and discovery very efficient but resulting in a single point of failure. The central node can also become overloaded as both the network and the number of requests grow.

    Super-peer systems. This type of network is similar to the pure system, but it reintroduces a notion of hierarchy built in a robust way. The peers are organized into “clusters” with one node elected as a super-peer, which performs some “server” activities on behalf of all nodes in the “cluster” (e. g. maintains indices and answers queries). The election may be based on a number of parameters like available bandwidth, sufficient processing power, or sufficient storage, and any node can become a super-peer if elected. To increase robustness, it is possible for each cluster to have more than just one super-peer, forming a k-redundant “virtual super-peer”, where k is the number of super-peers per cluster.
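The super-peer election just described can be sketched as a simple top-k selection by capability score. The additive scoring of bandwidth, CPU, and storage used here is purely illustrative (real systems weight and normalize these parameters differently), and all node attributes are hypothetical.

```python
def elect_super_peers(cluster, k=2):
    """Elect k super-peers in a cluster by a simple capability score;
    the top-k set forms the k-redundant "virtual super-peer"."""
    def score(node):
        # Illustrative: an unweighted sum of the node's capabilities.
        return node["bandwidth"] + node["cpu"] + node["storage"]
    ranked = sorted(cluster, key=score, reverse=True)
    return [node["id"] for node in ranked[:k]]

cluster = [
    {"id": "n1", "bandwidth": 100, "cpu": 4, "storage": 10},
    {"id": "n2", "bandwidth": 10,  "cpu": 2, "storage": 5},
    {"id": "n3", "bandwidth": 50,  "cpu": 8, "storage": 50},
]
```

When one elected super-peer fails, the remaining k − 1 continue serving the cluster and a re-election simply promotes the next best-ranked node, which is what makes the scheme robust without a fixed central authority.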

    3.8 OptIPuter

    The OptIPuter project [48, 87] is probably the most advanced project for a general distributed processing environment aimed at utilization of current high-speed optical networks and the powerful distributed computing and storage infrastructure built as a part of Grid projects. Based on the presumption that network capacity grows faster than storage capacity, which in turn grows faster than processing capacity, this project is focused on re-optimizing the entire Grid stack of software layers to enable “wasting” bandwidth and storage in order to conserve rather “scarce” computing resources. The OptIPuter can be understood as a “virtual” parallel computer, in which the individual “processors” are clusters distributed across many regions; the “memory” takes on the form of large and fast distributed data storage; the “peripherals” are, e. g., scientific instruments, displays, or sensor arrays; and the common infrastructure forming the virtual “motherboard” uses a standard IP layer delivered over multiple dedicated lambda circuits3. A prototype of the OptIPuter is being built on campuses and on metropolitan and state-wide optical fiber networks in southern California and in Chicago.

    3The term “dedicated lambda” is used in networking jargon to describe a dedicated circuit based on a separate wavelength (or sometimes even a separate fiber) on the optical layer of the network.

  Chapter 4

    Networks of Active Elements

    A virtual multicasting environment based on an active network element called a “reflector” [32] has been successfully used for user-empowered synchronous multimedia distribution across wide area networks. While it is a quite robust replacement for native, but not reliable, multicast used for videoconferencing and virtual collaborative environments for small groups, its wider deployment is limited by scalability issues. This is especially important when high-bandwidth multimedia formats like Digital Video are used, as the processing and/or network capacity of the reflector can easily be saturated.

A simple network of reflectors [33] is a robust solution minimizing additional latency (the number of hops within the network), but it still has rather limited scalability. In this chapter, we study scalable and robust synchronous multimedia distribution approaches with more efficient application-level distribution schemes. The latency induced by the network is one of the most important parameters, as the primary use is for real-time collaborative environments. We use the overlay network approach, where active elements operate on the application level, orthogonal to the basic network infrastructure. This approach supports stability through component isolation, reducing complex and often unpredictable interactions of components across network layers.

4.1 Synchronous Multimedia Distribution Networks

A synchronous multimedia distribution network, which operates at high capacity and low latency, can be composed of interconnected service elements, the so-called active elements (AEs). They are a generalization of the user-empowered programmable reflector [32].

The reflector is a programmable network element that replicates and optionally processes incoming data, usually in the form of UDP/RTP datagrams, using unicast communication only. If the data is sent to all the listening clients, the number of data copies is equal to the number of the clients, and the limiting outbound traffic grows with n(n − 1), where n is the number of sending clients. The reflector has been designed and implemented as a user-controlled modular programmable router, which can optionally be linked with special processing modules at run-time. It runs entirely in user space and thus works without the need for administrative privileges on the host computer.
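The quadratic growth of the single-reflector outbound traffic can be illustrated with a small calculation (a sketch only; the function name is ours, not from the reflector implementation):

```python
def reflector_outbound_streams(n: int) -> int:
    """Outbound data copies on a single reflector with n active
    (sending and listening) clients: each of the n incoming streams
    is replicated to the n - 1 remaining clients."""
    return n * (n - 1)

# e.g. 10 DV clients at ~30 Mbps each would already need
# 10 * 9 * 30 = 2700 Mbps of outbound capacity
print(reflector_outbound_streams(10))  # 90
```

This is the n(n − 1) limit that the distribution models in Section 4.3 aim to spread across multiple AEs.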

The AEs add a networking capability, i. e. inter-element communication, and also the capability to distribute their modules over a tightly coupled cluster. Only the networking capability is important for the scalable environments discussed in this chapter.

Local service disruptions (element outages or link breaks) are common events in large distributed systems like wide area networks, and maximum robustness therefore needs to be naturally incorporated into the design of synchronous distribution networks. While maximum robustness is needed for the network organization based on out-of-band control messages, in our case built on a user-empowered peer-to-peer (P2P) approach described in Sections 4.2.1 and 4.4, the actual content distribution needs a carefully balanced solution between robustness and performance, as discussed in Section 4.3. The content distribution models are based on the idea that even sophisticated, redundant, and computationally demanding approaches can be employed for smaller groups (of users, links, network elements, . . . ), as opposed to the simpler algorithms necessary for large distributed systems (such as the global Internet). A specialized routing algorithm based on similar ideas has been shown, e. g., as part of the RON approach [2].

FIGURE 4.1: Architecture of the Active Element. [Block diagram: the AE kernel comprises messaging modules, processors, network listeners, and a packet scheduler/sender communicating over shared memory; the Network Management and Network Information Service modules attach to the kernel; data flow and control information paths are distinguished.]

4.2 Active Element with Network Management Capabilities

As already mentioned in Sec. 4.1, the AE is the extended reflector with the capability to create a network of active elements in order to deploy scalable distribution scenarios. The network management is implemented via two modules dynamically linked to the AE at run-time: Network Management (NM) and Network Information Service (NIS). The NM takes care of building and managing the network of AEs, joining new content groups and leaving old ones, and reorganizing the network in case of a link/node failure.

The NIS serves multiple purposes. It gathers and publishes information about the specific AE (e. g. available network and processing capacity), about the network of AEs, and about properties important for synchronous multimedia distribution (e. g. pairwise one-way delay, RTT, estimated link capacity). Further, it takes care of information on the content and the available formats distributed by the network. It can also provide information about special capabilities of the specific AE, such as a multimedia transcoding capability.

The NM and NIS modules can communicate with the AE administrator using the administrative modules of the AE kernel. This leverages the authentication, authorization, and accounting features built into the AE anyway, and it can also use the Reflector Administration Protocol (RAP) [19] enriched by commands specific to NM and NIS. The NM communicates with the Session Management module in the AE kernel to modify packet distribution lists according to the participation of the AE in selected content/format groups.

4.2.1 Organization of AE Networks

For the out-of-band control messages, the AE network uses self-organizing principles already successfully implemented in common peer-to-peer network frameworks [60, 65], namely for AE discovery, discovery of available services and content, topology maintenance, and also for control channel management. The P2P approach satisfies the requirements on both robustness and the user-empowered approach, and its lower efficiency has no significant impact as it routes administrative data only.

The AE discovery procedure provides the capability to find other AEs to create or join the network. The static discovery relies on a set of predefined IP addresses of other AEs, while the dynamic discovery uses either the broadcasting or the multicasting capabilities of the underlying networks to discover the AE neighborhood. Topology maintenance (especially the broadcast of link state information), exchange of information from NIS modules, content distribution group joins and keep-alives, client migration requests, and other similar services also use the P2P message passing operations of the AEs.

4.2.2 Re-balancing and Fail-Over Operations

The topology and use pattern of any network change rather frequently, and these changes must be reflected in the overlay network, too. We consider two basic scenarios: (1) re-balancing is scheduled due to either a use pattern change or the introduction of new links and/or nodes, i. e. there is no link or AE failure, and (2) a reaction to a sudden failure.

In the first scenario, the infrastructure re-balances to a new topology and then switches to sending data over it. Since it is possible to send data simultaneously over both the old and the new topology for a very short period of time (which might result in short-term infrastructure overloading) and either the last reflector on the path or the application itself discards the duplicate data, clients observe a seamless migration and are subject to no delay and/or packet loss due to the topology switch. This scenario also applies when a client migrates to another reflector because of insufficient perceived quality of the data stream.

On the contrary, a sudden failure in the second scenario is likely to result in packet loss (for unreliable transmission like UDP) or delay (for reliable protocols like TCP), unless the network distribution model has some permanent redundancy built in. While multicast doesn't have such a permanent redundancy property, the client perceives loss/delay until a new route between the source and the client is found. Also in an overlay network of AEs without permanent redundancy, the client needs to discover and connect to a new AE. This process can be sped up when the client uses cached data about other AEs (from the initial discovery or as a result of regular updates of the topology). For some applications, this approach may not be sufficiently fast and permanent redundancy must be applied: the client is continuously connected to at least two AEs and discards the redundant data. When one AE fails, the client immediately tries to restore the degree of redundancy by connecting to another AE. The same redundancy model is employed for data distribution inside the network of AEs, so that re-balancing has no adverse effect on the connected clients.

The probability of failure of a particular link or AE is rather small, despite the high frequency of failures in the global view of large networks. Thus the two-fold redundancy (k = 2) might be sufficient for the majority of applications, with the possibility to increase it (k > 2) for the most demanding applications.


    Fast Failure Detection and Recovery for Simple Models without Redundancy

In this section we describe a general algorithm for fast detection of and recovery from an AE failure in the category of simple distribution models.

Definition 4.1 (Simple distribution models) A simple data distribution model is any distribution scheme where data traverses at most two AEs inside the distribution network, i. e. one ingress and one egress AE, with the possibility of the ingress and egress being the same AE. □

Examples of simple models are 2D full-mesh networks (Section 4.3.1) and 3D layered-mesh networks (Sec. 4.3.2), while 3D networks with intermediate AEs (Sec. 4.3.3) and multicast-like schemes (Sec. 4.3.4) are not.

The following preliminary steps are needed for the fast detection algorithm to be put in place:

    1. The client chooses and joins one AE as its primary AE.

2. The client chooses one AE as its backup AE. The client informs the backup AE that it has been chosen as the backup AE for the (client, primary AE) pair.

3. Both the client and the backup AE subscribe to keep-alive messages from the primary AE.

    The failure detection works as follows:

• A failure of the primary AE is recognized by node X (be it the client or the backup AE) when keep-alive messages have not been received for a grace period GRACE(X) (expressed in seconds).

• When the backup AE recognizes a failure of the primary AE, it immediately starts to send data to the client (i. e. it considers the client as if it had just joined).

• When the client recognizes a failure of the primary AE, it immediately joins the backup AE by announcing to the backup AE that it has just become the primary one.
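The grace-period detection above can be sketched as follows (an illustrative sketch only; the class and method names are ours, not taken from the AE implementation):

```python
import time


class KeepAliveMonitor:
    """Grace-period failure detector run by both the client and the
    backup AE to watch keep-alives from the primary AE."""

    def __init__(self, grace_seconds: float):
        self.grace = grace_seconds
        self.last_seen = time.monotonic()

    def on_keepalive(self) -> None:
        # Called whenever a keep-alive from the primary AE arrives.
        self.last_seen = time.monotonic()

    def primary_failed(self) -> bool:
        # Failure of the primary AE is declared once no keep-alive
        # has been received for longer than GRACE(X).
        return time.monotonic() - self.last_seen > self.grace
```

On detection, the backup AE would start forwarding data to the client, while the client would announce itself to the backup AE, as described above.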

For such a model, we observe the following properties (A . . . the primary AE, B . . . the backup AE, C . . . the client, OWD(X → Y) . . . one-way delay from X to Y, t0 . . . the instant of the primary AE failure, GRACE(X) . . . the failure detection period of node X).

• Failure detection of the primary AE by the client

t = t0 + OWD(A → C) + GRACE(C)

• Failure detection of the primary AE by the backup AE

t = t0 + OWD(A → B) + GRACE(B)

• Reception recovery ∆trr

t1 = t0 + OWD(A → C)

t2 = t0 + OWD(A → B) + GRACE(B) + OWD(B → C)

t3 = t0 + OWD(A → C) + GRACE(C) + OWD(C → B) + OWD(B → C)

∆trr = min{t2, t3} − t1

= min{OWD(A → B) + GRACE(B), OWD(A → C) + GRACE(C) + OWD(C → B)} + OWD(B → C) − OWD(A → C)

  • 4.3. DISTRIBUTION MODELS 19

• Distribution recovery ∆tdr

∆tdr = OWD(A → C) + GRACE(C) + OWD(C → B)

• Data hollow in terms of data timestamps ∆tdh (with t1 = t0 + ∆tdr denoting the instant when the distribution recovers)

∆tdh = (t1 − OWD(C → B)) − (t0 − OWD(C → A)) = OWD(A → C) + GRACE(C) + OWD(C → A)

This model assumes a reliable network, i. e. it cannot happen that the backup AE detects a primary AE failure while the primary AE still works fine for the client. This can be improved by the client sending a stop message to the backup AE when it starts receiving the same data from both the primary and the backup AE.
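The timing properties above can be turned into a small calculator (a sketch; the function name and the dictionary layout are ours, chosen for illustration):

```python
def recovery_times(owd, grace):
    """Fail-over timing of the simple-model detection algorithm;
    owd[(X, Y)] is the one-way delay X -> Y and grace[X] the grace
    period GRACE(X), for nodes 'A' (primary AE), 'B' (backup AE)
    and 'C' (client)."""
    t1 = owd[('A', 'C')]                                   # last data from A
    t2 = owd[('A', 'B')] + grace['B'] + owd[('B', 'C')]    # B takes over
    t3 = (owd[('A', 'C')] + grace['C']                     # C detects, joins B
          + owd[('C', 'B')] + owd[('B', 'C')])
    reception_recovery = min(t2, t3) - t1
    distribution_recovery = owd[('A', 'C')] + grace['C'] + owd[('C', 'B')]
    data_hollow = owd[('A', 'C')] + grace['C'] + owd[('C', 'A')]
    return reception_recovery, distribution_recovery, data_hollow

# with 10 ms delays everywhere and 0.5 s grace periods, the client
# resumes reception roughly half a second after the failure
```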

Similar detection models can be used for non-simple networks, but in these networks it is impossible to state general formulas for recovery. In addition to the failure detection delay and the failure announcement distribution latency, which are similar in both simple and non-simple distribution models, the recovery also includes the delay caused by recomputing or rebuilding the distribution model for non-simple models (there is no need to recompute the distribution model for simple ones). This phase is generally hard to estimate as it involves a complex distributed system with many variables. E. g. for multicast-like schemes, there are additional delays stemming from the recomputation and/or distribution of new minimum spanning trees (or, if alternative MSTs are available at each AE, there is still some delay due to the broadcast of the new MST ID for all AEs to switch to the same MST). Furthermore, if the failed AE was the root of the previous MST, a new tree root needs to be elected.

    4.3 Distribution Models

4.3.1 2D Full Mesh

The simplest model with higher redundancy, serving also as the worst-case estimate in terms of scalability, is a complete graph in which each AE communicates directly with all the remaining AEs, as shown in Figure 4.2. This model was studied and described in detail in [33].

    FIGURE 4.2: 2D full mesh.

Definition 4.2 (2D full-mesh network) Let's have a network with AEs and clients, with each AE populated with at least one client. A 2D full-mesh network of AEs is a network in which each AE sends data from each of its clients to every other client connected to the same AE, and it sends the data also to all other AEs in the network. Each AE thus receives the data from all other AEs and sends them to all the clients connected to that AE. □

Let's assume a network of mtot AEs with full-mesh communication. n clients connect to the AEs in such a way that each AE has either nr or nr − 1 clients.

nr = ⌈n/mtot⌉ (4.1)

m1 = nr mtot − n (4.2)

m = mtot − m1 (4.3)

Definition 4.3 (Evenly populated AE network) An AE network with each AE having either nr or nr − 1 clients connected is called an evenly populated AE network. All clients are active, i. e. both sending and receiving. □

    Theorem 4.1 In an evenly populated 2D full mesh of AEs, the inbound traffic is in = n streams.

PROOF When the full mesh operates in an N:N way, the inbound traffic for AEs with nr clients will be

in = nr + (m − 1)nr + m1(nr − 1) (4.4)

(the first term . . . directly connected clients, the second . . . streams from the m − 1 other AEs with nr clients, the third . . . streams from all m1 AEs with nr − 1 clients) and for AEs with nr − 1 clients

in1 = (nr − 1) + m nr + (m1 − 1)(nr − 1) (4.5)

(the first term . . . directly connected clients, the second . . . streams from all m AEs with nr clients, the third . . . streams from the other m1 − 1 AEs with nr − 1 clients).

It can be easily shown that in = in1, and after some simplification the in formula can be written as

in = nr m + m1 nr − m1 (4.6)

Further substituting m and m1 we get

in = n (4.7) □
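Theorem 4.1 can be checked numerically by evaluating (4.4) term by term over many populations (a verification sketch, not part of the thesis implementation):

```python
from math import ceil


def inbound_streams(n: int, m_tot: int) -> int:
    """Inbound streams on an AE with n_r clients in an evenly
    populated 2D full mesh, counted term by term as in (4.4)."""
    n_r = ceil(n / m_tot)
    m1 = n_r * m_tot - n          # number of AEs with n_r - 1 clients
    m = m_tot - m1                # number of AEs with n_r clients
    return (n_r                   # directly connected clients
            + (m - 1) * n_r       # from the other AEs with n_r clients
            + m1 * (n_r - 1))     # from the AEs with n_r - 1 clients


# Theorem 4.1: the inbound traffic always equals n
assert all(inbound_streams(n, m) == n
           for n in range(1, 40) for m in range(1, n + 1))
```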

Theorem 4.2 In an evenly populated 2D full mesh of AEs, the limiting traffic in this mesh is the outbound traffic on the AE, which is out = nr²mtot + nr(mtot − 2) streams.

PROOF The outbound traffic for an AE with nr clients will be

out = (nr − 1)nr + (m − 1)nr² + m1 nr(nr − 1) + nr(m + m1 − 1) (4.8)

with terms labeled 7, 8, 9, and 10 respectively (7 . . . from directly connected clients to directly connected clients (the AE doesn't send data back to the client which sent them!), 8 . . . data from the m − 1 AEs with nr clients to all nr own clients, 9 . . . data from all m1 AEs with nr − 1 clients to all nr own clients, 10 . . . data sent to the other m + m1 − 1 AEs) and for an AE with nr − 1 clients

out1 = (nr − 2)(nr − 1) + m nr(nr − 1) + (m1 − 1)(nr − 1)² + (nr − 1)(m + m1 − 1) (4.9)

with terms labeled 11, 12, 13, and 14 respectively (11 . . . from directly connected clients to directly connected clients (the AE doesn't send data back to the client which sent them!), 12 . . . data from the m AEs with nr clients to all nr − 1 own clients, 13 . . . data from the other m1 − 1 AEs with nr − 1 clients to all nr − 1 own clients, 14 . . . data sent to the other m + m1 − 1 AEs). The numbers in the equations correspond to the numbers in Figure 4.3.

FIGURE 4.3: Flow analysis in the full 2D mesh of AEs. Bottom and right AEs are populated with nr − 1 clients, while top and left AEs are populated with nr clients.

It can be easily shown that

out1/out = (nr − 1)/nr

and we can also use just the simplified out formula

out = nr(nr m + m1 nr + m − 2) (4.10)

as out > out1 for nr ≥ 1 and m + m1 ≥ 2. For m + m1 < 2 the full mesh loses sense, and thus out is the limiting value for the outbound traffic.

Further substituting m and m1 we get

out = nr(mtot + n − 2) = nr²mtot + nr(mtot − 2) (4.11) □

If we instead substitute nr from (4.1) (which is not precise due to the ceiling function), we get

out = n(mtot + n − 2)/mtot (4.12)

and the ratio between out for the full mesh of AEs and a single AE with out = n(n − 1) is

ratio = (mtot + n − 2) / (mtot(n − 1)) (4.13)
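Equation (4.8) and the closed form in (4.11) can likewise be cross-checked numerically (a verification sketch; the function name is ours):

```python
from math import ceil


def outbound_streams(n: int, m_tot: int) -> int:
    """Limiting outbound traffic (4.8) on an AE with n_r clients in
    an evenly populated 2D full mesh."""
    n_r = ceil(n / m_tot)
    m1 = n_r * m_tot - n
    m = m_tot - m1
    return ((n_r - 1) * n_r           # own clients, minus the sender itself
            + (m - 1) * n_r * n_r     # streams from other n_r-client AEs
            + m1 * n_r * (n_r - 1)    # streams from (n_r - 1)-client AEs
            + n_r * (m + m1 - 1))     # copies sent to the other AEs


# matches the closed form out = n_r (m_tot + n - 2) of (4.11)
assert all(outbound_streams(n, m) == ceil(n / m) * (m + n - 2)
           for n in range(2, 40) for m in range(1, n + 1))
```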

FIGURE 4.4: Behavior of the 2D full mesh for DV clients: dependence of the limiting outbound (sent) traffic in Mbps on the number of 30 Mbps clients (1–15), plotted for m = 1, 2, 4, 8, and 12 AEs in the mesh, with the saturation level of Gigabit Ethernet marked.

    Fail-Over Operation

When a link or a whole AE drops out in the full mesh, the accident only influences data distribution from/to the clients connected to that AE. In case of a link failure inside the AE mesh, the client is requested to migrate to an alternative AE. In case the AE itself fails, the client initiates the migration on its own. Alternative AEs should be selected randomly to distribute the load increase more evenly; the load increase will be at most ⌈nr/(m − 1)⌉. When even this migration delay is not acceptable, it is possible for a client to be permanently connected to an alternative AE and just switch the communication over. For even more demanding applications, the client can use more than one AE for sending in parallel.

Although this model seems to be fairly trivial and not that interesting, it has two basic advantages: first, the model is robust and a failure of one node influences only the data from/to the clients connected to that AE. Second, it introduces only minimal latency because the data flows over two AEs at most. Next we will examine another model that has the same latency and robustness properties but that scales better.

4.3.2 3D Layered-Mesh Network

The layered-mesh model creates k layers, in each of which data from only a single AE are distributed, as shown in Figure 4.5. One layer is thus similar to the 2D full-mesh network except that only one AE is both sending and receiving in each layer. Each client is connected to one layer for both sending and receiving (sending only if nr = 1; to receive data from the clients sending via other layers, the client also needs to receive data from the remaining nr − 1 clients of the AE used for sending) and to all other layers for receiving only. Each layer comprises m AEs. For the sake of simplicity, we first assume that k = m and each AE has nr clients, thus nr = n/m = n/k.

Definition 4.4 An active AE is an AE with clients that are both sending and receiving. A non-active AE is an AE with all clients receiving only. □


    FIGURE 4.5: 3D layered mesh.

Definition 4.5 (3D layered-mesh network) A 3D layered-mesh network is a network of AEs organized into k layers with m AEs in each layer. Each layer is used to distribute data from the clients connected to one active AE. Thus each client is connected to one layer for sending and receiving and to all other layers for receiving only. □

Theorem 4.3 In a 3D layered-mesh network with each AE having nr clients, each AE has in = nr inbound streams.

PROOF The active AE has nr clients and thus it receives in = nr streams. Each non-active AE receives in = nr streams from the active AE to distribute to its clients. □

Theorem 4.4 In a 3D layered-mesh network with each AE having nr clients, the active AEs have outs/r = nr² + nr(m − 2) outbound streams, and the non-active AEs have outr = nr² outbound streams.

PROOF The number of output streams for an active AE with both sending and receiving (s/r) clients is

outs/r = nr(nr − 1) + nr(m − 1) = nr(nr + m − 2) = nr² + nr(m − 2) (4.14)

where the first term is for the directly connected clients and the second term is for all the remaining m − 1 non-active AEs in the same layer.

For a non-active AE that has only receiving (r) clients connected,

outr = nr² (4.15)

because the non-active AE distributes nr streams (from the nr clients of the active AE) to its own nr clients.

It is obvious that for nr ≥ 1 and m > 2 it always holds that outs/r > outr, for m = 2, outs/r = outr, and for m < 2 the distribution network doesn't make sense. Thus outs/r can be seen as the limiting traffic. □


The limiting throughput occurs on the AEs to which the sending clients are connected. Thus the ratio between such a mesh (with a total number of clients n = nr m) and a single AE is

ratio = (n + m(m − 2)) / (m²(n − 1)) (4.16)

while using a total of km = m² AEs.

This model is problematic because of the quadratic increase in the number of AEs used. However, it seems to be the last model that doesn't introduce intermediate hops and thus keeps latency at the minimum.
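The limiting traffic (4.14) and the ratio (4.16) can be evaluated numerically (a sketch; the function names are ours, chosen for illustration):

```python
def layered_mesh_limit(n_r: int, m: int) -> int:
    """Outbound streams (4.14) of the active AE in one layer of the
    3D layered mesh: replication to its own clients plus copies to
    the m - 1 non-active AEs of the layer."""
    return n_r * (n_r - 1) + n_r * (m - 1)


def mesh_vs_single_ratio(n_r: int, m: int) -> float:
    """Ratio (4.16) between the layered-mesh limit and the n(n - 1)
    limit of a single AE serving all n = n_r * m clients."""
    n = n_r * m
    return layered_mesh_limit(n_r, m) / (n * (n - 1))


# 4 AEs per layer with 3 clients each: 15 streams on the busiest AE
# instead of 132 on a single AE serving all 12 clients
```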

    Transition from 3D to 2D mesh

It is possible to perform a vertical aggregation of AEs across the 3D layers to obtain the 2D full-mesh model, as proved in Theorem 4.5.

Theorem 4.5 The 3D layered-mesh model is an extension of the 2D full-mesh model, and the latter model can be obtained by aggregating the AEs of the former.

PROOF To do so, we merge the AEs that are positioned above each other ("flatten the layers"). In such a case, the AE is used once as a sending/receiving AE and m − 1 times as a receiving-only AE.

Thus the number of input streams is m times nr (since once it gets nr as both a sending and receiving AE and m − 1 times it gets nr as a receiving-only one)

in = m nr = n (4.17)

This relation is the same as (4.7). For the number of output streams, it follows that

out = [nr² + nr(m − 2)] + (m − 1)nr² = nr²m + nr(m − 2) (4.18)

where the first bracketed term is the one occurrence of the AE in the sending/receiving role and the second term covers the m − 1 occurrences of the AE in the receiving-only role. The number of outbound streams is obviously equal to (4.11). Thus we have proved that the 2D full-mesh model is just a special variant of the 3D layered-mesh model. □
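The aggregation argument can also be verified numerically by comparing the flattened layered mesh with a 2D full mesh in which every AE has nr clients (a verification sketch; function names are ours):

```python
def full_mesh_out(n_r: int, m: int) -> int:
    """Outbound streams of an AE in a 2D full mesh of m AEs, each
    with n_r clients (equation (4.8) with m1 = 0)."""
    return (n_r - 1) * n_r + (m - 1) * n_r ** 2 + n_r * (m - 1)


def flattened_layered_out(n_r: int, m: int) -> int:
    """Merge the m vertically stacked AEs of the 3D layered mesh:
    one active role (4.14) plus m - 1 receive-only roles (4.15)."""
    return (n_r ** 2 + n_r * (m - 2)) + (m - 1) * n_r ** 2


# Theorem 4.5: the two models agree for every population
assert all(full_mesh_out(r, m) == flattened_layered_out(r, m)
           for r in range(1, 10) for m in range(2, 10))
```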

    Fail-Over Operation

Each of the mesh layers monitors its connectivity. When some layer disintegrates and becomes discontinuous, the information is broadcast throughout the layer and to its clients. The clients that used that layer for sending are requested to migrate to a randomly chosen layer from the remaining k − 1 layers, and the listening-only clients simply disconnect from this layer. Such behavior increases the load on the remaining k − 1 layers, but as the clients choose the new layer randomly, the load increases in a roughly uniform way by at most ⌈nr/(k − 1)⌉.

4.3.3 3D Layered Mesh of AEs with Intermediate AEs

Definition 4.6 (q-nary distribution tree) The q-nary distribution tree is a directed acyclic graph in which each parent node has q child nodes. Data in the distribution tree are distributed according to the orientation of the edges. □

Definition 4.7 (3D layered mesh with intermediate AEs) The 3D layered mesh with intermediate AEs is a layered structure where each layer is organized as follows: each layer has one active AE that is the root of the q-nary distribution tree in that layer. Receiving-only clients are connected to the m − 1 leaf AEs of the distribution tree. □


Definition 4.8 (Intermediate AE) An intermediate AE is an AE that doesn't have any clients directly connected. Alternatively, it is any AE that is neither active nor non-active. □

Let's create a q-nary tree used for distributing data from the AE with sending clients to the m − 1 AEs with listening clients. When building a q-nary tree with λ intermediate layers

λ = logq(m − 1) − 1, (4.19)

    the
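For illustration, the number of intermediate layers can be computed by growing the tree level by level (a sketch; we assume the depth is rounded up when m − 1 is not an exact power of q):

```python
def intermediate_layers(m: int, q: int) -> int:
    """Number of intermediate layers (lambda in (4.19)) of a q-nary
    distribution tree reaching the m - 1 leaf AEs from the root
    (active) AE; growing the tree level by level rounds the depth
    up when m - 1 is not an exact power of q."""
    leaves, depth = 1, 0
    while leaves < m - 1:
        leaves *= q
        depth += 1
    return max(0, depth - 1)


# a binary tree (q = 2) reaching 8 leaf AEs (m = 9) needs two
# intermediate layers: 1 -> 2 -> 4 -> 8
```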