Overview of a scalable Network Monitoring Architecture
Transcript of Overview of a scalable Network Monitoring Architecture
Providing communication infrastructure characteristics to grid-aware applications
Augusto Ciuffoletti, Dipartimento di Informatica, Univ. di Pisa
The environment
A Grid is a network of edge services, seamlessly accessible by authorized users in order to perform distributed computations.
Users optimize their operations by selecting edge services that fit two basic criteria:
• offer the appropriate level of service;
• are able to perform efficiently a distributed computation.
The application of the second criterion is based upon the knowledge of the potential performance of the communication infrastructure that supports the distributed computation.
GGF Grid Monitoring Architecture
• controls a complex monitoring system;
• is fault tolerant;
• minimizes overhead (over monitored resources);
• makes available the results of the monitoring activities, also implementing security;
• integrates different monitoring tools;
• scales well with system size and probe frequency.
An ancestor: the Network Weather Service
[Figure: the NWS architecture (nws.jpg)]
The components of a NWS resource monitoring system:
• a nameserver, a centralized controller that keeps a registry of all components and monitoring activities;
• sensors, that produce resource observations;
• memories, that store resource observations;
• forecasters, that process resource observations.
Limits of the NWS architecture, bound to the presence of the nameserver:
• a communication bottleneck;
• a single point of failure;
• no security.
From NWS to LDAP
One limit of NWS is centralized control: in order to obtain data, one has to address (possibly indirectly) the nameserver, specifying the id of the sensor.
NWS was an "all inclusive" architecture, more a proof of concept than a real application. It was not applicable at Grid scale, mostly for the absence of a real database.
The distributed nature of the LDAP architecture is appealing from this point of view: information is organized hierarchically, and the resulting tree may be distributed over different servers.
In a Grid perspective, the hierarchy often reflects the organization of the Grid: the first level under the root is composed of continental networks, below are national networks and next local networks. Leaves are single resources, like clusters of computers, single computers or storage elements.
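As an illustration, the sketch below renders such a hierarchy as LDAP-style distinguished names. All names (networks, resources, the o=grid suffix) are made up for the example; a real Grid information tree would use its own schema.

```python
# Toy Grid resource hierarchy rendered as LDAP-style DNs.
# All names below are hypothetical, chosen only for illustration.
hierarchy = {
    "eu-net": {                                      # continental network
        "it-net": {                                  # national network
            "pisa-lan": ["cluster01", "storage01"],  # local network -> leaves
        },
    },
}

def dns(tree, suffix="o=grid"):
    """Yield one DN per leaf resource, most specific component first."""
    for name, child in tree.items():
        rdn = f"ou={name},{suffix}"
        if isinstance(child, dict):
            yield from dns(child, rdn)
        else:
            for leaf in child:
                yield f"cn={leaf},{rdn}"

for dn in dns(hierarchy):
    print(dn)
# cn=cluster01,ou=pisa-lan,ou=it-net,ou=eu-net,o=grid
# cn=storage01,ou=pisa-lan,ou=it-net,ou=eu-net,o=grid
```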
Globus and LDAP
The well known Globus toolkit is based on LDAP.
However, the LDAP architecture is appropriate to store static information, like the number of processors in a cluster, or the size of a disk partition.
When data are more volatile, the LDAP architecture is the wrong choice: it is unsuitable to support frequent write operations.
Examples of characteristics that induce frequent writes:
• available computing power on a computing resource (e.g. number of idle nodes in a cluster);
• percentage of free space in a partition;
• average round-trip time on a link.
In all these cases the LDAP architecture suffers serious performance and scalability problems.
From Globus LDAP to R−GMA SQL
SQL provides a relational database. There are several excellent implementations of this kind of database, designed for extremely demanding applications.
They offer a good compromise between distribution and the cost of query operations.
In particular, the database can be replicated in order to improve scalability and fault tolerance, as long as the number of query operations dominates the number of writes.
The R-GMA architecture was developed to address this problem: the scalability of the architecture is further improved by introducing components that combine data from the database and cache the results (similar to NWS forecasters).
Scalability of an end−to−end architecture
When applied to network monitoring, the above architectures overlook a problem which becomes evident when Grid size scales up: the number of network resources, intended as paths interconnecting edge resources, grows with the square of Grid size.
The case of resources that are bound to a node is quite different from the case of resources that represent communication between nodes: the former grow with the number of nodes, the latter with its square.
As a consequence, representing communication resources using end-to-end characteristics may limit the scalability of a monitoring architecture.
In addition, the availability of end-to-end measurements implies that:
• nodes that host Grid resources also support network monitoring protocols, and
• each pair of nodes hosting Grid resources should generate network monitoring traffic, with an increment of network load that grows with the square of the number of nodes.
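The quadratic growth is easy to quantify. With N edge nodes, a full mesh of directed end-to-end sessions needs N(N-1) measurements; grouping the nodes into D domains, each probed through a single representative, reduces this to D(D-1). A minimal sketch (the node and domain counts are illustrative):

```python
def full_mesh_sessions(n):
    """Directed end-to-end sessions in a full mesh of n edge nodes."""
    return n * (n - 1)

def domain_sessions(d):
    """Directed sessions when monitoring only d domain representatives."""
    return d * (d - 1)

# Illustrative figures: 1000 edge nodes grouped into 20 domains.
print(full_mesh_sessions(1000))  # 999000
print(domain_sessions(20))       # 380
```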
A user perspective
This fact raises a number of problems:
• the size of a database containing network service characteristics grows rapidly;
• the characteristics hardly track a dynamic infrastructure (routes may change);
• each edge service becomes eligible as an active network monitoring node (which consumes local and network resources).
On the other hand, a network service that does not reflect an end-to-end path is of little help for the user that wants to know, for instance, how fast data will flow from a storage to a computing facility.
In order to bring this problem to a manageable size, we need to take into account a peculiarity of a Grid:
groups of resident services are controlled by distinct administrations, which manage the infrastructure among local services and commit the communication infrastructure with the rest of the Grid to third parties.
A rationale behind Grid partitioning
Groups of resident services are controlled by distinct administrations, which manage the infrastructure among local services and commit the communication infrastructure with the rest of the Grid to third parties.
The above point authorizes a hierarchical view of a Grid, as composed of domains interconnected by an inter-domain infrastructure.
However, there is a good reason for not extending the hierarchical decomposition beyond the first level, thus avoiding the introduction of sub-domains and further levels.
Our point is that network service characteristics are difficult to compose: for instance, from the fact that l(ab) is the loss rate from a to b, and l(bc) is the loss rate from b to c, one can hardly guess the value of l(ac).
In our view, decomposition is useful only if one is able to guess which of the components dominates the others when determining a characteristic of the overall network service.
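Even under the optimistic assumptions that the a-to-c route actually traverses b and that losses on the two segments are independent, the best one can do is l(ac) = 1 - (1 - l(ab))(1 - l(bc)); when either assumption fails (and routing gives no guarantee that it holds), even this estimate is unfounded. A small sketch:

```python
def composed_loss(l_ab, l_bc):
    """Loss rate of the concatenated path a->b->c, assuming the a->c route
    really traverses b and that losses are independent -- two assumptions
    that a real network need not satisfy."""
    return 1 - (1 - l_ab) * (1 - l_bc)

# 1% loss on a-b and 2% loss on b-c compose to roughly 2.98% -- but only
# under the stated assumptions.
print(round(composed_loss(0.01, 0.02), 4))  # 0.0298
```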
Basic issues about Grid partitioning
Groups of resident services are controlled by distinct administrations, which manage the infrastructure among local services and commit the communication infrastructure with the rest of the Grid to third parties.
There are two distinct scenarios that need to be considered:
• if the inter-domain infrastructure is based on a classical packet switching technology, it is exposed to be the bottleneck of an edge-to-edge path.
This conclusion comes from the consideration that single administrations try to exploit as much as possible their share of an expensive infrastructure: there is little sense in a community that leases a 1 Gbps long-haul interconnection while the internal connectivity is based on a 100 Mbps infrastructure.
In a "store and forward" Grid where traffic is reasonably engineered, characteristics should be mainly determined by the inter-domain fabric.
• if the inter-domain infrastructure is based on an optical technology, it is expected not to be the bottleneck of an edge-to-edge path.
In that case the inter-domain infrastructure is over-dimensioned: it is less expensive than the intra-domain one.
The economic point of view is opposite with respect to the previous one: the consequence is that the bottleneck will probably be within the domain.
This reduces the task of monitoring the overall NxN grid to monitoring N domains.
We conclude that, in both scenarios, recognizing the "special role" played by the inter-domain infrastructure helps the task of monitoring the network infrastructure.
This is also consistent with the architectural foundations of the DiffServ architecture.
A network of concepts
We have defined three concepts:
• edge service: a Grid service provided by a host or cluster;
• domain: a set in a partition of the edge services;
• network service: a Grid service provided by the communication infrastructure to two domains.
We need two more key concepts in order to complete our model: theodolite service and multihome.
A theodolite service is an instance of edge service, which consists in representing an entry point of a certain domain. For instance:
• in a best effort based Grid, it may consist in a node that hosts monitoring tools, and is a target for theodolites of other domains;
• in a differentiated services based Grid, it may implement what RFC 2475 calls an ingress or egress point.
A multihome is an entity that collects several aliases of the same edge service, when they must be included into different domains in order to comply with the basic requirement (an example follows).
[Figure: UML diagram of the concepts]
A simple example
More complex situations:
• more theodolites to reflect distinct network service supports;
• more network services to reflect distinct classes of service.
Multihomed storage
In the example, storage S1 will be mapped into a multihome with two members: one in the same domain as Ca, and the other in the same domain as Cb.
GlueDomains: concepts@work
GlueDomains is a prototype implementation of the concepts introduced so far: it is included in the official Italian release of LCG 6.0.
We envision a best effort only Grid, where theodolites run active network monitoring tools.
Globus MDS is used to make network service characteristics available to users (typically, grid-aware applications or Grid monitoring tools like GridICE).
[Figure: the GlueDomains architecture]
The overall architecture is split into four modules (disregarding monitoring tools):
• the GlueDomains database, used to map services to domains and to store theodolite descriptions;
• the GlueDomains theodolite management, which controls the activity of theodolite services;
• the GMA plugin, which is in charge of transferring observations from the monitoring tools to the GMA back end;
• the GMA back-end, which submits the GMA plugin output to the GMA repository.
GlueDomains database
The GlueDomains database implementation is presently centralized, which limits the scalability of the whole structure to a few tens of domains (based on a prospective simulation).
However, the present solution exhibits several interesting features:
• a monitoring host autonomously downloads the description of its activity from the GlueDomains server: once the gluedomains daemon is started, it detects network interfaces and configures the monitoring sessions without human intervention;
• a monitoring host keeps its activity synchronized with the content of the database;
• the inability to carry out a certain monitoring task is harmless, as is the unreachability of the centralized database;
• the update of the database is semi-automatic: a script helps the compilation of its content, starting from XML descriptions of the activity of each monitoring host. In practice this means that the human interface is O(n).
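The XML description format is not shown here, so the sketch below assumes a plausible shape (one session element per monitoring activity) purely to illustrate how such a compilation script might read it; the real GlueDomains format may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-host activity description (the real format may differ).
doc = """
<monitoring-host name="theodolite1.example.org">
  <session tool="ping"  target="theodolite2.example.org" period="60"/>
  <session tool="iperf" target="theodolite3.example.org" period="3600"/>
</monitoring-host>
"""

root = ET.fromstring(doc)
host = root.get("name")
# One row per monitoring session, ready to be inserted into the database.
rows = [(host, s.get("tool"), s.get("target"), int(s.get("period")))
        for s in root.findall("session")]
for row in rows:
    print(row)
```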
We plan to shift to a fully distributed database, cached on each monitoring host, using a gossip based technique.
In this frame, the MySQL technology will probably be dropped: a new release, not yet distributed, implements a SOAP interface to the database.
Theodolite management
Theodolite management is structured as a hierarchy of processes, rooted in the GlueDomains daemon.
The GlueDomains daemon spawns theodolite processes, each in charge of controlling the monitoring activity over an interface.
A binding between interface and theodolite has been introduced, which is not implicit in the conceptual framework.
Each theodolite periodically checks its activity against the database content, in order to keep the two synchronized.
Each theodolite spawns session controllers, each controlling a specific monitoring session.
In case a session controller terminates, the theodolite re-spawns it, after reloading the description from the database and waiting an exponentially growing, randomly biased delay.
The session controller invokes a wrapper of the appropriate monitoring tool, which parses and redirects the output of the tool to the GMA plugin.
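The respawn delay policy can be sketched as follows; the base delay, cap and jitter values are illustrative, since they are not specified here:

```python
import random

def respawn_delay(attempt, base=1.0, cap=300.0, jitter=0.5):
    """Exponentially growing, randomly biased delay (in seconds) before a
    dead session controller is re-spawned. Parameter values are illustrative."""
    delay = min(cap, base * 2 ** attempt)                   # exponential growth, capped
    return delay * random.uniform(1 - jitter, 1 + jitter)   # random bias

# Successive failures back off roughly as 1, 2, 4, 8, ... seconds, +/- 50%.
for attempt in range(5):
    print(round(respawn_delay(attempt), 2))
```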
Monitoring tools
To improve modularity, each monitoring tool is packaged in a separate RPM, which includes the appropriate GlueDomains wrapper.
The prototype currently provides three monitoring tools:
• a basic ping facility, built using the Perl library Net::Ping.
♦ It is introduced for basic testing purposes, and is used to check connectivity.
♦ Its operation does not depend on neighbor configuration.
• a more sophisticated delay measurement tool, which provides one-way jitter estimates using a convex hull algorithm.
♦ It provides useful information, especially for data transfer operations.
♦ Its operation depends on neighbor configuration.
• the well known iperf facility, that measures available bandwidth by saturating the line.
♦ It provides low level information, at a high cost.
♦ Its operation depends on neighbor configuration, and a single shot can be requested by authorized users only.
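The ping facility relies on Perl's Net::Ping; a rough Python equivalent of its TCP mode (a connectivity check rather than an ICMP echo) might look like this:

```python
import socket

def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.
    Mimics Net::Ping's "tcp" mode: a connectivity check, not a latency probe."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a throwaway local listener.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))              # let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]
print(is_reachable("127.0.0.1", port))  # True: something is listening
srv.close()
print(is_reachable("127.0.0.1", port))  # False: connection refused
```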
Prototype set up and deployment
Currently, GlueDomains is deployed on about ten hosts, and is included in the GridICE Grid Management project.
The site in Bologna supports the GlueDomains database server, and tests are performed in a complete mesh.
The first official release came after (only) four testing releases, with no substantial rewriting required so far: mainly, each new test release introduced new features.
The architecture has proven to be reasonably GMA-independent: from the 1st to the 2nd testing release we replaced an R-GMA plugin with an MDS oriented one, which publishes through the GridICE GMA.
The current distribution is RPM oriented and self-installing: to install a new monitoring host, the local administrator simply installs the required packages, and untars an archive containing a few sensitive items, like database passwords.
[Figure: deployment map]
Security issues in GlueDomains
GlueDomains is aware of security problems, but provides only very basic security:
• read access to the database is protected by a password, shared by all monitoring hosts;
• write access is protected by another password;
• the tools use passwords to ensure the identity of the partners, and capabilities to invoke on-demand sessions;
• data transfer between the Information Provider and the MDS is assumed to occur within a protected environment, and is not encrypted.
However, we provided "handles" to implement more rigorous security policies, and more is on the way.
Future work (CoreGRID)
The experience with GlueDomains continues within the European CoreGRID project.
Within this frame, we are re-designing GlueDomains in order to take into account issues that were set aside during the design of the first prototype:
• full awareness of scalability problems:
♦ the scalability premises of the basic idea were not fully implemented in the first prototype (a centralized database was used to keep configuration data);
♦ ping-like, full mesh sessions do not scale (traffic grows with the square);
• full awareness of security issues:
♦ password based security of the database is insufficient;
♦ on-demand sessions need identification tools.
Passive monitoring
Passive monitoring eliminates the problem of polynomial growth of monitoring sessions (and consequently, monitoring traffic).
However, the problem is moved inside the hosts that perform passive monitoring: the number of packets they examine grows with the square of the number of sites. This is probably preferable.
Such hosts might use the content of the domain description database to infer which traffic is representative of the performance of the network infrastructure between two domains.
In this view, active monitoring tools are helpful only when passive monitoring does not find useful traffic patterns.
In such an event, hosts performing passive monitoring might invoke the execution of active monitoring sessions that will either produce observations, or induce significant traffic patterns.
Such requests should be authenticated, for security reasons.
Distributed database
The database that describes the overlay domain topology is stored in a distributed database.
Every Network Monitoring Element (the theodolite, in the previous terminology) hosts a proxy of this abstract entity.
Such a component keeps an updated local image of the database. Updates are propagated using an epidemic, peer-to-peer protocol.
The footprint of such a protocol is kept under control by adaptively changing the number of tokens in the system, using a fully distributed feedback algorithm.
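A minimal sketch of the epidemic push phase: in each round, every node that already holds an update forwards it to a few random peers, so coverage grows exponentially and full propagation takes O(log n) rounds in expectation. The fan-out and node count below are illustrative, and the token-based feedback that bounds traffic is omitted.

```python
import random

def gossip_rounds(n_nodes=1000, fanout=2, seed=42):
    """Simulate epidemic (push) propagation of a single update and return
    the number of rounds until every node holds it. Illustrative parameters."""
    rng = random.Random(seed)
    informed = {0}                     # node 0 injects the update
    rounds = 0
    while len(informed) < n_nodes:
        pushed = set()
        for _ in informed:
            # each informed node pushes the update to `fanout` random peers
            pushed.update(rng.randrange(n_nodes) for _ in range(fanout))
        informed |= pushed
        rounds += 1
    return rounds

print(gossip_rounds())   # full coverage after O(log n) rounds
```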
The simulation of such a protocol in a system of 1000 NMEs gave encouraging results:
• update frequency: 1/5 per second
• overall traffic: 5 MB/sec
• update latency: 40 secs
with a setup which is appropriate for a classical packet switching environment.
The simulation proved that the algorithm is self-stabilizing under severe stress (for instance, instantaneously doubling the number of NMEs).
Layout of the NME
The control plane is in charge of interactions with:
• user applications, that may request the execution of sessions and read/write access to the topology database;
• the GIS, that publishes the results of the monitoring activity;
• the Certification Authority, to retrieve certificates.
Two low level modules are envisioned:
• the network monitoring sensor, which performs passive monitoring;
• the module that implements the database proxy.
Summary
A modular approach to Grid Connectivity Monitoring
• The environment
• GGF Grid Monitoring Architecture
• An ancestor: the Network Weather Service
• From NWS to LDAP
• Globus and LDAP
• From Globus LDAP to R-GMA SQL
• Scalability of an end-to-end architecture
• A user perspective
• A rationale behind Grid partitioning
• Basic issues about Grid partitioning
• A network of concepts
• A simple example
• Multihomed storage
• GlueDomains: concepts@work
• GlueDomains database
• Theodolite management
• Monitoring tools
• Prototype set up and deployment
• Security issues in GlueDomains
• Future work (CoreGRID)
• Passive monitoring
• Distributed database
• Layout of the NME