[IEEE 2011 IFIP/IEEE International Symposium on Integrated Network Management (IM 2011) - Dublin,...


12th IFIP/IEEE IM 2011: Application Session

ABSTRACT

Shyyunn Sheran Lin, Gregory S. Thompson, Viren Malaviya, SSTG, Cisco Systems, 170 W Tasman Drive, San Jose, California, U.S.A.

([email protected], [email protected])

IEEE IM 2011, Cisco Systems

Cisco.com

A major task of managing a computer network is to gather the inventory of the devices, including hardware, software, and the configuration. It is desirable to collect this information efficiently. Traditional Network Management tools require an appliance that resides on the network to collect device information. This paper introduces a distributed, embedded approach to collect network device information without an extra appliance. Utilizing device programmability with add-on scripts, the network devices can communicate with each other and perform the inventory collection tasks. The collected inventory information can be sent to network management stations or hosted inventory reporting applications. This mechanism is used on Cisco devices without any device OS upgrade by using a set of scripts to collect CLI and SNMP information from the devices. Utilizing the computing power of multiple networking devices, this distributed mechanism can concurrently collect the whole network inventory efficiently. This paper gives an overview of the approach, then drills into the architecture, the mechanism of the scripts, the coding standards, performance, impact on the devices, scalability, and benefits, and discusses future work.

978-1-4244-9221-3/11/$26.00 ©2011 IEEE


This solution uses an embedded approach to collect device inventory information, taking advantage of device programmability. In this case, scripts were developed to program the devices. The scripts are downloaded from a company website and installed on a Gateway device at the customer site; the Gateway device then pushes the scripts to several selected devices. These devices communicate with other devices and gather the required SNMP MIB information and configuration information by invoking OS native commands. The collection can be sent to a hosted inventory application over the Internet or to a network management application hosted at the customer's premises.

The devices in the customer network are categorized into three roles: Gateway, Collector, and End devices. The Gateway device is a router that runs the program that gathers information from the Collectors and sends the whole collection to an application backend server via the Demilitarized Zone (DMZ). The Collector is a router that runs the scripts to collect its own device information and that of neighboring devices, and sends the collection back to the Gateway. The End devices can be any routers or switches. End devices do not need to have the scripts installed; as long as they can respond to SNMP and CLI commands, their data can be collected. All Gateway, Collector, and End devices are connected in the network.

In the initial phase of the approach, there is only one Gateway device, which is registered with the backend application. The scripts are downloaded and installed on this device, and the customer provides a seed file that describes the hierarchy of Gateway, Collector, and End devices, including which End devices belong to which Collector. There can be many Collectors, and each one is responsible for several End devices. The Gateway device is the master and center of the whole collection; it is responsible for pushing the scripts to the selected Collectors and starting the collection. When a collection starts, the Gateway device directs the data collection by parsing the master seed file, identifying Collectors, and spawning slave collection policies on each Collector. Inside the Collector, a separate thread is spawned for each device and collection type so that it can communicate with End devices in parallel. Gateways can also be Collectors, so if the network is small, one Gateway device can perform the collection tasks itself and separate Collectors can be eliminated. Once the remote Collectors are spawned, the Gateway collects any End devices assigned to it. After collecting its own End devices, the Gateway sleeps, periodically waking up to check for incoming collections from the Collectors. When all data has been sent to the Gateway, the data is archived as a complete inventory and a transport policy is launched to send the data securely to remote applications.
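The control flow described above can be sketched as follows. This is a minimal illustration in Python, not the EEM TCL policies the collector is actually built from; the seed layout, the `collect_device` stub, and the use of a thread pool are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def collect_device(device):
    """Stand-in for the SNMP and CLI collection of one device (hypothetical)."""
    return {"device": device, "inventory": f"inventory of {device}"}

def run_collector(end_devices, max_threads=4):
    """A Collector gathers its assigned End devices in parallel."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(collect_device, end_devices))

def run_gateway(seed):
    """Start every Collector, collect the Gateway's own End devices,
    then wait for the Collectors and merge everything into one inventory."""
    workers = max(1, len(seed["collectors"]))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {name: pool.submit(run_collector, devices)
                   for name, devices in seed["collectors"].items()}
        inventory = run_collector(seed["gateway_devices"])
        for name in sorted(pending):
            inventory.extend(pending[name].result())
    return inventory

# Hypothetical seed data: one Gateway with its own End device and two Collectors.
seed = {
    "gateway_devices": ["gw-end-1"],
    "collectors": {"collector-1": ["c1-end-1", "c1-end-2"],
                   "collector-2": ["c2-end-1"]},
}
inventory = run_gateway(seed)
print([item["device"] for item in inventory])
```

In the real system the Gateway polls for incoming archives instead of blocking on a result, but the ordering is the same: spawn the Collectors first, do local work, then gather.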

To qualify as a Gateway or Collector, a device must have the Embedded Event Manager feature, security PKI support, and some free local storage for each device collected. Free local storage is usually the limiting factor on how many devices any given Collector can inventory. The Gateway device also needs outside connectivity if it has to send the collection to an application outside the customer network. The collection can also be sent to an application that is hosted at the customer's premises.


Script Based Embedded Collection Approach


Utilizes device programmability; scripts are downloaded to the device

Once set up, collection works transparently in the background

Using the devices themselves for collection saves administering a separate device, and power/HVAC load

No new device OS image changes required

Secure data transfer among devices and to the hosted application


The Embedded Collector aims to provide a low-touch application that requires little or no support from vendors and is easy to install on network devices for data collection. This approach can replace dedicated collection appliances for smaller networks. By removing the external appliance, the approach not only satisfies Green criteria but also saves power and reduces Heating, Ventilation, and Air-Conditioning (HVAC) needs. The Embedded Collector design leverages an IOS programmability feature, namely Embedded Event Manager (EEM), where EEM policies written as scripts provide the intelligence to collect the data. Once the policies are installed, collection is automatically scheduled to collect data from the network devices. There are two ways to launch the collection: on-demand collection, where users invoke scripts to start the collection, and periodic automatic collection, in which a scheduled collection is kicked off by the EEM timer event and users configure the collection interval based on application and business requirements.

There are multiple ways to implement the Embedded Collector approach. One approach is to integrate the collection capability into the operating system (OS), which allows it to use the features of the operating system more coherently. However, feature integration with an operating system release usually takes a long time, and customers resist upgrading to a new image because of the significant effort involved in re-certifying and testing. A typical philosophy that network operators adhere to is "If it isn't broken, do not fix it". Thus, the OS releases running in real customer networks are often 1-2 years behind current releases.

The approach introduced in this paper utilizes the device's programmable scripting capability. This approach is quicker to develop and more easily adopted by users, since no OS upgrade is needed. The scripts are easy to install and to upgrade. If users want to remove the scripts, an uninstall script can remove the feature. For most customers, the security of their data is of paramount importance. As a result, the data transferred among devices and to other applications is encrypted and secured. This collection method offers the benefit of a complete view of the network inventory without sacrificing the security and safety of the network information.


Device Programmability using Embedded Event Manager

• Event Detectors

"watch for events of interest"

• EEM Server

The "brains" of the system

• Policies (scripts)

Policy engines - two types: Applets and Tcl-based

ED notifies EEM Server, which triggers interested policies (the event subscribers)

All of this is internal to Cisco IOS

Cisco EEM (Embedded Event Manager) is a Cisco IOS feature that allows end users to specify a condition to watch and to write a policy that is carried out when that condition is met. It consists of various event detector modules that watch for events and report them to the EEM server. The EEM server sends the event to the appropriate policy engine: CLI applet, TCL (Tool Command Language), or IOS shell (in the newest IOS images). Some examples of event detectors are Syslog, SNMP, Timer, Interface Counter, CLI, OIR (On-line Insertion/Removal), Manual, and IOS Watchdog/System Monitor. Of these, the Timer and Manual detectors are used in the Embedded Collector.
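The detector, server, and policy relationship described above is essentially a publish/subscribe dispatch. The following toy model, written in Python for illustration only, shows that shape; the class and function names are hypothetical and do not correspond to any EEM API.

```python
from collections import defaultdict

class EventServer:
    """Toy stand-in for the EEM server: routes events to subscribed policies."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def register(self, event_type, policy):
        """Subscribe a policy (any callable) to one event type."""
        self._subscribers[event_type].append(policy)

    def publish(self, event_type, payload=None):
        """Called by an event detector; runs every policy subscribed to the event."""
        return [policy(payload) for policy in self._subscribers[event_type]]

def start_collection(payload):
    return f"collection started ({payload})"

server = EventServer()
# The Embedded Collector uses the Timer and Manual detectors.
server.register("timer", start_collection)
server.register("manual", start_collection)

print(server.publish("timer", "scheduled"))
print(server.publish("manual", "on demand"))
print(server.publish("syslog"))  # no subscriber, so nothing runs
```

The same policy is registered under both event types, which mirrors how one collection can be started either on a schedule or on demand.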

In general, EEM can be used in several areas. It can apply workarounds for problems discovered in the field and increase reliability by monitoring system behavior and performing fault detection, prevention, and recovery. It automates management tasks by bundling multiple tasks and executing them automatically when a specific condition is encountered. In problem diagnosis, its event detection capabilities can help to identify issues. If additional features or functionality are needed after an IOS release, new logic can be added on top of IOS using EEM, without an image upgrade. EEM also allows end users to modify the user interface and customize features with external entities, so that it satisfies a wider range of customers.

An EEM policy is a script that is executed to carry out the desired actions. Those actions include: executing an IOS CLI command and receiving the result; forcing a switchover to the standby in an SSO configuration; requesting system information; sending an email; accessing SNMP data, locally and remotely; sending XML-RPC requests; sending SNMP traps with custom data; logging a message to Syslog; reloading the box; causing another EEM policy to be executed; and publishing an application-specific EEM event.

EEM is normally used in a reactive mode, where some action needs to be automated in response to an event. The Embedded Collector is essentially a suite of EEM policies that are replicated on devices designated as Collectors. The policies interact with each other both on the Collector devices and between devices, using secure protocols.

EEM policies can share data via EEM's context mechanism. The context mechanism is a global TCL hash table built from namespace variables that match a given regular expression. Master policies launch slave policies on the same device using the "action_policy" extension, and slave policies pick up parameters and adjust counters by retrieving the context. Thus, the Embedded Collector can use this mechanism as a crude semaphore: each policy sleeps and loops until the context is available for reading, which means another policy has finished updating it.
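The sleep-and-loop use of the context as a crude semaphore can be sketched as below. This is a Python model for illustration, not EEM TCL; the `Context` class, the key name, and the polling intervals are assumptions, and the retrieve-removes-the-data behavior is a modeling choice for this sketch.

```python
import threading
import time

class Context:
    """Toy stand-in for a shared context: one policy saves, another retrieves."""

    def __init__(self):
        self._lock = threading.Lock()
        self._store = {}

    def save(self, key, data):
        with self._lock:
            self._store[key] = dict(data)

    def retrieve(self, key):
        """Return and remove the saved data, or None if nothing is saved yet."""
        with self._lock:
            return self._store.pop(key, None)

def wait_for_context(context, key, poll_seconds=0.01, timeout_seconds=2.0):
    """Sleep and loop until the context is available or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        data = context.retrieve(key)
        if data is not None:
            return data
        time.sleep(poll_seconds)
    raise TimeoutError(f"context {key!r} not available")

context = Context()

def master_policy():
    time.sleep(0.05)  # simulate work before the parameters are ready
    context.save("collector_args", {"device": "10.0.0.5", "pending": 3})

threading.Thread(target=master_policy).start()
params = wait_for_context(context, "collector_args")
print(params)
```

The timeout matters: without it, a slave policy whose master died would loop until the framework killed it.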


Embedded Collection Flow

[Figure: collection flow between the Assistant, Collector, Gateway, and Transport. Telnet or ssh is used for show commands. A firewall (i.e. DMZ) separates the customer network from the back end inside Cisco. Yellow = control, blue = data.]

The figure above depicts the architecture of the collection in the Gateway and the Collector. The system is kicked off via an EEM timer event or via manual CLI invocation. A master scheduler policy parses a seed file and launches a collector control policy for each device to be collected. A collector control policy determines whether the collection is local or remote, and launches SNMP and CLI collector policies as appropriate.

Collector policies consult a local platform database, called the Inventory Control File (ICF), which returns the device's OS, MIB OIDs, and CLIs to collect. The Collector formats the output in a standard directory structure and archives the tree for delivery. The Forwarder policy watches for collector completion and sends the archive back to the Gateway. The Gateway either aggregates the returned inventories and re-archives the inventory for delivery to the back end, or numbers each incoming archive and sends the partial inventory to the back end, which re-assembles it once all parts have been received or a timeout occurs.

The ICF is implemented as a collection of TCL lists and arrays. One list contains all CLIs understood by the back end, and another list contains all SNMP OIDs. Three arrays are indexed by platform model number, which the policies retrieve via an SNMP query for the chassis model. The platform type array contains the platform OS. The two other arrays contain lists of indexes into the CLI and SNMP OID lists. List elements may be discrete indexes or ranges, e.g. 20-37, to save storage.

The following shows excerpts from the lists and arrays used in the ICF:

set se_show_commands [list \
    "show startup-config" \
    "show running-config" \
    "show version" \

The elements of the MIB OID list are themselves lists (a list of lists). The first element is the OID of interest. The second element is the MIB name. The third element is the expected prefix string returned by the SNMP Proxy CLI, which lets the policy determine the boundaries of the SNMP get-next or get-bulk operation. The final element is a key for the type of query to be used: S = a single OID is to be retrieved, T = this OID is a column of an entire table to be retrieved.
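To make the role of the four elements concrete, the sketch below filters a sequence of walk responses using the expected prefix and the S/T key. It is written in Python for illustration; the entry values and canned responses are hypothetical, and a real policy would issue SNMP queries on the device instead.

```python
def collect_entry(entry, responses):
    """Keep the responses that belong to one MIB list entry.

    entry     -- (oid, mib_name, expected_prefix, query_type) as described above
    responses -- (name, value) pairs in the order an SNMP walk would return them
    """
    oid, mib_name, prefix, query_type = entry
    rows = []
    for name, value in responses:
        if not name.startswith(prefix):
            break  # walked past the end of the object or table column
        rows.append((name, value))
        if query_type == "S":
            break  # single OID: one value is enough
    return rows

# Hypothetical table-column entry and canned responses.
table_entry = ("1.3.6.1.2.1.2.2.1.2", "IF-MIB", "ifDescr", "T")
responses = [
    ("ifDescr.1", "GigabitEthernet0/1"),
    ("ifDescr.2", "GigabitEthernet0/2"),
    ("ifType.1", "6"),
]
table_rows = collect_entry(table_entry, responses)
print(table_rows)

# The same responses read as a single-OID query stop after the first value.
single_entry = ("1.3.6.1.2.1.2.2.1.2", "IF-MIB", "ifDescr", "S")
single_rows = collect_entry(single_entry, responses)
print(single_rows)
```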

array set platform_mib {
    {CE-550-DS3} {0-16 20-37}
    {CE-505} {0-16 20-37}
    {CE-507} {0-16 20-37}
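The index lists in these arrays mix discrete indexes and ranges, so a policy has to expand them before looking up commands or OIDs. A minimal sketch of that expansion, in Python for illustration and with a shortened, hypothetical command list and platform mapping:

```python
def expand_indexes(spec):
    """Expand a spec such as "0-2 5" into [0, 1, 2, 5]."""
    indexes = []
    for token in spec.split():
        if "-" in token:
            first, last = token.split("-")
            indexes.extend(range(int(first), int(last) + 1))
        else:
            indexes.append(int(token))
    return indexes

# Hypothetical, shortened ICF data in the shape described above.
show_commands = ["show startup-config", "show running-config",
                 "show version", "show inventory"]
platform_cli = {"CE-505": "0-1 3"}

def commands_for(model):
    """Return the show commands to run for one platform model."""
    return [show_commands[i] for i in expand_indexes(platform_cli[model])]

print(commands_for("CE-505"))
```

Storing ranges instead of every index keeps the per-platform entries short, which is the storage saving the text refers to.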


TCL Coding Standards and Techniques in an Embedded Environment


EEM adds extensions to TCL 8.3.4 to query event info, launch policies, send syslog messages, etc.

EEM TCL library support is available for some common functions such as CLI, SMTP, and TCL global variable state check-pointing.

TCL allows all exceptions to be "caught"

Control policies are kept simple - "middle-man"

More complex collection policies are spawned per device, so that exceptions do not kill the entire collection

EEM environment variables provide a way for the end user to tune collector behavior, e.g. set timeouts, retries, debug level, etc.


TCL has been used in Cisco IOS since the mid-90s for regression testing. At the turn of the century, Cisco IOS started shipping with a TCL interpreter to support the Interactive Voice Response (IVR) feature. Shortly afterwards, other TCL-enabled Cisco IOS features were developed, such as tclsh parser mode, Embedded Syslog Manager (ESM), Embedded Menu Manager (EMM), and the main feature supporting the Embedded Collector, Embedded Event Manager (EEM).

Using TCL on Cisco IOS is a little more challenging than on a general-purpose OS. On a general-purpose OS, an exception usually results only in the death of a single process. On classic Cisco IOS, the entire OS can be treated as a single process, so things such as infinite loops are of great concern. Cisco IOS does ensure that a badly written script cannot hog the CPU, but the script will have to be killed by the EEM framework using the maximum runtime specification.

Another item to be mindful of is uncontrolled growth of lists and arrays. Since we want our TCL scripts in EEM policies to run on a variety of platforms with varying amounts of available RAM, we must periodically write to persistent storage rather than consume too much RAM. This leads to another area of concern: persistent storage. Unlike a general-purpose platform, most Cisco IOS platforms use flash memory rather than hard disks for persistent storage. Since flash memory can wear out after a limited number of write cycles, we must also take care not to write to the file system more often than necessary. Thus, a balance must be struck between RAM consumption and writing to persistent storage.
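One common way to strike that balance is to buffer records in memory and write them out in batches. The sketch below shows the idea in Python; it is not taken from the Embedded Collector, and the class name, the batch size, and the file name are assumptions.

```python
import os
import tempfile

class BufferedRecordWriter:
    """Hold records in memory and append them to storage in batches."""

    def __init__(self, path, max_buffered=100):
        self.path = path
        self.max_buffered = max_buffered  # caps RAM use between writes
        self.buffer = []
        self.flush_count = 0              # number of writes to storage

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_buffered:
            self.flush()

    def flush(self):
        """Write any buffered records with a single append."""
        if not self.buffer:
            return
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write("\n".join(self.buffer) + "\n")
        self.buffer.clear()
        self.flush_count += 1

path = os.path.join(tempfile.mkdtemp(), "inventory.txt")
writer = BufferedRecordWriter(path, max_buffered=3)
for i in range(7):
    writer.add(f"record {i}")
writer.flush()  # write the final partial batch
print(writer.flush_count)
```

A larger `max_buffered` means fewer flash writes but more RAM held between writes, so the value would be tuned per platform.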

With the above limitations in mind, putting the Embedded Collector together using a suite of EEM TCL-based policies presents additional challenges. Every function in the Embedded Collector must handle exceptions in order to prevent an exception from halting the policy completely. Static analysis tools can help spot unhandled exceptions. In addition to exception handling, control policies are kept simple so that they act as "middle-men": their only purpose is to forward arguments and watch the collection policies. This insulates the master policies from scripts that hang due to communication faults, and from any unhandled exception that slips through the cracks. EEM also provides "environment" variables, which are stored in the device's running-configuration and automatically passed to TCL-based policies as global TCL variables. This allows the Embedded Collector to provide tunable parameters that customers can use to alter the behavior of the collector, for example to shorten or lengthen time-outs or to adjust the debug level. This way, policy scripts can be shipped pre-compiled by Cisco, yet their behavior can be altered somewhat without modifying the script itself.
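The two techniques above, catching every exception per device and reading tunables from the environment, can be sketched together. This is a Python illustration, not the authors' TCL; the variable name `EC_RETRIES`, the `flaky_collect` stub, and the addresses are hypothetical.

```python
import os

def tunable(name, default):
    """Read an integer tunable from the environment, falling back to a default."""
    try:
        return int(os.environ.get(name, default))
    except ValueError:
        return default

def collect_all(devices, collect_one):
    """Collect every device; a failure on one device is recorded, not propagated."""
    results, failures = {}, {}
    retries = tunable("EC_RETRIES", 1)  # hypothetical variable name
    for device in devices:
        for _attempt in range(retries + 1):
            try:
                results[device] = collect_one(device)
                failures.pop(device, None)  # an earlier attempt may have failed
                break
            except Exception as error:  # catch everything, as the text recommends
                failures[device] = str(error)
    return results, failures

def flaky_collect(device):
    """Stand-in collector that always fails for one address."""
    if device == "10.0.0.2":
        raise ConnectionError("no response")
    return f"inventory of {device}"

results, failures = collect_all(["10.0.0.1", "10.0.0.2", "10.0.0.3"], flaky_collect)
print(sorted(results), failures)
```

The unreachable device ends up in `failures` while the other two are still collected, which is the point of isolating exceptions per device.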


Script Obfuscation and Protection

Cisco IOS has a built-in byte-code loader which allows TCL-based features, such as EEM, to treat pre-compiled scripts just as if they were plain text. This allows scripts to be obfuscated to discourage tampering and to protect intellectual property. Embedded Collector policies and library files are pre-compiled using the TclPro compiler.

The Signed TCL Scripts feature introduces security for TCL scripts. This feature allows users to create a certificate to generate a digital signature and to sign a TCL script with that digital signature. The script is checked for a digital signature from Cisco. In addition, third parties may also sign a script with a digital signature. If the script contains the correct digital signature, it is believed to be authentic and runs with full access to the TCL interpreter. If the script does not contain the digital signature, the script may be run in a limited mode, known as Safe TCL mode, or not run at all.

After each routing device enrolls in a Public Key Infrastructure (PKI), every peer (also known as an end host) in the PKI is granted a digital certificate that has been issued by a CA. When peers negotiate a secured communication session, they exchange digital certificates.

A Rivest, Shamir, and Adleman (RSA) key pair consists of a public key and a private key. When setting up a PKI, the public key must be included in the certificate enrollment request. After the certificate has been granted, the public key is included in the certificate so that peers can use it to encrypt data that is sent to the router. The private key is kept on the router and is used both to decrypt the data sent by peers and to digitally sign transactions when negotiating with peers. RSA key pairs contain a key modulus value. The modulus determines the size of the RSA key. The larger the modulus, the more secure the RSA key. However, keys with large modulus values take longer to generate, and encryption and decryption operations take longer with larger keys.

A certification authority (CA), also known as a trust point, manages certificate requests and issues certificates to participating network devices. These services (managing certificate requests and issuing certificates) provide centralized key management for the participating devices and are explicitly trusted by the receiver to validate identities and to create digital certificates. Before any PKI operations can begin, the CA generates its own public key pair and creates a self-signed CA certificate; thereafter, the CA can sign certificate requests and begin peer enrollment for the PKI.

You can use a CA provided by a third-party CA vendor, or you can use an internal CA, which is the Cisco IOS Certificate Server.


Collection Time

[Figure: Num Devices vs Collection Time per run. Y axis: number of devices (0-120); X axis: total collection time per run in minutes (0-35).]

The GW and Collectors were C7206 NPE-G1 with at least 256 MB DRAM and 128 MB Flash. The CPU did not exceed 12%.

Each Collector has 10 End devices. Thus the collection is for: 1 GW + num_of_collectors + (num_of_collectors * 10 end_devices)


The figure above shows an actual collection of a network with 100 devices. A 100-device collection takes approximately 20-30 minutes, which corresponds to approximately 200-300 devices per hour. The time varies depending on the device load and on how many collection policies are allowed to run in parallel. The host device resource usage can be governed through the EEM parallel thread configuration. The target resource usage for inventory collection is less than 10% CPU impact. It is suggested to schedule collections in off-peak hours. Prior to starting any collection, the script-based application first checks the CPU utilization of the device over the last 5 minutes. If it exceeds a threshold, the collection is not performed.
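The device count and throughput figures above follow from simple arithmetic, worked through here in Python as a check. The function names are illustrative; the choice of nine Collectors is an assumption that reproduces the 100-device case from the slide formula.

```python
def devices_collected(num_collectors, end_devices_per_collector=10):
    """1 Gateway + the Collectors + the End devices behind each Collector."""
    return 1 + num_collectors + num_collectors * end_devices_per_collector

def devices_per_hour(num_devices, minutes):
    """Throughput implied by collecting num_devices in the given time."""
    return num_devices * 60 / minutes

total = devices_collected(9)        # 1 + 9 + 90
slow = devices_per_hour(total, 30)  # 100 devices in 30 minutes
fast = devices_per_hour(total, 20)  # 100 devices in 20 minutes
print(total, slow, fast)
```

With nine Collectors the formula gives 100 devices, and the 20-30 minute range translates to 300 and 200 devices per hour respectively.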

Early collection times of the Embedded Collector were measured in hours, but several enhancements were made during development. First, individual SNMP get-next queries were replaced with smarter logic that uses SNMP get-bulk queries to determine table size and retrieve the entire table as quickly as possible. Second, the timing of inter-router policy launching was improved: custom policy-spawning TCL procedures were written so that slave policies could notify master policies as soon as they were successfully launched. Third, CLI and SNMP collection was separated into autonomous, independent policies that can be queued in any order by the EEM policy engine. This allows the number of EEM session scripting threads to be tuned per platform.
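The benefit of the first enhancement can be seen by counting requests. The sketch below is an idealized model in Python, assuming each get-bulk response carries exactly the requested number of repetitions; real agents may return fewer, and the row count and repetition value are made up for the example.

```python
def count_get_next_requests(num_rows):
    """One get-next per row, plus a final one whose answer falls outside the column."""
    return num_rows + 1

def count_get_bulk_requests(num_rows, max_repetitions):
    """Each get-bulk returns up to max_repetitions successors of the last OID.

    The walk stops after the first response that contains fewer in-column rows
    than requested, because the remaining answers lie outside the column.
    """
    requests = 0
    retrieved = 0
    while True:
        requests += 1
        in_column = min(max_repetitions, num_rows - retrieved)
        retrieved += in_column
        if in_column < max_repetitions:
            return requests

rows = 48  # hypothetical table column with 48 rows
print(count_get_next_requests(rows))      # 49 requests
print(count_get_bulk_requests(rows, 25))  # 2 requests
```

Since each request is a round trip from a scripting thread on the Collector, cutting the request count is a plausible source of the speedup described above.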

The information collected includes:

CLI                    MIB
Show ver               OldCiscoChassis
Show diag              OldCiscoChassisCard
Show module            CiscoStack
Show hardware          CiscoStackModule
Show inventory         System
Show idprom all        OldCiscoSys
Show running-config    CiscoFlash
Show startup-config    CiscoMemoryPool
                       Interface
                       IP Address
                       Entity


Device Performance Impact

[Figure: CPU utilization - cat6k. Y axis: CPU utilization (0-100); X axis: transaction.]

Transaction                    CPU utilization (%)
Verify Gateway                 10
Assign Gateway                 10
Verify Collector               30
Install Collector              20
Collection                     30
Tarring of Collection Files    30
Complete                       10

One of the primary goals of a low-touch Embedded Collector is transparent operation: after initial setup, it does its job with minimal impact on the target network. The trade-off is that replacing the computing power of a dedicated appliance with a collection of scripts that run on the devices themselves is a challenge. To minimize the performance impact, the number of scripts that are allowed to run concurrently is controlled and set to a minimum by default. EEM calls these session scripting threads. EEM policies can be marked as members of scheduler classes, and each class can be throttled to a specific number of threads.

The diagram above shows a test run with a Catalyst 6000 as a Gateway at 10 concurrent EEM threads. There is an approximately 20% rise in CPU usage during the heavy operations. This impact will be smaller on newer, more powerful devices and larger on smaller, less powerful devices. Remember that the EEM policies run only on Collectors, not on End devices, so on average only about 1 in 10 devices will be impacted, as the recommended Collector to End device ratio is 1:10.

Since the Embedded Collector shares resources with devices whose primary function is to route packets, the Embedded Collector (EC) scripts check the average CPU usage prior to launching a discovery or inventory collection. If the CPU usage is over 70%, the Embedded Collector will not start, and a syslog message is generated stating that the router was too busy.
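The pre-launch check above amounts to a simple guard. A minimal Python sketch follows; the 70% threshold comes from the text, while the sample values, the function name, and the message wording are assumptions.

```python
CPU_THRESHOLD_PERCENT = 70  # threshold quoted in the text

def should_start_collection(cpu_samples, threshold=CPU_THRESHOLD_PERCENT):
    """Return (ok, message) based on the average of recent CPU samples."""
    average = sum(cpu_samples) / len(cpu_samples)
    if average > threshold:
        return False, f"router too busy: average CPU {average:.0f}% exceeds {threshold}%"
    return True, f"starting collection: average CPU {average:.0f}%"

idle = should_start_collection([20, 25, 30])
busy = should_start_collection([80, 90, 85])
print(idle)
print(busy)
```

In the refused case the message is what would be sent to syslog, and the collection would simply wait for its next scheduled run.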

If the CPU usage rises during the collection, the Embedded Collector takes advantage of the fact that EEM processes run at a lower priority than routing processes. This ensures that routing is not interrupted. The Embedded Collector time-outs are set in terms of minutes, so momentary interruptions in collection are acceptable.

In summary, the performance of an Embedded Collector depends on device characteristics: the type of CPU on the router, the amount of DRAM, and the number of modules and cards it supports. A larger router designated as Gateway, with a powerful CPU and in excess of 2 GB of DRAM, would be able to handle the required number of threads that use heap (i.e. DRAM) without any concerns. Additionally, such routers execute the collection scripts faster and have higher-capacity flash to store the collected data. For good performance, it is recommended that a higher-end router of a given family be used as a Gateway or a Collector.


Seed File and Auto Discovery

Collection starts with a seed file. The seed file is a comma-separated value (CSV) ASCII flat file that contains the IP address, hostname, and access information for each device that needs to be collected.

The downloaded helper application, which was mentioned earlier, also acts as a seed file editor if the user wishes to enter devices manually, make corrections, or change login credentials. The helper application can also import seed files in other formats from other network management applications. The first user-defined field contains the IP address of the device that is to collect this device's information. The second user-defined field contains the IP address of the device that acts as its gateway. Any record with the gateway field populated is identified as a Collector.

The advantage of using a seed file is complete control over which devices are collected. The disadvantage is maintaining the seed file, especially if security policies dictate changing credentials periodically, which leads us to auto-discovery.

If auto-discovery is desired, an optional set of EEM policies can be installed; these are launched ahead of the scheduled collection to update the seed file. These policies use CDP (Cisco Discovery Protocol) or LLDP (Link Layer Discovery Protocol) to discover neighbors, and use known login credentials to perform a prerequisite check of newly discovered devices. Candidates are forwarded back to the master Gateway to be aggregated, and the master seed file is updated. Once discovery is complete, the collection is launched as usual.
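The seed file rules described above (a populated gateway field marks a Collector, and the other user-defined field names the device that collects a record) can be sketched as a small parser. This is a Python illustration; the column names, their order, and the sample records are assumptions, since the actual seed file layout is not given here.

```python
import csv
import io

# Hypothetical seed file; the real column layout is not specified in the text.
SEED = """\
ip,hostname,username,password,collector_ip,gateway_ip
10.0.0.1,gw1,admin,secret,,
10.0.1.1,col1,admin,secret,,10.0.0.1
10.0.1.10,sw1,admin,secret,10.0.1.1,
10.0.1.11,sw2,admin,secret,10.0.1.1,
"""

def parse_seed(text):
    """Group seed records into Collectors and the End devices each one collects."""
    collectors, assignments = {}, {}
    for row in csv.DictReader(io.StringIO(text)):
        if row["gateway_ip"]:
            # A populated gateway field marks the record as a Collector.
            collectors[row["ip"]] = row["gateway_ip"]
        elif row["collector_ip"]:
            assignments.setdefault(row["collector_ip"], []).append(row["ip"])
    return collectors, assignments

collectors, assignments = parse_seed(SEED)
print(collectors)
print(assignments)
```

Records with neither field populated, such as the Gateway itself in this sample, are left for the Gateway to collect directly.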


Scale - Nested and Segmentation

[Figure: two topologies built from Gateway A and Gateway B. Left: nested. Right: segmented.]

The major limitation on scaling this solution is free local storage on the smart devices. Many customers keep backup images, old core dumps, and other files on local storage, or have upgraded the Cisco IOS image over the years without corresponding upgrades to storage. Since the EEM policies must use Cisco IOS to archive the data, they are limited to the "tar" command in IOS, which only archives and does not compress. The tar command itself must also have free storage available for temporary files during the archival process.

In order to scale to larger networks, the Embedded Collector can be architected such that any device with the EEM feature can act as a Gateway, a Collector, or both. This allows Collectors and Gateways to be nested, as depicted on the left side of this slide. Currently, all collections for a given inventory must pass through the registered Gateway, where the inventory is aggregated and sent to the back end as a single archive. In the nested case, inventory archives are numbered so that partial inventories can be reassembled at the back end, which alleviates local storage requirements.
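Reassembling numbered partial inventories at the back end can be sketched as below. This is a Python illustration under the assumption that the back end knows how many parts to expect; the class and method names are hypothetical.

```python
class InventoryAssembler:
    """Collect numbered partial archives until the set is complete."""

    def __init__(self, expected_parts):
        self.expected_parts = expected_parts
        self.parts = {}

    def receive(self, part_number, payload):
        self.parts[part_number] = payload  # a resent part replaces the old copy

    def missing(self):
        """Part numbers that have not arrived yet."""
        return [n for n in range(1, self.expected_parts + 1) if n not in self.parts]

    def assemble(self, timed_out=False):
        """Return the parts in order once complete, or what arrived if timed out."""
        if self.missing() and not timed_out:
            return None
        return [self.parts[n] for n in sorted(self.parts)]

assembler = InventoryAssembler(expected_parts=3)
assembler.receive(2, "archive-b")
assembler.receive(1, "archive-a")
print(assembler.assemble(), assembler.missing())  # still waiting for part 3
assembler.receive(3, "archive-c")
print(assembler.assemble())
```

Because parts are keyed by number, they can arrive in any order, and a timeout can still yield a partial inventory instead of nothing.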

As depicted on the right side of this slide, networks can also be segmented to handle a larger number of devices. Each segment can have its own Gateway and be treated as a mini-network. Each segment can be collected at a different time, and the backend application assembles the collections together. Distributing the collection schedule is even more critical in the nested architecture, as the Gateways are the storage bottlenecks. Each Collector has a retry mechanism for the case where Gateway storage has been exhausted.

In either case, intelligence is required in the destination application to which the collection is shipped, so that it can aggregate the reports.


Distributed Computing and Error Recovery

[Figure: applications, Collectors, and their End devices. End devices for Collector 2 are redistributed to Collector 3.]

Distributed computing refers to multiple autonomous computers that communicate through a network. Wikipedia describes a distributed system as follows: "A distributed system may have a common goal, such as solving a large computational problem. Alternatively, each computer may have its own user with individual needs, and the purpose of the distributed system is to coordinate the use of shared resources or provide communication services to the users."

To compare this approach with distributed computing, the following examines the properties that characterize a distributed computing system.

1. No single point of failure: In this approach, the central commander is the Gateway, which commands the Collectors to start communicating with the End devices assigned to them, and each Collector starts collecting concurrently. The End device list is passed along with the start collection command to each Collector. To meet the requirement that the system tolerate failures in individual computers, when the Gateway device detects that a Collector does not respond to the start collection command, or is incapable of doing the collection because the device is busy, has a software or hardware failure, or is in scheduled maintenance downtime, the Gateway device can redistribute the collection task to other Collectors. In the graph on the slide, Collector 2 is not able to start the collection, so the End devices belonging to Collector 2 are redistributed to Collector 3. Furthermore, when the application detects that the scheduled collection has failed, the application can start a backup Gateway to perform the gateway processing.

2. The structure of the network topology, the network latency, and the number of computers are not known in advance: With the auto-discovery approach, neither the number of devices that will be discovered, nor the network topology, nor the network latency is known beforehand. The devices are connected through the internet, and the collectors can be chosen randomly as long as the collector criteria are met. End devices can be assigned to a collector based on their distance to it, i.e., it is preferable to assign end devices to a nearby collector, but this is not mandatory: as long as network connectivity exists between the collector and the end devices, they can be collected. This implies that the network latency is also unknown.

3. Each computer has only a limited, incomplete view of the system: Each collector knows only the end devices it needs to collect from and the Gateway devices to send the collection to; it does not know how many other collectors there are or how many devices the system is trying to collect in total. Each end device knows only the collector that is talking to it, and only passively.
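The redistribution described in point 1 can be sketched as follows. The function name, the data shapes, and the least-loaded placement rule are assumptions for illustration; the paper only states that the Gateway can hand a failed collector's end devices to other collectors.

```python
from typing import Dict, List, Set


def redistribute(assignments: Dict[str, List[str]],
                 failed: Set[str]) -> Dict[str, List[str]]:
    """Return a new assignment with the failed collectors' devices moved.

    Each orphaned end device goes to the healthy collector that currently
    has the fewest devices (an assumed policy, not the paper's).
    """
    healthy = [c for c in assignments if c not in failed]
    if not healthy:
        raise RuntimeError("no healthy collector left")
    result = {c: list(assignments[c]) for c in healthy}
    for collector, devices in assignments.items():
        if collector in failed:
            for device in devices:
                target = min(healthy, key=lambda c: len(result[c]))
                result[target].append(device)
    return result
```

With three collectors where Collector 2 fails and Collector 3 is the least loaded, both of Collector 2's end devices move to Collector 3, which mirrors the situation in the figure.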


Benefit

Requirement: Strategy

No appliance needed: Embedded scripts on the devices perform the collection; no external appliance is required.

Time to market: The script solution does not require IOS code changes, giving a quicker time to market.

Low touch: No IOS upgrade is needed; easier configuration and a utility allow customers to adopt this approach. Low support cost.

Security: Secure transfer of the collection from the customer site to the application backend. The collection is encrypted.

Performance: Efficient collection from network devices, with low impact on the network devices on high-end routers. Distributed collection approach.

Data-driven collection: Profile-based collection that tailors the collection to different devices, based on data.

Auto collection: The collection can be scheduled and run automatically.

Scale: Can easily be scaled to larger networks.


There are several benefits to this approach in comparison to other applications:


1) The most significant advantage of this approach is that no external appliance is needed, which is a major cost saving for customers. The mechanism is highly distributed, so there is no single point of failure. It is more cost-efficient to obtain the desired level of performance by using several network devices or low-end devices than by using a single high-end appliance. It satisfies the green criterion by virtue of not requiring an external appliance for data collection. Moreover, a distributed system may be easier to expand and manage than a single appliance.

2) The script-based solution is quicker to develop than a feature built into the operating system itself. Customers are usually reluctant to upgrade the image, if they do not resist it outright, because an OS upgrade involves a huge amount of testing. This approach shortens the time to market, in both development time and the time for customers to adopt it. The short time-to-market benefit also applies to upgrades.

3) Configuration and installation are done by a single installation script with very few configuration commands. It is a low-touch model that does not require system engineers to visit the site for the installation, unlike the appliance model, where a sales engineer typically has to travel on site to help with the installation and the initial collection. The scripts can be downloaded from a central company site. With some instruction, customers are able to install the scripts on their devices and set up the scheduled collection.

4) Data security is a high concern for customers. To ensure the safety of their data, the file transfer from the Gateway devices to the backend application uses secure protocols such as SCP and HTTPS. The data is also encrypted before transmission, and the TCL scripts are delivered in byte-code format and signed to prevent intentional or unintentional tampering.

5) The collection uses the computing power on the network devices and runs in parallel: the Gateway devices collect concurrently through the collectors, and the collectors collect concurrently from the end devices. SNMP bulk collection and a profile approach that tailors the collected information to each device type are used to minimize the collection time. The performance impact on the devices is capped at less than 10% of CPU and memory usage.

6) Using a data-driven approach, the collection can be profiled to collect only the information pertinent to the type of device, which avoids collecting unneeded information. This data-driven mechanism can serve different applications that use the collection, such as monitoring, inventory, or network management applications.

7) The collection can be started automatically and periodically.

8) The approach can be scaled to larger networks by using multiple gateways or a segmented collection approach.

In contrast to an external-device-based collection solution, the EC-based collection technique not only works in conjunction with the native OS (e.g., IOS) but also uses device resources efficiently without impacting core functions such as routing. The embedded data collection solution has good scope for scaling as a result of its distributed design, whereas an external device collector is likely to require a memory and possibly a CPU upgrade to handle an increase in network size. One of the challenges of the scaled EC design is that it must have a built-in ability to handle new-generation devices that become part of the network as it evolves.
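The profile-based, data-driven collection from points 5 and 6 can be illustrated with a small sketch. The profile table, the device type names, and the choice of commands and MIBs per type are illustrative assumptions; the actual profiles used by the Embedded Collector are not given in the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical profile data: which CLI commands and MIBs to collect per device type.
PROFILES: Dict[str, Dict[str, List[str]]] = {
    "router": {"cli": ["show version", "show inventory"],
               "mibs": ["ENTITY-MIB", "IF-MIB"]},
    "switch": {"cli": ["show version", "show vlan"],
               "mibs": ["ENTITY-MIB"]},
}

# Fallback for device types without a dedicated profile (assumed behaviour).
DEFAULT_PROFILE: Dict[str, List[str]] = {"cli": ["show version"], "mibs": []}


def collection_plan(device_type: str) -> List[Tuple[str, str]]:
    """Return the (method, item) pairs to collect for one device type."""
    profile = PROFILES.get(device_type, DEFAULT_PROFILE)
    plan = [("cli", command) for command in profile["cli"]]
    plan += [("snmp", mib) for mib in profile["mibs"]]
    return plan
```

Because the plan is derived from data rather than code, changing what is collected for a device type only means editing the profile table.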


Status and Future Work


The Embedded Collector has been released to the field for a market trial; the target users are small to medium-size customers with fewer than 100 devices. Many features can be derived from this approach.

Future enhancements include:

• Scale to larger networks: the solution can be scaled to collect from a larger number of devices by using multiple Gateways, or by collecting one segment of the network at a time and then assembling the collections at the backend.

• Auto discovery of the devices can use the Cisco Discovery Protocol, or BGP or OSPF routing tables, to discover all devices in the customer network. Alternatively, customers can specify the number of hops to which the discovery is limited.

• The collection profile can be data driven not only for the developers but also opened up to customers, who can change the profile to fit their needs by editing the metadata file of the collection, for example by changing the MIB names and CLI commands.

• Once the devices are discovered, collectors can be assigned to end devices automatically to group them.

• Further performance enhancement: currently, to limit the impact on CPU and memory usage on the Gateway and collector network devices, EEM and TCL run at a low priority, and polling of the devices may be limited to xx number/minutes. These can be made configurable, or the polling interval can be adjusted intelligently, so that the collection time is reduced.

• The collection can be used to gather diagnostic information, for example certain log and error information. Logging information can be filtered on the devices and collected on demand or during a scheduled collection.

• The current release supports Cisco applications; it can be integrated with other network management, inventory, or asset management applications.

• The current release supports only devices that respond to IOS CLI and SNMP commands. It can be enhanced to support other operating systems, such as UCS devices, Linux, or CatOS, or to support third-party devices.
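A limit on how far discovery spreads can be sketched as a breadth-first walk over neighbor tables, assuming the limit is expressed as a hop count from a seed device. The `neighbors` mapping stands in for whatever CDP or routing-table data a device would report; its shape and the function name are hypothetical.

```python
from collections import deque
from typing import Dict, List


def discover(neighbors: Dict[str, List[str]], seed: str,
             max_hops: int) -> Dict[str, int]:
    """Return every device within max_hops of seed, with its hop distance.

    neighbors maps a device to the peers it reports (for example from a
    neighbor table); devices missing from the map report no peers.
    """
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        device = queue.popleft()
        if hops[device] == max_hops:
            continue  # do not expand beyond the configured limit
        for peer in neighbors.get(device, []):
            if peer not in hops:
                hops[peer] = hops[device] + 1
                queue.append(peer)
    return hops
```

The hop distances returned here could also feed the assignment of end devices to a nearby collector, although the paper does not prescribe how that distance is measured.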


Acknowledgement:

We would like to acknowledge Raja Banerjee, Subrata Dasgupta, Tim Johnson, Jim McDonnell, Sureshbabu Nagarathinam, Chocks Ramiah, Ammar Rayes and Alex Truong for their contribution to the project and to the concepts and ideas presented in this paper. We also thank the Cisco EEM team for providing the device programmability.

References:

[1] Ghosh, Sukumar (2007), Distributed Systems - An Algorithmic Approach, Chapman & Hall/CRC, ISBN 978-1-58488-564-1.

[2] Lynch, Nancy A. (1996), Distributed Algorithms, Morgan Kaufmann, ISBN 1-55860-348-4.

[3] Peleg, David (2000), Distributed Computing: A Locality-Sensitive Approach, SIAM, ISBN 0-89871-464-8.

[4] IOS 12.4T User Guide: http://www.cisco.com/en/US/docs/ios/12_4t/netmgmt/configuration/guide/sign_tcl.html, last accessed 1/31/2011.

[5] Cisco Systems, Inc., Signed Tcl Scripts, http://www.cisco.com/en/US/docs/ios/12_4t/netmgmt/configuration/guide/sign_tcl.html, last accessed 1/31/2011.

[6] Welch, Brent B. (2000), Practical Programming in Tcl and Tk, Prentice Hall PTR, ISBN 0-13-022028-0.

U.S. Patent pending: System and Method for Providing a Script-Based Collection for Devices in a Network Environment, U.S. Application Serial No. 12/848,146.
