
Research Article

Deploying and Managing a Network of Autonomous Internet Measurement Probes: Lessons Learned

Urban Sedlar, Miha Rugelj, Mojca Volk, and Janez Sterle

Faculty of Electrical Engineering, University of Ljubljana, Trzaska Cesta 25, SI-1000 Ljubljana, Slovenia

Correspondence should be addressed to Urban Sedlar; [email protected]

Received 31 July 2015; Revised 2 October 2015; Accepted 11 October 2015

Academic Editor: Chenren Xu

Copyright © 2015 Urban Sedlar et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

International Journal of Distributed Sensor Networks, Volume 2015, Article ID 852349, 8 pages, http://dx.doi.org/10.1155/2015/852349

We present our experience with design, development, deployment, and operation of a nation-wide network of mobile Internet measurement probes. Mobile Internet is becoming increasingly important, with new technologies being deployed on a regular basis. In the last decade we have seen a transition over a multitude of data transmission technologies, from EDGE, UMTS, and HSPA to LTE, with 5G already on the horizon. This presents an extremely heterogeneous environment where problems are difficult to pinpoint, because of either the fast changing radio channel conditions, daily mobility of the user base, or a number of fallback technologies that are available at a specific location and chosen by the terminal itself. To quantify these conditions, we have developed a mobile Internet measurement probe called QMON, which can be either statically deployed or used in drive measurements and is able to collect hundreds of key performance indicators on the physical, network, and application layers of the network stack, acting at the same time as an event-driven real-time sensor network and a batch-mode detailed data collection device. In this paper we expose some of the design considerations and pitfalls; among them are the problems of managing and monitoring remote probes that measure the same communication channel that is also used for their control.

1. Introduction

In recent years, mobile Internet connectivity has become ubiquitous, enabling a plethora of new apps and services. Worldwide, deployments of 4G networks have significantly reduced the mobile data latencies and increased the available throughputs and are quickly closing the gap with fixed technologies [1]. With work on 5G technologies already underway, the future of mobile data seems brighter than ever—enabling everything from HD and Ultra HD video streaming and virtual reality on the go to connecting billions of devices to the Internet as a part of the Internet of Things (IoT) paradigm [2].

However, the reality of real-world mobile networks is more complex, further complicated by legacy system constraints and a need for seamless transition between technologies. Due to the fast pace of technological progress, mobile networks are constantly evolving. This means that often many heterogeneous technologies coexist in the same location and in the same mobile system (e.g., 2G, 3G, and 4G) [3]. In addition, compared to fixed networks, which operate in a relatively stable environment [4], the last mile in mobile networks is a radio channel that can be subject to radio interference, coverage issues, or traffic spikes due to daily user commuting.

This makes such networks increasingly complex [5] and difficult to tune, as it introduces a wide variety of physical layer parameters the operator needs to monitor and adjust when optimizing the network [6]. Firstly, on the radio layer, multiple concurrently available network technologies on different frequency bands provide different channel bandwidths, capacities, and latencies and have different coverage and energy efficiency [7]. Next, on top of the radio layer, different network protocol stacks exhibit different performances (IPv4 versus IPv6, TCP and multipath TCP versus UDP, etc.). And lastly, what the user sees is only the performance of the applications and services, which is subject to yet another set of conditions (application behavior and optimizations, hardware capabilities, and server loads).

To address this challenge, we have designed and implemented a distributed multilayer measurement system called QMON [8]. It is a cloud-managed system of probes that perform a wide range of measurements, some of which are done by emulating user activity on the application layer. The system thus collects vast amounts of data on multiple layers of the network stack. Special attention was given to layer 7 web-based services and applications, with the rationale that L7 metrics of modern services are most indicative of the users' level of satisfaction. In addition to quantifying the users' level of satisfaction, the collected data also enables better understanding of the network and makes future optimizations possible, on both the network and physical layers. For example, it can be used to adaptively route the traffic and perform best path selection, provide configuration for a software-defined network (SDN), or guide the providers in either manual or automated physical layer optimizations [9].

In this paper we describe how we designed and implemented the system from the point of view of data capture and exchange, and outline the practical experience we had with deployment and management of a number of probes installed at dozens of fixed locations as well as in vehicles for performing periodic drive tests.

2. Related Work

Many solutions exist for monitoring communications networks and Internet services, both commercially and as research projects. Long-standing crowdsourced application-layer solutions include Speedtest.net [10] and Grenouille [11]; for an example of a crowdsourced network-level solution, see RIPE Atlas [12].

However, to gain better insight into the network, the whole stack needs to be taken into account and a cross-layer solution is needed, addressing the physical, network, and application layers. In this area, due to the necessary collection of data from mobile hardware, the number of solutions quickly plummets. Apart from research efforts such as QoE Doctor [13], few practical products address the measurement and monitoring of the latest generation of commercial mobile networks. In addition, commercial solutions typically lag behind the state of the art in consumer off-the-shelf mobile radio hardware, while their high price also prevents the use of such equipment on a larger scale, for example, as a part of an IoT system. Only recently has the trend of deploying measurement and sensing probes on a large scale and using data analytics to gain insight become viable, owing primarily to the falling prices of hardware components. For example, [14] describes a project named SITEWARE, a distributed network and Quality of Service (QoS)/Quality of Experience (QoE) measurement platform that leverages big data and allows network optimization. Our research and the presented solution are similar in the way they address cross-layer data collection and measurements; however, our work places more focus on practical implementation and deployment with mobile operators. Our solution also tries to achieve a good trade-off between real-time awareness and the option of a detailed historical drill-down analysis, while autonomous operation in adverse network conditions requires a nonnegligible portion of the system intelligence to reside on the probe itself.

3. System Constraints

Our guidelines and resulting constraints for the system were primarily inspired by the need for scalability, manageability, and autonomy, while achieving a trade-off between real-time feedback and detailed drill-down analysis. Below we present them in more detail:

(i) Scalability: effortless deployment and distributed operation enable low-cost, large-scale data collection for longitudinal case studies, trend analysis, and optimization, which are simply not attainable with small or short-lived deployments.

(ii) Central control to orchestrate the network of agents: this allows the system to perform both synchronized measurements and stress tests.

(iii) Agent autonomy and robustness: in case of possible network problems (which can be commonplace in mobile environments) the agent should have enough intelligence to survive and continue doing useful work on its own.

(iv) The main functional requirement, related to the mobile network monitoring use cases: the ability to emulate user activity by using up-to-date software and services. This includes measuring response times and other KPIs on the application layer, while at the same time collecting as many network and physical layer parameters as possible.

(v) Hardware constraints: the hardware must provide a variety of radio network parameters (using the modem as a sensor), as well as USB and GPIO interfaces for connecting additional sensors (GPS, temperature). All other things being equal, we prefer off-the-shelf hardware to drive down maintenance costs.

4. System Design

The architecture of the system is presented in Figure 1. The measurement probe hosts the main component of the system, the software agent that performs the measurements and collects mobile system parameters. The agent interfaces with the hardware to collect mobile radio parameters, performs user emulation, and stores the results in an XML file. For the basic performance indicators, as soon as the results are obtained, they are pushed to a real-time event processing engine that allows monitoring of the system in real time. When the entire measurement cycle is completed, the results are uploaded to a data server, where they are parsed, stored in a database, and enriched with GeoIP data and data from the ARIN/RIPE database.
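To make the two result paths concrete, the following minimal sketch shows how an agent might push a basic KPI event and write a cycle's detailed results to XML. The endpoint URL, field names, and XML schema are illustrative assumptions; the actual agent code is not published in the paper.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

EVENT_URL = "https://events.example.org/kpi"  # hypothetical real-time endpoint

def push_kpi_event(probe_guid, kpi_name, value):
    """Push a single basic KPI to the real-time event processing engine."""
    payload = json.dumps({"guid": probe_guid, "kpi": kpi_name, "value": value})
    req = urllib.request.Request(EVENT_URL, data=payload.encode(),
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

def write_cycle_results(path, probe_guid, cycle_results):
    """Write one measurement cycle's detailed results to a local XML file,
    to be uploaded in batch once the cycle completes."""
    root = ET.Element("cycle", guid=probe_guid)
    for name, value in cycle_results.items():
        ET.SubElement(root, "measurement", name=name).text = str(value)
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
```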

Below we further outline how the system design addresses some of the key constraints and problems.

[Figure 1: System architecture and main data flows. The diagram shows the agents (agent software plus hardware and sensors), the managed servers (configuration manager with configuration DB, detailed batch data aggregation with result DB, data enrichment jobs, batch data analysis, and the real-time event processing engine with event DB), and external online data sources, connected over the public Internet. The numbered flows are: (1) send status info and configuration request; (2) deliver configuration (HTTPS REST API); (3) perform measurements; (4) collect data from local hardware and sensors (modem, GPS, and internal HW); (5) submit basic KPIs; (6) submit results (SFTP with resume); (7) enrich data; (8) refresh data. Green color denotes the monitoring subsystem, which closes the feedback loop.]

4.1. Probe Autonomy. Since the probes can be deployed in remote or even inaccessible locations, probe autonomy is of the highest importance. This is further complicated by the inherent planned communication blackouts to avoid interference with Internet measurements. This makes remote decision making in the cloud impractical, and the agent needs to be appropriately smarter, allowing it to handle various software errors and hardware malfunctions. The latter turn out to happen quite often with commercial firmware of the latest generation of peripherals.

Another desired property of an autonomous agent is the ability to operate without visibility of the Command and Control (C&C) servers. This might seem a primitive requirement in the era of always-on broadband and redundancy built into data centers, but mobile coverage is often far from perfect, especially in rural areas. The borders of white spots with no mobile coverage are among the most interesting places for an operator or a regulator to measure and monitor, making the need for autonomous operation the norm rather than the exception.

Thus, to satisfy both requirements, the remote configuration and agent operation roughly follow the procedure outlined in the flow chart in Figure 2.

[Figure 2: Agent autonomous operation flow chart. On initialization the agent generates and stores a unique GUID (first run) or loads the existing one, then reports status and fetches its configuration based on the GUID. If the C&C server is accessible, the configuration is stored; otherwise the previously stored configuration is loaded, or, if none exists, the agent waits and retries. The configuration is then opened and parsed, measurements 1 through N are performed (each triggering a KPI event and storing partial results), the results are batch-uploaded, and finally archived locally.]
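In code, this bootstrap roughly amounts to the following sketch; the file paths and the C&C endpoint are illustrative assumptions, not taken from the actual codebase.

```python
import json
import os
import uuid
import urllib.request

GUID_PATH = "/etc/qmon/guid"               # illustrative paths
CONFIG_CACHE = "/etc/qmon/config.json"
CC_URL = "https://cc.example.org/config"   # hypothetical C&C endpoint

def load_or_create_guid():
    # On first run, generate and persist a unique agent identifier.
    if not os.path.exists(GUID_PATH):
        with open(GUID_PATH, "w") as f:
            f.write(str(uuid.uuid4()))
    with open(GUID_PATH) as f:
        return f.read().strip()

def fetch_config(guid):
    # Report status and fetch configuration; fall back to the cached
    # copy when the C&C server is unreachable.
    try:
        with urllib.request.urlopen(f"{CC_URL}?guid={guid}", timeout=15) as resp:
            config = json.load(resp)
        with open(CONFIG_CACHE, "w") as f:
            json.dump(config, f)           # store config for offline operation
        return config
    except OSError:
        if os.path.exists(CONFIG_CACHE):
            with open(CONFIG_CACHE) as f:
                return json.load(f)        # load previous config
        return None                        # no config yet: caller waits and retries
```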

4.2. Updating and Rollback. To provide bug fixes and seamless roll-out of new features, an updating mechanism is needed. However, it has to be designed in the simplest possible manner, to make it as bug-free and fool-proof as possible. If the agent software is corrupted remotely, either it has to be able to roll back the destructive changes itself, or manual intervention is needed, which can often be costly.

We addressed this as follows. The agent first fetches the update package; upon unpacking it, an independent process is started, which performs integrity checks, terminates the agent, moves the old files to a safe location, unpacks the archive, and tries to start the new agent. If the agent starts, it terminates the update process; otherwise, if the update process has not been terminated after a certain amount of time, it assumes something went wrong and performs a rollback by deleting the newly extracted files, copying the old files back, and restarting the old version.
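A minimal sketch of such an updater, run as the independent process; the paths, the service name, the grace period, and the liveness check are illustrative, and the paper's actual handshake (the newly started agent terminating the updater) is simplified here into a polling check.

```python
import os
import shutil
import subprocess
import tarfile
import time

AGENT_DIR = "/opt/qmon/agent"          # illustrative paths
BACKUP_DIR = "/opt/qmon/agent.bak"
START_CMD = ["systemctl", "start", "qmon-agent"]   # hypothetical service name

def agent_is_alive():
    # Simplified liveness check standing in for the paper's handshake,
    # in which the new agent terminates the updater process.
    return os.path.exists("/run/qmon-agent.pid")

def apply_update(package_path, grace_s=60):
    """Swap in the new agent version; roll back if it fails to start."""
    subprocess.run(["systemctl", "stop", "qmon-agent"], check=False)
    if os.path.exists(BACKUP_DIR):
        shutil.rmtree(BACKUP_DIR)
    shutil.move(AGENT_DIR, BACKUP_DIR)        # move old files to a safe location
    with tarfile.open(package_path) as tar:
        tar.extractall(AGENT_DIR)             # unpack the new version
    subprocess.run(START_CMD, check=False)
    time.sleep(grace_s)                       # give the new agent time to come up
    if not agent_is_alive():
        shutil.rmtree(AGENT_DIR)              # rollback: delete newly extracted files,
        shutil.move(BACKUP_DIR, AGENT_DIR)    # copy the old files back,
        subprocess.run(START_CMD, check=False)  # and restart the old version
```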

4.3. System Scalability. Designing the probe around an autonomous agent worker makes the scalability of the sensor node trivial; all operations on the side of the agent can run independently and thus concurrently. Probe deployment is further facilitated by the unique agent GUID (see Figure 2), which is generated upon first start and is used for agent identification in all subsequent communication.

Thus, only the remote configuration (Figure 1, steps (1)-(2)), server-side data collection (Figure 1, steps (5) and (6)), and data processing (Figure 1, step (8)) need to be scaled when the number of probes is increased. Although scaling in steps (2), (5), and (6) can be achieved by balancing the load over multiple servers, database query caching can provide significant resource savings in step (2) as well. Finally, a scalable data storage technology should be used in the system, in the form of either an eventually consistent NoSQL database or a sharded relational database.

5. Implementation

The probe is implemented on an industrial x86 computer with the latest generation of mobile modem hardware. Based on the Linux OS, with the agent software developed in Python, the agent fetches the data over a REST API from the management server and performs the measurements in a cycle, using a variety of protocols and methods.

On the physical layer, the following parameters and metrics are collected directly from the modem hardware:

(i) Radio access technology used (LTE, HSPA, UMTS, or EDGE).

(ii) Received Signal Strength Indicator (RSSI).

(iii) Transmit power.

(iv) Radio state (RRC idle or RRC active).

(v) Multiple HSPA-specific parameters: frequency band and channel, receiving levels for individual carriers.

(vi) Multiple LTE-specific parameters: signal-to-interference-plus-noise ratio (SINR), frequency band, channel bandwidth, Reference Signal Received Power (RSRP), Reference Signal Received Quality (RSRQ), Tx and Rx channels, and so forth.

(vii) Various mobile channel identifiers (operator name, operator code, cell ID, local area code, tracking area code, etc.).

Collection of these parameters allows the operator to capture the exact mobile radio context of the measurements; since many of the parameters (including the technology and the base station) change due to environmental factors or user mobility, this can give insight into any sudden performance degradation.
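How these values are read is modem specific; as an illustration only, the sketch below polls a serial-attached modem with two standard 3GPP TS 27.007 AT commands (+CSQ for RSSI, +COPS for the operator), assuming the pyserial package and an illustrative device path. LTE-specific values (RSRP, RSRQ, SINR) typically require vendor-specific commands not shown here.

```python
import re
import serial  # pyserial

def at_command(port, cmd, timeout=2.0):
    """Send one AT command and return the response lines."""
    with serial.Serial(port, 115200, timeout=timeout) as s:
        s.write((cmd + "\r").encode())
        return s.read(4096).decode(errors="replace").splitlines()

def read_rssi_dbm(port="/dev/ttyUSB2"):
    # "+CSQ: <rssi>,<ber>": rssi 0..31 maps to -113..-51 dBm; 99 means unknown
    for line in at_command(port, "AT+CSQ"):
        m = re.match(r"\+CSQ:\s*(\d+),", line.strip())
        if m and int(m.group(1)) != 99:
            return -113 + 2 * int(m.group(1))
    return None

def read_operator(port="/dev/ttyUSB2"):
    # '+COPS: <mode>,<format>,"<oper>"[,<AcT>]'
    for line in at_command(port, "AT+COPS?"):
        m = re.search(r'\+COPS:.*?"([^"]+)"', line)
        if m:
            return m.group(1)
    return None
```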

On the network layer the metrics are obtained by actively generating service and application traffic between the probe and the network elements or servers, either in the operator's network or in the public Internet. In some cases, the amount of traffic generated is limited only by the link speed and measurement duration, which can lead to significant mobile data usage; when such high-throughput measurements are required, their frequency can be adjusted to reduce the amount of traffic transferred. The following metrics are captured on the network layer (a collection sketch follows the list):

(i) ICMP ping to estimate round-trip times (RTT) on the network level.

(ii) Traceroute to determine the measured path of the individual measurements.

(iii) Uplink and downlink throughputs using both TCP and UDP traffic.
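As a sketch of the simplest of these active tests, the helper below wraps the system ping binary and parses its RTT summary line; the output format assumed is that of Linux iputils ping, and the helper name is ours.

```python
import re
import subprocess

def ping_rtt(host, count=5):
    """Return (min, avg, max) RTT in milliseconds using the system ping."""
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    # iputils summary line: "rtt min/avg/max/mdev = 14.2/14.7/15.1/0.4 ms"
    m = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)/", out)
    return tuple(float(x) for x in m.groups()) if m else None
```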

On the application layer the user actions are again actively emulated, using either actual applications or headless application-scripting frameworks:

(i) HTTP and FTP upload and download throughputs.

(ii) DNS resolution times.

(iii) Web-based metrics, obtained by emulating user requests for page loads. The duration needed for the page to load is captured, as well as the total payload of the page, the number of connections, and the connection times.

(iv) In addition, based on the duration of web page requests, a Mean Opinion Score (MOS) for web browsing is also calculated.

Finally, contextual data, such as hardware temperature, system load, and probe location, are collected. Location information is obtained from a GPS receiver module, but, in case of indoor use, one possible alternative could also be WiFi positioning [15].

A number of open source projects were used to gather the described metrics and KPIs, ranging from GNU ping, traceroute, wget, and iperf to the scriptable browser engine PhantomJS [16].
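As an illustration of the web metrics, the sketch below times a simple page download and maps the duration to a web-browsing MOS with a logarithmic model in the spirit of ITU-T G.1030; the coefficients are placeholder assumptions, not the calibrated values used in QMON (which drives a full browser engine rather than a single HTTP fetch).

```python
import math
import time
import urllib.request

def page_load_metrics(url):
    """Fetch a page, returning load duration (s) and payload size (bytes)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=30) as resp:
        payload = resp.read()
    return time.monotonic() - start, len(payload)

def web_browsing_mos(load_time_s, a=4.7, b=0.8):
    # Logarithmic session-time model in the spirit of ITU-T G.1030;
    # a and b are illustrative coefficients; result clamped to the 1..5 MOS scale.
    return max(1.0, min(5.0, a - b * math.log(max(load_time_s, 0.1))))
```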

A significant part of the network and application-level tests could have also been implemented as a smartphone app, making the solution more cost-efficient and easier to deploy. However, the number of physical layer indicators and metrics that can be acquired on such devices is very limited in comparison to dedicated mobile modem hardware. The main reason for this is that the sandboxed environment of mobile apps does not allow the necessary low-level access to the hardware. At the same time, mobile phone hardware would be very heterogeneous, offering nonuniform sets of capabilities and incomparable datasets.

6. Data Processing Pipeline

Data processing combines two independent data pipelines that serve different purposes: (1) an event-driven pipeline based on real-time KPI delivery, and (2) a batch postprocessing pipeline based on commercial Business Intelligence (BI) tools. This combination allowed us to get the best of both worlds: fast response for basic metrics and big-data-sized detailed results for drill-down analysis. Such an approach proved extremely valuable because the network resources are completely unavailable while a measurement is in progress and can be sporadically constrained otherwise due to poor coverage.

6.1. Real-Time Event Processing. The purpose of real-time processing is to be able to monitor the state of the mobile networks in real time [17]. This is useful either for operators when making changes to the network, or while driving, to get instantaneous feedback on the coverage, network parameters, and system behavior. This can be seen in Figure 3 (Figure 3(d) showing the real-time status in the mobile app and Figure 3(c) showing the web-based real-time dashboard).

Real-time event processing fuses historical data, KPI events delivered by the agent, and the information stored in the database of the C&C server (fetched over an HTTP API). For triggering alarms based on a combination of events, a rudimentary complex event processing (CEP) engine with windowing support was deployed, which checks for concurrent occurrence of multiple conditions and tracks moving averages of KPIs.
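A rudimentary sketch of such windowed checking follows: a per-probe moving average over the last n events and an alarm that fires only on the concurrent occurrence of two conditions. The KPI names and thresholds are illustrative; the deployed engine is not detailed in the paper.

```python
from collections import defaultdict, deque

class KpiWindow:
    """Track a moving average of one KPI over the last n events per probe."""
    def __init__(self, n=20):
        self.events = defaultdict(lambda: deque(maxlen=n))

    def add(self, guid, value):
        self.events[guid].append(value)

    def avg(self, guid):
        w = self.events[guid]
        return sum(w) / len(w) if w else None

def check_alarm(dl_win, rsrp_win, guid, dl_min_kbps=1000, rsrp_min_dbm=-110):
    # Fire only on the concurrent occurrence of two conditions, e.g. low
    # average download speed together with weak signal (illustrative thresholds).
    dl, rsrp = dl_win.avg(guid), rsrp_win.avg(guid)
    return (dl is not None and rsrp is not None
            and dl < dl_min_kbps and rsrp < rsrp_min_dbm)
```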

6.2. Batch Processing and Data Analysis. The detailed results are delivered in the XML format after the measurement cycle is completed. The use of plain-text XML files simplifies data management, resubmission of failed files, and checksumming, and allows manual collection of large amounts of data when the uplink is not adequate. Uploaded files are regularly parsed into a database and archived as the reference copy, making it easy to replay events on a different server or to track down problems with the data processing chain. The parsed data in the database then serves as a data source for import into the BI solution. However, such a multistaged process means that in the worst case the generated reports lag a couple of hours behind the state of the network.
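A minimal sketch of the parsing stage, reusing the toy XML schema from the earlier result-writing sketch and an SQLite table as a stand-in for the actual database:

```python
import sqlite3
import xml.etree.ElementTree as ET

def parse_results(xml_path, db_path="results.db"):
    """Parse one uploaded result file into a relational table (schema illustrative)."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS results (guid TEXT, kpi TEXT, value TEXT)")
    root = ET.parse(xml_path).getroot()
    guid = root.get("guid", "")
    rows = [(guid, m.get("name"), m.text) for m in root.iter("measurement")]
    db.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
    db.commit()
    db.close()
```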

6.3. Data Processing Pipeline Limitations. The size of collected result sets can range from 1 KB per measurement cycle (e.g., a single measurement with results and metadata) to multiple megabytes per cycle (for hundreds of measurements). Besides the level of reported detail, the size of the results is also correlated with the duration of the cycle. The average data rate per agent in our deployment is thus currently at around 3 kbit/s per probe, with log data accounting for another 8 kbit/s. At this rate, a single upload server with a gigabit network interface could accommodate tens of thousands of probes. On the other hand, the amount of data collected grows quickly, with 1 GB/probe/month for results data and another 2.5 GB/probe/month for logs (where enabled). We have addressed this by sharding the results and by discarding old log files.
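As a rough sanity check of the capacity claim: 3 kbit/s of results plus 8 kbit/s of logs is about 11 kbit/s per probe, and 1 Gbit/s divided by 11 kbit/s gives roughly 90,000 probes, so tens of thousands remain feasible even after allowing for protocol overhead and the burstiness of batch uploads.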

However, due to the constraints of the measurement process, which should not be interrupted by the transmission of the results, the available slots for communication with the backend can easily be underallocated, especially on slow mobile links; this requires special caution when configuring the system, to achieve the right balance between the measurement and results upload time slots.

6.4. Experimental Results. Two examples of measured results obtained by the system are shown in Figure 5. The map on the left depicts an area of the city of Ljubljana, together with the locations of mobile cell towers. Red and green dots represent measurements obtained during a drive test, with the color indicating the LTE Reference Signal Received Power (RSRP). The circled area of the city clearly exhibits low RSRP. The metadata associated with each measurement can be used to determine which base station is serving the area, what frequency and frequency band were being used, the speed of the vehicle when the measurement was performed, and whether a handover between base stations occurred. In addition, the effect of poor signal strength on the actual download and upload speeds can be estimated, determining how the user perceives the problem.

The right side of Figure 5 shows a cumulative distribution function (CDF) and a histogram for LTE download (DL) and upload (UL) speeds. It can be seen that higher download speeds were achieved at 1800 MHz using a 20 MHz band (upper blue histogram) than at 800 MHz using a 10 MHz band (bottom blue). Similarly, upload speeds were higher at 1800 MHz using a 20 MHz band (upper red histogram) than at 800 MHz using a 10 MHz band (bottom red). This is expected behavior; however, it shows how varied the root causes for poor performance can be: the radio access technology used (LTE, or HSPA, UMTS, or EDGE fallback), actual poor signal strength, handovers due to user mobility, the frequency and bandwidth of the base station, and a congested or misconfigured network core or overloaded application servers. By collecting as much context data as possible for each measurement, the main reasons for poor performance can be quickly discovered and addressed.

[Figure 5: Two examples of the results obtained with the presented measurement system. (a) A map of Ljubljana with points indicating a probe during a drive test. The color of the points represents LTE Reference Signal Received Power (RSRP), while the map also shows the locations of the deidentified mobile cell sites. A dashed red ellipse indicates an area of low RSRP signal strength. (b) Cumulative distribution function (CDF) and histogram charts for mobile LTE download (blue) and upload (red) speeds; both are shown at different LTE frequency bands (1800 MHz using a 20 MHz band: upper blue and upper red; 800 MHz using a 10 MHz band: lower blue and lower red charts).]

7. Management and Monitoring

7.1. Remote Agent Management. Monitoring and remote management of probes whose primary purpose is network measurement are especially challenging, since the communication channel and the measurement channel are the same; even more so because the channel needs to be torn down and set up multiple times when changing measurement parameters (e.g., switching between mobile technologies) or measuring the time needed to connect to the network.

[Figure 3: System implementation building blocks: (a) agent management Web GUI; (b) agent deployment on industrial-grade probe hardware; (c) real-time dashboard displaying measured KPIs; (d) mobile dashboard displaying agent status for drive measurements; (e) drill-down analytics using a professional BI tool.]

We have solved this in the following ways:

(i) No unnecessary services are running that could cause unwanted traffic while measurements are in progress.

(ii) No persistent connection or tunnel is established between the agent and the server side, which ensures the measurements are not influenced by monitoring and remote control traffic. Therefore, all management and remote control must be initiated from the agent during the idle period.

However, this puts a lot of responsibility on the agent itself, since it must remain completely autonomous and capable of recovering from any possible situation. For this reason, we have implemented an internal state machine that retries certain actions (e.g., mobile modem commands, if the modem fails to connect to the network with the desired technology) or initiates a soft reboot. If the modem drivers hang the system irrecoverably, a hardware watchdog initiates a power cycle.
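In outline, the escalation ladder resembles this sketch, where the modem wrapper object, retry counts, and delays are illustrative assumptions:

```python
import subprocess
import time

def connect_with_retries(modem, technology, attempts=3, delay_s=30):
    """Retry a modem action; escalate to a soft reboot if all attempts fail.
    A hardware watchdog (not shown) power-cycles the probe if even the
    soft reboot never completes."""
    for _ in range(attempts):
        if modem.connect(technology):   # hypothetical wrapper around AT commands
            return True
        time.sleep(delay_s)
    subprocess.run(["reboot"])          # soft reboot as the last software resort
    return False
```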


[Figure 4: Inferring operation anomalies from the number of log events for configuration requests (count of records per day of the month, ranging up to roughly 30,000); February 7–10 clearly shows a peak which had to be investigated. Such occurrences found in historical data can serve to train CEP filters to detect similar anomalies in real time in the future.]

Finally, for the cases when a remote and inaccessible probe gets completely cut off from the Internet, we have added an out-of-band communication channel that leverages the same hardware: SMS text messages. This allows limited communication when the probe is completely stuck without connectivity and manual intervention is needed.
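Sending such a message can reuse the measurement modem through the standard 3GPP TS 27.005 text-mode AT commands (AT+CMGF=1, AT+CMGS); a minimal sketch, again assuming pyserial and an illustrative device path:

```python
import serial  # pyserial

def send_sms(number, text, port="/dev/ttyUSB2"):
    """Send a status SMS through the measurement modem (3GPP TS 27.005 text mode)."""
    with serial.Serial(port, 115200, timeout=5) as s:
        s.write(b"AT+CMGF=1\r")                  # switch to SMS text mode
        s.read(64)
        s.write(f'AT+CMGS="{number}"\r'.encode())
        s.read(64)                               # wait for the "> " prompt
        s.write(text.encode() + b"\x1a")         # Ctrl-Z terminates the message
        return b"OK" in s.read(128)
```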

7.2. Monitoring. Remote monitoring happens on four different fronts, leveraging the data collected at different points in the system, as follows:

(i) The configuration C&C dispatcher receives an event when an agent fetches a configuration; the request happens over an HTTPS GET, and the same request is used to piggyback some metadata about the current state of the probe and the agent software. In this manner, vital statistics about CPU, memory, and disk utilization are collected, as well as the number of controlled (intentional) and uncontrolled (unintentional) reboots.

(ii) The real-time dashboard receives events for basic KPI results; this allows the user to monitor the state of the network, but the presence, frequency, and payload of the events also carry information about the measurement system itself, which makes them a valuable source for monitoring.

(iii) The detailed XML results include over 500 different KPIs covering the physical, network, and application layers. Since these result files can become large, FTP was used as the protocol for syncing the measured and gathered data with the server. This can only be processed in a batch mode, as the results are pushed to the server only after the measurement cycle is completed.

(iv) Similarly, the agent log files containing different levels of debug information are likewise pushed to the server, but only after a cycle is completed; due to the time lag, this is useful only for historical trending, analysis, and root cause discovery.

Thus, the monitoring process is multistaged. Firstly, when the agent fetches its configuration, the recorded condition of the probe is updated; the same request reports the previously dispatched configuration and the CPU, memory, and disk utilization, as well as their historical trend. Next, the basic KPIs delivered immediately after each measurement serve as an agent heartbeat. Finally, when the entire measurement cycle is complete, the results and logs are uploaded in their entirety. These can be imported into an analytics tool and used to discover past trends and anomalies, which serve to guide the development of further features, the creation of filters, and the setting of thresholds for alarms. For example, the peak in Figure 4 shows an anomaly detected from historical logs of the management server (configuration dispatcher): a single rogue probe was hitting the server with thousands of configuration requests per day due to an unhandled exception, which caused it to abort the measurements. Occurrences such as this can serve to train the complex event detection engine, which is then able to detect similar anomalies as they happen, reducing the time needed to respond and fix the problem.
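A filter of the kind trained from such historical data might simply compare each probe's daily configuration-request count against its own baseline; a sketch with illustrative thresholds, using a robust median baseline:

```python
from statistics import median

def flag_rogue_probes(daily_counts, factor=10, min_requests=100):
    """daily_counts: {guid: [requests per day, oldest first]}.
    Flag probes whose latest daily count exceeds `factor` times their own
    historical median (thresholds are illustrative, not the deployed rule)."""
    rogue = []
    for guid, counts in daily_counts.items():
        if len(counts) < 2:
            continue                       # not enough history for a baseline
        baseline = median(counts[:-1])
        if counts[-1] > max(min_requests, factor * max(baseline, 1)):
            rogue.append(guid)
    return rogue
```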

8. Conclusions

In this paper we presented the design of a distributed multilayer mobile network monitoring solution that is remotely managed and operates as a wide-area sensor network. The probes push the data in two modes: basic real-time indicators for instantaneous feedback and detailed measurement results that allow the operator to gain deep insights into the mobile network. The solution has been deployed and is successfully being used or evaluated by multiple mobile network operators in Central and Eastern Europe.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research and development work was in part supported by the European Commission (CIP-ICT-PSP-2011-297239), the Slovenian Research Agency under Grant Agreements P2-0246, L2-4289, and L7-5459, and Internet Institute, Ltd.

References

[1] P. Bhat, S. Nagata, L. Campoy et al., "LTE-advanced: an operator perspective," IEEE Communications Magazine, vol. 50, no. 2, pp. 104–114, 2012.

[2] H. Kopetz, "Internet of things," in Real-Time Systems, pp. 307–323, Springer, New York, NY, USA, 2011.

[3] S. Y. Hui and K. H. Yeung, "Challenges in the migration to 4G mobile systems," IEEE Communications Magazine, vol. 41, no. 12, pp. 54–59, 2003.

[4] B. Peternel and A. Kos, "Broadband access network planning optimization considering real copper cable lengths," IEICE Transactions on Communications, vol. 91, no. 8, pp. 2525–2532, 2008.

[5] S. Geirhofer and P. Gaal, "Coordinated multipoint transmission in 3GPP LTE heterogeneous networks," in Proceedings of the IEEE Globecom Workshops (GC Wkshps '12), pp. 608–612, Anaheim, Calif, USA, December 2012.

[6] G. Gomez, J. Lorca, R. García, and Q. Pérez, "Towards a QoE-driven resource control in LTE and LTE-A networks," Journal of Computer Networks and Communications, vol. 2013, Article ID 505910, 15 pages, 2013.

[7] I. Humar, X. Ge, L. Xiang, M. Jo, M. Chen, and J. Zhang, "Rethinking energy efficiency models of cellular networks with embodied energy," IEEE Network, vol. 25, no. 2, pp. 40–49, 2011.

[8] QMON project webpage, http://www.qmon.eu.

[9] A. Imran and A. Zoha, "Challenges in 5G: how to empower SON with big data for enabling 5G," IEEE Network, vol. 28, no. 6, pp. 27–33, 2014.

[10] Speedtest.net webpage, July 2015, http://www.speedtest.net/.

[11] Grenouille project webpage, 2015, http://grenouille.com/.

[12] RIPE Atlas—RIPE Network Coordination Centre, 2015, https://atlas.ripe.net/.

[13] Q. A. Chen, H. Luo, S. Rosen et al., "QoE doctor: diagnosing mobile app QoE with automated UI control and cross-layer analysis," in Proceedings of the Conference on Internet Measurement Conference (IMC '14), pp. 151–164, ACM, Vancouver, Canada, November 2014.

[14] H. Yin, Y. Jiang, C. Lin, Y. Luo, and Y. Liu, "Big data: transforming the design philosophy of future internet," IEEE Network, vol. 28, no. 4, pp. 14–19, 2014.

[15] L. Vidmar, M. Stular, A. Kos, and M. Pogacnik, "An automatic Wi-Fi-based approach for extraction of user places and their context," International Journal of Distributed Sensor Networks, vol. 2015, Article ID 154958, 15 pages, 2015.

[16] PhantomJS project webpage, July 2015, http://phantomjs.org/.

[17] E. Alevizos and A. Artikis, "Being logical or going with the flow? A comparison of complex event processing systems," in Artificial Intelligence: Methods and Applications, vol. 8445 of Lecture Notes in Computer Science, pp. 460–474, Springer, Berlin, Germany, 2014.
