Network traffic classification: From theory to practice
![Page 1: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/1.jpg)
Network traffic classification: From theory to practice
Pere Barlet-Ros
Associate Professor at UPC BarcelonaTech
Co-founder and Chairman at Polygraph.io
Joint work with: Valentín Carela-Español, Tomasz Bujlow and Josep Solé-Pareta
![Page 2: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/2.jpg)
Background
• What do we refer to as traffic classification?
– Identifying the application that generated each flow
• What is traffic classification used for?
– Network planning and dimensioning
– Per-application performance evaluation
– Traffic steering / QoS / SLA validation
– Charging and billing
![Page 3: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/3.jpg)
State of the Art: Ports
• Port-based
– Computationally lightweight
– Payloads not needed
– Easy to understand and program
– Low accuracy and completeness
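As a rough illustration of the port-based approach, the sketch below looks up a flow's ports in a small table. The table is a hand-picked subset of well-known port assignments, and the function name `classify_by_port` is hypothetical; it is not taken from the slides.

```python
# Illustrative subset of well-known ports, keyed by (IP protocol, port).
# Protocol numbers: 6 = TCP, 17 = UDP.
WELL_KNOWN_PORTS = {
    (6, 22): "ssh",
    (6, 25): "smtp",
    (6, 80): "http",
    (6, 443): "https",
    (6, 53): "dns",
    (17, 53): "dns",
}


def classify_by_port(protocol, src_port, dst_port):
    """Return a label based only on the ports, or "unknown"."""
    # Check the destination port first, then the source port, so that
    # both directions of a client/server flow get the same label.
    for port in (dst_port, src_port):
        label = WELL_KNOWN_PORTS.get((protocol, port))
        if label is not None:
            return label
    return "unknown"


print(classify_by_port(6, 51234, 443))   # https
print(classify_by_port(6, 51234, 6881))  # unknown
```

The second call hints at the completeness problem: traffic on a port that is not in the table stays unlabelled, and anything that happens to use port 443 is labelled `https` regardless of the real application.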
![Page 4: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/4.jpg)
State of the Art: DPI
• Deep packet inspection (DPI)
– High accuracy and completeness
– Computationally expensive
– Needs payload access
– Privacy concerns
– Cannot work with encrypted traffic
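A minimal sketch of signature-based DPI, assuming the first payload bytes of a flow are available. The three patterns are simplified illustrations written for this example; real DPI tools use far larger and more careful signature sets, and `classify_payload` is a hypothetical name.

```python
import re

# Simplified, illustrative signatures matched against the start of the payload.
SIGNATURES = [
    ("http", re.compile(rb"^(GET|POST|HEAD|PUT|DELETE|OPTIONS) \S+ HTTP/1\.[01]\r\n")),
    ("ssh", re.compile(rb"^SSH-\d\.\d+-")),
    ("bittorrent", re.compile(rb"^\x13BitTorrent protocol")),
]


def classify_payload(payload: bytes) -> str:
    """Return the label of the first matching signature, or "unknown"."""
    for label, pattern in SIGNATURES:
        if pattern.match(payload):
            return label
    return "unknown"


print(classify_payload(b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n"))  # http
print(classify_payload(b"\x16\x03\x01\x02\x00\x01\x00\x01\xfc"))  # unknown
```

The second payload starts like a TLS record: once the application data is encrypted, plaintext signatures such as these have nothing to match, which is the limitation listed on the slide.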
![Page 5: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/5.jpg)
State of the Art: ML
• Machine Learning
– High accuracy and completeness
– Computationally viable
– Payloads not needed
– Can work with encrypted traffic
– Needs retraining
![Page 6: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/6.jpg)
Main limitations of ML-TC
• Adoption in real products and operational environments is limited and slow
– Current proposals suffer from practical problems
– Actual products rely on simpler methods or DPI
• We identified 3 main real-world problems
1) The deployment problem
2) The maintenance problem
3) The validation problem
![Page 7: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/7.jpg)
1) Deployment problem
• Current solutions are difficult to deploy
– Need dedicated hardware appliances / probes
– Need packet-level access (e.g. to compute features, …)
• How to address this problem?
– Work with flow-level data (e.g. NetFlow)
– Support packet sampling (e.g. Sampled NetFlow)
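To make the effect of packet sampling concrete, the sketch below samples each packet of a flow independently with a given probability and scales the sampled counters back up by the inverse of the sampling rate. Independent random sampling and the function name `sampled_flow_estimate` are assumptions of this example; routers may sample differently (for instance, deterministically 1 out of N).

```python
import random


def sampled_flow_estimate(packet_sizes, sampling_rate, rng):
    """Estimate packet and byte counts of one flow from sampled packets.

    Returns None when no packet of the flow is sampled, i.e. the flow
    never shows up in the exported records.
    """
    sampled = [size for size in packet_sizes if rng.random() < sampling_rate]
    if not sampled:
        return None
    return {
        "packets": len(sampled) / sampling_rate,
        "bytes": sum(sampled) / sampling_rate,
    }


rng = random.Random(0)
flow = [1500] * 1000  # hypothetical flow: 1000 packets of 1500 bytes
print(sampled_flow_estimate(flow, 1.0, rng))   # exact counters
print(sampled_flow_estimate(flow, 0.01, rng))  # noisy estimate
print(sampled_flow_estimate([60, 60], 0.01, rng))  # short flows are usually missed
```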
![Page 8: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/8.jpg)
NetFlow w/o sampling
• Challenge: NetFlow v5 features are very limited
– IPs, ports, protocol, TCP flags, duration, #pkts, …
• State-of-the-art ML technique: C4.5 decision tree
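C4.5 picks the attribute test at each tree node by its gain ratio (information gain divided by split information). The sketch below computes that criterion for a binary threshold test on one numeric feature. It covers only the split criterion, not full tree induction or pruning, and the sample values and labels are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary test `value <= threshold`."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    if not left or not right:
        return 0.0  # the test does not split the data at all
    total = len(labels)
    weights = (len(left) / total, len(right) / total)
    info_gain = (entropy(labels)
                 - weights[0] * entropy(left)
                 - weights[1] * entropy(right))
    split_info = -sum(w * log2(w) for w in weights)
    return info_gain / split_info


# Hypothetical flows described by a single feature: the destination port.
dst_ports = [80, 80, 6881, 6881]
apps = ["http", "http", "p2p", "p2p"]
print(gain_ratio(dst_ports, apps, threshold=1000))   # 1.0: a perfect split
print(gain_ratio(dst_ports, apps, threshold=10000))  # 0.0: no split
```

In a full classifier the same computation would be repeated for every candidate feature (ports, protocol, TCP flags, duration, packet count, and so on) and threshold, keeping the test with the highest gain ratio.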
![Page 9: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/9.jpg)
Results (NetFlow w/o sampling)
• UPC dataset (publicly available)
– 7 x 15 min traces from UPC access link
– Collected on different days and at different hours
– Labelled with L7-filter (strict version with lower FPR)
![Page 10: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/10.jpg)
Results (Sampled NetFlow)
• Impact of packet sampling
![Page 11: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/11.jpg)
Sources of inaccuracy
1) Error in the estimation of the traffic features
2) Changes in flow size distribution
3) Changes in flow splitting probability
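The first two sources can be illustrated with a little arithmetic, assuming each packet is sampled independently with probability p (an assumption of this sketch, not a statement about the sampling scheme used in the study). A flow of n packets is then seen at all with probability 1 - (1 - p)^n, and the scaled-up packet count k/p, where k is the number of sampled packets, has relative standard error sqrt((1 - p) / (n p)). Flow splitting is not modelled here.

```python
from math import sqrt


def detection_probability(n_packets, p):
    """Probability that at least one packet of the flow is sampled."""
    return 1.0 - (1.0 - p) ** n_packets


def relative_std_error(n_packets, p):
    """Relative standard error of the scaled-up packet count k / p,
    where k ~ Binomial(n_packets, p)."""
    return sqrt((1.0 - p) / (n_packets * p))


p = 0.01  # hypothetical sampling rate of 1/100
for n in (1, 10, 100, 1000, 10000):
    print(n, round(detection_probability(n, p), 4), round(relative_std_error(n, p), 3))
```

Short flows are mostly missed, so the flows that survive sampling are biased towards large ones (a changed flow size distribution), and the features of the surviving short flows are estimated with a large error.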
![Page 12: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/12.jpg)
Solution (Sampled NetFlow)
![Page 13: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/13.jpg)
Deployment problem: Summary
• Current proposals are difficult to deploy
• Proposed a simple but effective technique
– Supports standard NetFlow data
– Supports packet sampling
• Main limitation: Needs to be frequently retrained
V. Carela-Español, P. Barlet-Ros, A. Cabellos-Aparicio, J. Solé-Pareta. Analysis of the impact of sampling on NetFlow traffic classification. Computer Networks, 55(5), 2011.
![Page 14: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/14.jpg)
2) Maintenance problem
• Difficult to keep classification model updated
– Traffic changes, application updates, new applications
– Involve significant human intervention
– ML models need to be frequently retrained
• Possible solution to the problem
– Make retraining automatic
– Computationally viable
– Without human intervention
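One possible way to trigger retraining without human intervention is sketched below: compare the classifier's output with reference labels on a small sample of flows and request retraining when the accuracy over a sliding window drops below a threshold. The class name, window size and threshold are hypothetical choices for this example, and the sketch assumes reference labels are available from some other source.

```python
from collections import deque


class RetrainingMonitor:
    """Tracks accuracy on recently labelled flows and flags when to retrain."""

    def __init__(self, window=1000, min_accuracy=0.95):
        self.results = deque(maxlen=window)  # True where prediction matched
        self.min_accuracy = min_accuracy

    def record(self, predicted, reference):
        self.results.append(predicted == reference)

    def accuracy(self):
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def needs_retraining(self):
        # Decide only once the window is full, to avoid reacting to a
        # handful of early mistakes.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.accuracy() < self.min_accuracy


monitor = RetrainingMonitor(window=4, min_accuracy=0.75)
for predicted, reference in [("http", "http"), ("dns", "dns"),
                             ("http", "http"), ("http", "p2p")]:
    monitor.record(predicted, reference)
print(monitor.accuracy(), monitor.needs_retraining())  # 0.75 False
monitor.record("http", "p2p")
print(monitor.accuracy(), monitor.needs_retraining())  # 0.5 True
```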
![Page 15: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/15.jpg)
Autonomic Traffic Classification
• Lightweight DPI for retraining
– Small traffic sample (e.g. 1/10000 flow sampling)
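Flow sampling differs from packet sampling in that a flow is either kept entirely or dropped entirely. A common way to get this behaviour is to hash the flow's 5-tuple, so every packet of a flow gets the same decision. The sketch below uses CRC32 for this; the hash choice and the function name `flow_is_sampled` are assumptions of the example, as the slides only state the sampling rate.

```python
import zlib


def flow_is_sampled(five_tuple, one_in_n=10000):
    """Keep roughly 1 out of `one_in_n` flows, decided by a hash of the 5-tuple.

    All packets of the same flow map to the same decision.
    """
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key) % one_in_n == 0


# Hypothetical flow: (src IP, dst IP, src port, dst port, protocol).
flow = ("10.0.0.1", "192.0.2.7", 51234, 443, 6)
print(flow_is_sampled(flow))              # same answer for every packet of the flow
print(flow_is_sampled(flow, one_in_n=1))  # True: every flow is kept
```

Only the selected flows would need to go through DPI to obtain labels for retraining, which keeps the inspection cost small.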
![Page 16: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/16.jpg)
Evaluation
• 14-day trace collected at CESCA
![Page 17: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/17.jpg)
Temporal/Spatial obsolescence
• Comparison without autonomic retraining
![Page 18: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/18.jpg)
Maintenance problem: Summary
• Existing classifiers need periodic retraining
– Temporal obsolescence: Changes in application traffic
– Spatial obsolescence: Different networks
• Autonomic traffic classification system
– Easy to deploy: Works with Sampled NetFlow
– Easy to maintain: Lightweight DPI for self-training
V. Carela-Español, P. Barlet-Ros, O. Mula-Valls, J. Solé-Pareta. An autonomic traffic classification system for network operation and management. Journal of Network and Systems Management, 23(3):401-419, 2015.
![Page 19: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/19.jpg)
3) Validation problem
• Current proposals are difficult to validate, compare and reproduce
– Private datasets
– Different ground-truth generators
• Our contribution
– Publication of labeled datasets (with payloads)
– Common benchmark to validate/compare/reproduce
– Validation of common ground-truth generators
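With a labelled dataset, comparing tools reduces to checking each tool's output against the ground truth, application by application. The sketch below computes, for every application, the fraction of its flows that a tool labelled correctly. The metric and the toy labels are assumptions for illustration and do not reproduce the exact evaluation procedure of the study.

```python
from collections import Counter


def per_application_accuracy(ground_truth, predicted):
    """Fraction of correctly labelled flows for each ground-truth application."""
    totals = Counter(ground_truth)
    correct = Counter(t for t, p in zip(ground_truth, predicted) if t == p)
    return {app: correct[app] / totals[app] for app in totals}


# Toy example: four flows with known labels and one tool's output.
truth = ["http", "http", "dns", "skype"]
tool_output = ["http", "unknown", "dns", "http"]
print(per_application_accuracy(truth, tool_output))
# {'http': 0.5, 'dns': 1.0, 'skype': 0.0}
```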
![Page 20: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/20.jpg)
Proposal
• Reliable labeled dataset with full payloads
– Accurate: VBS (label from the application socket)
– Avoid privacy issues: Realistic artificial traffic
![Page 21: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/21.jpg)
Methodology
• Manually generate representative traffic
– Create fake accounts (e.g. Gmail, Facebook, Twitter)
– Interact with the service simulating human behavior (e.g. posting, chatting, gaming, watching videos, …)
![Page 22: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/22.jpg)
Dataset
• > 750K flows, ~55 GB of data
![Page 23: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/23.jpg)
DPI tools compared
![Page 24: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/24.jpg)
Application protocols
![Page 25: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/25.jpg)
Applications
![Page 26: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/26.jpg)
Web services (summary)
• PACE: 16/34 (6 over 80%)
• nDPI: 10/34 (6 over 80%)
• OpenDPI: 2/34
• Libprotoident: 0/34
• L7-filter: 0/44 (high FPR)
• NBAR: 0/34
![Page 27: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/27.jpg)
Validation problem: Summary
• Comparison of the most popular ground-truth generators
– PACE: Best results at all classification levels
– Libprotoident: Very good results at application/protocol level
– nDPI: Good results, web services level, open source
– NBAR and L7-filter: Very poor results
• Dataset including payloads is publicly available
– http://www.cba.upc.edu/monitoring/traffic-classification (also includes all other datasets presented in these slides)
– Common benchmark to validate, compare and reproduce
T. Bujlow, V. Carela-Español, P. Barlet-Ros. Independent comparison of popular DPI tools for traffic classification. Computer Networks, 76:75-89, 2015.
V. Carela-Español, T. Bujlow, P. Barlet-Ros. Is our ground-truth for traffic classification reliable? In Proc. of Passive and Active Measurement Conf. (PAM), 2014.
![Page 28: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/28.jpg)
Network Polygraph
• Addressed 3 practical problems
– The deployment problem (Sampled NetFlow)
– The maintenance problem (Autonomic retraining)
– The validation problem (Labeled payload traces)
• We identified interest in the market
– We created a UPC spin-off: https://polygraph.io
– Several customers world-wide
P. Barlet-Ros, J. Sanjuàs, V. Carela-Español. Network Polygraph: A cloud-based network visibility service. In ACM SIGCOMM Conf., Industrial Demo, 2015.
![Page 29: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/29.jpg)
Why Network Polygraph?
• Other products are expensive and difficult to deploy
– Can only be afforded by large operators, ISPs, …
– A large portion of the market consists of SMEs (>90% in EU)
• Our technology based on Sampled NetFlow only needs a small volume of traffic data
– <0.5% of extra bandwidth usage
– Can be provided as a service from the cloud (SaaS)
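A back-of-the-envelope estimate shows why exporting flow records is cheap compared with the traffic itself. The sketch assumes NetFlow v5 export (24-byte header, 48-byte records, up to 30 records per datagram) plus 28 bytes of IPv4 and UDP headers, and ignores link-layer framing. The flow rate and link speed are hypothetical inputs, and this is not how the figure on the slide was obtained.

```python
V5_HEADER_BYTES = 24
V5_RECORD_BYTES = 48
V5_RECORDS_PER_DATAGRAM = 30
IP_UDP_HEADER_BYTES = 28  # 20-byte IPv4 header + 8-byte UDP header


def netflow_v5_export_bps(flows_per_second):
    """Approximate export bandwidth in bits/s, assuming full datagrams."""
    datagrams_per_second = flows_per_second / V5_RECORDS_PER_DATAGRAM
    bytes_per_second = (flows_per_second * V5_RECORD_BYTES
                        + datagrams_per_second * (V5_HEADER_BYTES + IP_UDP_HEADER_BYTES))
    return bytes_per_second * 8


def overhead_fraction(flows_per_second, monitored_traffic_bps):
    """Export bandwidth as a fraction of the monitored traffic."""
    return netflow_v5_export_bps(flows_per_second) / monitored_traffic_bps


# Hypothetical site: 100 exported flow records per second on 100 Mb/s of traffic.
print(f"{overhead_fraction(100, 100e6):.4%}")
```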
![Page 30: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/30.jpg)
Visibility-to-cost ratio
(Chart with axes labelled "cost" and "visibility")
![Page 31: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/31.jpg)
Website + On-Line Demo
https://polygraph.io
![Page 32: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/32.jpg)
traffic volume, breakdown by application
![Page 33: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/33.jpg)
HTTP services
![Page 34: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/34.jpg)
top talkers (addresses, ports, autonomous systems)
![Page 35: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/35.jpg)
subnetwork-level bandwidth hogs
![Page 36: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/36.jpg)
traffic geolocation (origins & destinations)
![Page 37: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/37.jpg)
anomaly and attack detection with automatic baselining
![Page 38: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/38.jpg)
indexed traffic database for forensic analysis
![Page 39: Network traffic classification: From theory to practice](https://reader033.fdocuments.us/reader033/viewer/2022042922/626ab4141d33717a5a3d1ec1/html5/thumbnails/39.jpg)
Network Polygraph
Talaia Networks, S.L.
K2M – Parc UPC Campus Nord
Jordi Girona, 1-3
Barcelona (08034)
Spain
Telephone: +34 93 405 45 87
https://polygraph.io