Update on Scalable SA Project #OFADevWorkshop Hal Rosenstock Mellanox Technologies.
Update on Scalable SA Project
#OFADevWorkshop
Hal Rosenstock
Mellanox Technologies
The Problem And The Solution
• SA queried for every connection
• Communication between all nodes creates an n² load on the SA
• In InfiniBand architecture (IBA), SA is a centralized entity
• Other n² scalability issues
  – Name to address (DNS)
    • Mainly solved by a hosts file
  – IP address translation
    • Relies on ARPs
• Solution: Scalable SA (SSA)
  – Turns a centralized problem into a distributed one
March 30 – April 2, 2014 #OFADevWorkshop
n^2 SA load
Analysis
SM SA
500 MB
1.6 billion path records
40,000 nodes
50k queries per second
~ 9 hours
~ 1.5 hours calculation
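The path record count and the ~9 hour figure are consistent with simple arithmetic. A minimal sketch, assuming one path record per ordered (source, destination) pair and that the SA sustains the quoted 50k queries per second; the variable names are illustrative only:

```python
# Back-of-envelope check of the figures on the Analysis slide.
nodes = 40_000
queries_per_second = 50_000  # SA query rate quoted on the slide

# All-to-all communication: one path record per (source, destination) pair.
path_records = nodes ** 2

# Time for a single SA to answer every one of those queries.
seconds = path_records / queries_per_second
hours = seconds / 3600

print(f"{path_records:,} path records")
print(f"{hours:.1f} hours at {queries_per_second:,} queries/s")
```

This yields 1.6 billion path records and just under 9 hours of query time.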
SSA Architecture
[Diagram: SSA tree, top to bottom, with the role of each layer]
Core: management
Distribution: database replication
Access: data processing
Client: localized caching
Distribution Tree
• Built with rsockets AF_IB support
• Parent selected based on “nearness” based on hops as well as balancing based on fanouts
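One way to combine the two criteria is sketched below. This is an illustration only, not the SSA selection policy: the `Candidate` fields, the `max_fanout` limit, and the ordering (hop count first, current fanout as tie-breaker) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    hops: int        # distance from the joining node
    children: int    # current fanout of this candidate parent
    max_fanout: int  # hypothetical capacity limit


def select_parent(candidates):
    """Pick the nearest candidate with spare capacity; break ties by load."""
    eligible = [c for c in candidates if c.children < c.max_fanout]
    if not eligible:
        return None
    return min(eligible, key=lambda c: (c.hops, c.children))


candidates = [
    Candidate("access-a", hops=2, children=200, max_fanout=200),  # nearest, but full
    Candidate("access-b", hops=3, children=150, max_fanout=200),
    Candidate("access-c", hops=3, children=90, max_fanout=200),
]
parent = select_parent(candidates)
print(parent.name)
```

Here the nearest candidate is skipped because it is full, and the less loaded of the two equally distant candidates is chosen.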
rsockets AF_IB rsend/rrecv performance
• On “luna” class machines as sender and receiver with 4x QDR links and 1 intervening switch
  – 8 core Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
• Default rsocket tuning parameters
• No CPU utilization measurements yet
• SMDB: ~0.5 GB (for 40K nodes)
Data Transfer Size    Elapsed Time
0.5 GB                0.669 seconds
1.0 GB                1.342 seconds
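The effective throughput implied by these two timings can be derived directly. A small sketch, assuming 1 GB means 10**9 bytes (the slide does not say which convention it uses); the function name is illustrative:

```python
# Effective throughput implied by the rsend/rrecv timings above.
# Assumption: 1 GB = 10**9 bytes.
GB = 10**9
measurements = [(0.5 * GB, 0.669), (1.0 * GB, 1.342)]  # (bytes, seconds)


def throughput_gbit_per_s(num_bytes, seconds):
    """Convert a transfer size and elapsed time into Gbit/s."""
    return num_bytes * 8 / seconds / 1e9


rates = [throughput_gbit_per_s(b, s) for b, s in measurements]
for (num_bytes, seconds), rate in zip(measurements, rates):
    print(f"{num_bytes / GB:.1f} GB in {seconds} s -> {rate:.2f} Gbit/s")
```

Both transfers work out to roughly 6 Gbit/s under that assumption, and the elapsed time scales close to linearly with the transfer size.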
Distribution Tree
• Number of management nodes needed is dependent on subnet size and node capability (CPU speed, memory)
  – Combined nodes
• Fanouts in distribution tree for 40K compute nodes
  – 10 distribution per core
  – 20 access per distribution
  – 200 consumer per access
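Multiplying the fanouts gives the node count at each layer. A quick check, assuming a single core node at the root of the tree:

```python
# Node counts per layer implied by the quoted fanouts (single core node assumed).
distribution_per_core = 10
access_per_distribution = 20
consumers_per_access = 200

distribution_nodes = distribution_per_core
access_nodes = distribution_nodes * access_per_distribution
consumers = access_nodes * consumers_per_access

print(distribution_nodes, access_nodes, consumers)
```

Under that assumption, 10 distribution nodes and 200 access nodes serve the 40,000 compute nodes.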
Core Layer
SM SM’
Nodes join SSA tree
- Core found at SM LID
Raw SM DB to SSA DB
- extraction and comparison
Manage SSA group
- distribution control
- monitoring
- rebalancing
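The comparison step can be pictured as a diff between the previously extracted snapshot and the current one. The sketch below is a generic illustration, not the SSA core code; the record layout (a dict keyed by GUID) and the function name are assumptions.

```python
def compare(previous, current):
    """Diff two extracted snapshots, each a dict keyed by record id."""
    added = {k: v for k, v in current.items() if k not in previous}
    removed = [k for k in previous if k not in current]
    changed = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return added, removed, changed


previous = {"guid-1": {"lid": 1}, "guid-2": {"lid": 2}}
current = {"guid-2": {"lid": 5}, "guid-3": {"lid": 3}}

added, removed, changed = compare(previous, current)
print(added, removed, changed)
```

Only the three resulting sets would need to be pushed down the tree, rather than the whole database.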
Core Performance
• Initial subnet up for ~20K nodes fabric
  – Extraction: 0.228 sec
  – Comparison: 0.599 sec
• SUBNET UP after no change in fabric
  – Extraction: 0.152 sec
  – Comparison: 0.100 sec
• SUBNET UP after single switch unlink and relink
  – Extraction: 0.190 sec
  – Comparison: 0.865 sec
• Measurements above on Intel(R) Xeon(R) CPU E5335 @ 2.00GHz 8 cores & 16G RAM
Distribution Layer
SM SM’
Transaction log
- incremental updates
- lockless
Data agnostic
Distributes SSA DB
- relational data model
- data versioning (epoch value)
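An epoch-versioned transaction log lets a replica fetch only what it has not yet seen. The sketch below illustrates that idea under stated assumptions: the class, the entry layout, and the `sync` helper are hypothetical, and the sketch does not attempt to model the lockless aspect.

```python
class TransactionLog:
    """Append-only update log; each entry carries the epoch it created."""

    def __init__(self):
        self.epoch = 0
        self.entries = []  # (epoch, table, key, value)

    def append(self, table, key, value):
        self.epoch += 1
        self.entries.append((self.epoch, table, key, value))
        return self.epoch

    def since(self, epoch):
        """Incremental updates that a replica at `epoch` has not yet seen."""
        return [entry for entry in self.entries if entry[0] > epoch]


def sync(replica, log):
    """Bring a replica up to date by replaying only the missing entries."""
    for epoch, table, key, value in log.since(replica["epoch"]):
        # Values are copied without being interpreted (data agnostic).
        replica["tables"].setdefault(table, {})[key] = value
        replica["epoch"] = epoch
    return replica


log = TransactionLog()
log.append("node", "guid-1", {"lid": 1})
log.append("node", "guid-2", {"lid": 2})

replica = sync({"epoch": 0, "tables": {}}, log)
log.append("node", "guid-1", {"lid": 7})   # one further change upstream
pending = log.since(replica["epoch"])       # only this entry is outstanding
print(len(pending), replica["epoch"], log.epoch)
```

Comparing epochs is enough to tell whether a replica is current, and replaying `pending` brings it level without retransmitting the full database.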
Access Layer
SM SM’
Epoch value
- lightweight notification
- minimal job impact
Data aware
Formats data
- select SA queries
- higher-level queries
Access Layer Notes
• Calculates SMDB into PRDB on per consumer basis
  – Multicore/CPU computation
• Only updates epoch if PRDB for that consumer has changed
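The "only update the epoch on change" rule can be sketched as follows. Detecting change by comparing a digest of the recalculated PRDB is an assumption made for this illustration, as are the names `ConsumerState` and `publish`; the slides do not say how SSA detects a change.

```python
import hashlib
import json


def digest(prdb):
    """Stable fingerprint of a consumer's path record database."""
    return hashlib.sha256(json.dumps(prdb, sort_keys=True).encode()).hexdigest()


class ConsumerState:
    def __init__(self):
        self.epoch = 0
        self.digest = None


def publish(state, new_prdb):
    """Bump the consumer's epoch only if its PRDB actually changed."""
    new_digest = digest(new_prdb)
    if new_digest == state.digest:
        return False  # nothing changed: the consumer is not disturbed
    state.digest = new_digest
    state.epoch += 1
    return True


state = ConsumerState()
first = publish(state, {"guid-2": {"dlid": 2, "sl": 0}})
repeat = publish(state, {"guid-2": {"dlid": 2, "sl": 0}})   # recalculated, identical
changed = publish(state, {"guid-2": {"dlid": 9, "sl": 0}})  # route moved
print(first, repeat, changed, state.epoch)
```

A fabric event that leaves a given consumer's paths untouched therefore produces no notification for that consumer.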
Access Layer Measurements/Future Improvement(s)
• Half world (HW) PR calculations for 10K node simulated subnet
• Using GUID buckets/core approach, parallelizing HW PR calculation works ~16 times faster on 16 core CPU
  – Single threaded takes 8 min 30 sec for all nodes
  – Multi threaded (thread per core) takes 33 seconds
  – Parallelization will be less than linear with CPU cores
• Future Improvement(s)
  – One HW path record per leaf switch used for all the hosts that are attached to the same leaf switch
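The measured speedup can be checked from the two timings quoted above. A short calculation, with variable names chosen for illustration:

```python
# Speedup and parallel efficiency from the timings on this slide.
single_threaded_s = 8 * 60 + 30   # 8 min 30 sec
multi_threaded_s = 33
cores = 16

speedup = single_threaded_s / multi_threaded_s
efficiency = speedup / cores

print(f"speedup: {speedup:.1f}x, efficiency: {efficiency:.0%}")
```

The result is a speedup of about 15.5x on 16 cores, slightly below linear, which matches the note that parallelization is less than linear with CPU cores.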
Compute Nodes (Consumer/ACM)
SM SM’
Localized cache
- compares epoch
- pull updates
Integrated with IB ACM
- via librdmacm
Publish local data
- hostname
- IP addresses
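The compare-then-pull behaviour of the localized cache can be sketched as below. This is not ibacm code: `LocalCache`, `refresh`, and the `fetch_prdb` callback are hypothetical names, and the real transfer would go over the SSA tree rather than a Python callable.

```python
class LocalCache:
    """Consumer-side cache that pulls data only when the epoch has moved."""

    def __init__(self):
        self.epoch = 0
        self.path_records = {}

    def refresh(self, access_epoch, fetch_prdb):
        """Pull the PRDB only if the access node advertises a different epoch."""
        if access_epoch == self.epoch:
            return False  # cache is current: no transfer
        self.path_records = fetch_prdb()
        self.epoch = access_epoch
        return True


calls = []


def fetch_prdb():
    # Stand-in for the actual pull from the access node.
    calls.append(1)
    return {"guid-2": "path-to-guid-2"}


cache = LocalCache()
cache.refresh(1, fetch_prdb)  # epoch advanced: pull
cache.refresh(1, fetch_prdb)  # same epoch: no pull
print(len(calls), cache.epoch)
```

The epoch comparison is cheap, so a node whose data has not changed does no extra work.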
ACM Notes
• ACM pulls PRDB at daemon startup and when application is resolving routes/paths
  – Minimize OS jitter during running job
• ACM is moving to plugin architecture
  – ACM version 1 (multicast backend)
  – SSA backend
• Other ACM improvements being pursued
  – More efficient cache structure
  – Single underlying PathRecord cache?
Combined Node/Layer Support
• Core and access
• Distribution and access
Reliability
SM SM’
Local databases
- log files for consistency
Primary and backup parents
Error reporting
- parent notifies core of error
System Requirements
• AF_IB capable kernel
  – 3.11 and beyond
• librdmacm with AF_IB and keepalive support
  – Beyond 1.0.18 release
• libibverbs
• libibumad
  – Beyond 1.3.9 release
• OpenSM
  – 3.3.17 release or beyond
OpenMPI
• RDMA CM AF_IB connector contributed to master branch recently
  – Thanks to Vasily Filipov @ Mellanox
  – Need to work out release details
    • Not in 1.7 or 1.6 releases
Deployment
SM SA
IB ACM
- shipped by distros
IB SSA Core package
IB SSA Distribution package
Mgmt Nodes
Compute Nodes
Project Team
• Hal Rosenstock (Mellanox) - Maintainer
• Sean Hefty (Intel)
• Ira Weiny (Intel)
• Susan Colter (LANL)
• Ilya Nelkenbaum (Mellanox)
• Sasha Kotchubievsky (Mellanox)
• Lenny Verkhovsky (Mellanox)
• Eitan Zahavi (Mellanox)
• Vladimir Koushnir (Mellanox)
Development
• Mostly by Mellanox
  – Review by rest of project team
• Verification/regression effort as well
Initial Release
• Path Record Support
• Limitations (Not Part of Initial Release)
  – QoS routing and policy
  – Virtualization (alias GUIDs)
• Preview - June
• Release - December
Future Development Phases
1. IP address and name resolution
   1. Collect <IP address/name, port> up SSA tree
   2. Redistribute mappings
   3. Resolve path records directly from IP address/names
2. Event collection and reporting
   1. Performance monitoring
Summary
• A scalable, distributed SA
• Works with existing apps with minor modification
• Fault tolerant
• Please contact us if interested in deploying this!
#OFADevWorkshop
Thank You