Ziyi Zhu, Shijia Yan, Madeleine Strom Glick, Min Yee Teh, and Keren Bergman
Lightwave Research Lab, Columbia University
New York, NY, US
Email: [email protected]
Silicon Photonic Switch-Enabled Server Regrouping Using Bandwidth Steering for Distributed Deep Learning Training
Rev PA1
Motivation
[Figure: tree topology with ToR, aggregation, and core layers of electrical packet switches (EPS, numbered 1 to 7) above servers 1 to 16. Legend: servers, logically grouped servers.]
• Servers under the same top-of-rack (ToR) switch can use the full bandwidth, but traffic
across racks experiences constrained bandwidth
• Distributed deep learning workloads can require many server nodes and show strong
communication patterns between these nodes
Motivation
[Figure: the same ToR, aggregation, and core topology (EPS 1 to 7, servers 1 to 16) with a SiP OCS providing bandwidth steering above the ToR.]
• Previous work [1-2] shows that silicon photonic (SiP) based bandwidth steering can mitigate the bottleneck at the core network level
• However, bandwidth steering above the ToR alone is not ideal: it does not improve job locality
Legend: servers; logically grouped servers; EPS; OCS: Optical Circuit Switch
[1] Michelogiannakis, George, et al. "Bandwidth steering in HPC using silicon nanophotonics." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2019.
[2] Shen, Yiwen, et al. "Accelerating of high performance data centers using silicon photonic switch-enabled bandwidth steering." 2018 European Conference on Optical Communication (ECOC). IEEE, 2018.
Motivation
[Figure: ToR, aggregation, and core topology (EPS 1 to 7, servers 1 to 16) with SiP OCSs used both for bandwidth steering above the ToR and for server regrouping. Legend: fixed servers, regrouped servers, EPS.]
• In this work, SiP OCSs are also proposed to be inserted between the ToR EPSs and the
servers
• SiP-enabled bandwidth steering above the ToR switches can still be applied when the
port count of the OCS is limited
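The effect of regrouping on job locality can be illustrated with a small placement sketch. It assumes a hypothetical model in which an OCS between the ToR EPSs and the servers can attach any server to any ToR, limited only by a per-ToR port count. The function names, the example jobs, and the greedy strategy are illustrative assumptions, not the authors' algorithm.

```python
from itertools import combinations


def cross_rack_pairs(placement, jobs):
    """Count server pairs in the same job that sit under different ToRs."""
    return sum(
        1
        for servers in jobs.values()
        for a, b in combinations(sorted(servers), 2)
        if placement[a] != placement[b]
    )


def regroup(jobs, num_tors, ports_per_tor):
    """Greedy sketch: place the largest jobs first, filling one ToR at a time.

    Hypothetical model: any server can be attached to any ToR through the OCS,
    limited only by ports_per_tor. Only servers that belong to a job are placed.
    """
    free = {tor: ports_per_tor for tor in range(num_tors)}
    placement = {}
    for _, servers in sorted(jobs.items(), key=lambda kv: -len(kv[1])):
        pending = sorted(servers)
        while pending:
            tor = max(free, key=lambda t: free[t])  # ToR with most free ports
            n = free[tor]
            if n == 0:
                raise ValueError("not enough ToR ports for all job servers")
            take, pending = pending[:n], pending[n:]
            for server in take:
                placement[server] = tor
            free[tor] -= len(take)
    return placement


# Fixed wiring: servers 1..16, four per ToR (servers 1-4 under ToR 0, and so on).
fixed = {s: (s - 1) // 4 for s in range(1, 17)}
# Two hypothetical jobs, each spread over all four racks.
jobs = {"job_a": {1, 5, 9, 13}, "job_b": {2, 6, 10, 14}}

regrouped = regroup(jobs, num_tors=4, ports_per_tor=4)
print("cross-rack pairs, fixed:    ", cross_rack_pairs(fixed, jobs))
print("cross-rack pairs, regrouped:", cross_rack_pairs(regrouped, jobs))
```

In this toy case every pair of servers within a job crosses racks under the fixed wiring, while after regrouping each job fits under a single ToR.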
Silicon Photonic Switches and Switching
[Figure: 2 x 2 x 2λ microring-assisted space-and-wavelength selective switch, reprinted from [3]]
[Figure: 64 x 64 Mach–Zehnder interferometer (MZI)-based switch, reprinted from [4]]
[Figure: 240 x 240 switch implemented by micro-electromechanical system (MEMS)-actuated directional couplers, reprinted from [5]]
• CMOS compatible manufacturing processes
• Small footprint
• Promise of power-efficient interconnects with low fabrication cost
• Various switching types: spatial, wavelength selective, and space-and-wavelength selective
* Demonstrate and develop the control plane of SiP based switching for datacenter networks
[3] Huang, Yishen, et al. "Push-pull microring-assisted space-and-wavelength selective switch." Optics Letters 45.10 (2020): 2696-2699.
[4] Chu, Tao, et al. "Fast, high-radix silicon photonic switches." 2018 Optical Fiber Communications Conference and Exposition (OFC). IEEE, 2018.
[5] Seok, Tae Joon, et al. "Wafer-scale silicon photonic switches beyond die size limit." Optica 6.4 (2019): 490-494.
Silicon Photonic Switch Control
• GPIO: configuration and triggering bits
• Linux/Ubuntu: Xilinx PetaLinux, Ubuntu FS, UIO driver, TCP/IP
• Custom SiP switch daughter card: co-packaged DAC/ADC circuitry, SiP switches, and fiber arrays, interfaced through FMC connectors
Control Plane Workflow
Network Optimization
• Server regrouping
• Bandwidth steering above the ToR
Topology Management
• Link Establishment
• Link Removal
Job/Traffic Requirements
• Logically grouped servers
• Link Monitoring
Electronic Packet Switches; SiP Switch/Network Controllers
Flow Update (OpenFlow); Reconfiguration Request
Network Control Plane
Data Plane
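The link establishment and link removal steps can be viewed as a set difference between the circuits currently configured and the circuits requested by the optimization step. The sketch below assumes circuits are modeled as unordered endpoint pairs. `plan_reconfiguration` and the endpoint names are hypothetical, and no real switch controller or OpenFlow API is called.

```python
def normalize(links):
    """Treat each circuit as an unordered pair of endpoints."""
    return {frozenset(link) for link in links}


def plan_reconfiguration(current, desired):
    """Return (to_remove, to_establish) as sorted lists of endpoint tuples."""
    cur, des = normalize(current), normalize(desired)
    to_remove = sorted(tuple(sorted(link)) for link in cur - des)
    to_establish = sorted(tuple(sorted(link)) for link in des - cur)
    return to_remove, to_establish


# Hypothetical server-to-ToR attachments before and after regrouping.
current = [("server5", "EPS2"), ("server6", "EPS2"), ("server9", "EPS3")]
desired = [("EPS2", "server5"), ("server6", "EPS3"), ("server9", "EPS3")]

to_remove, to_establish = plan_reconfiguration(current, desired)
print("remove:   ", to_remove)
print("establish:", to_establish)
```

Circuits that appear in both sets are left untouched, so only the links that actually change generate reconfiguration requests.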
Distributed Deep Learning Training
Synchronized training: ring allreduce
• torch.distributed: initialize the process group, ip[rank0]: port
• model.to(device) # GPU
• torch.nn.parallel.DistributedDataParallel: gradient synchronization communication
[Figure: four machines (M), Rank 0 to Rank 3]

Asynchronized training: parameter server (PS) and workers
• torch.multiprocessing: multiple process groups
• model.to(device) # GPU; model.share_memory() # shared model
• torch.distributed: initialize a process group, ip[rank0]: port; initialize another process group, ip[rank0’]: port
• torch.distributed.broadcast: distribute weights from PS to worker
• torch.distributed.reduce: collect gradients from worker to PS
[Figure: machines (M) labeled Rank 0, Rank 0’, Rank 1, and Rank 1’, exchanging weights (W) and gradients (G)]
M: Machine; W: Weights; G: Gradients

Neural network: VGG [6]. Dataset: Imagenette (https://github.com/fastai/imagenette)
[6] Simonyan, Karen, et al. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
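The ring allreduce pattern used for synchronized training can be simulated in plain Python to show which data each rank passes to its neighbor. This is a single-process sketch of the textbook algorithm (a reduce-scatter phase followed by an allgather phase). It does not call torch.distributed and is not the implementation used by any particular backend; the function name and the example gradients are assumptions.

```python
def ring_allreduce(vectors):
    """Simulate a sum ring allreduce over equally sized per-rank vectors."""
    n = len(vectors)
    size = len(vectors[0])
    # Chunk c covers indices bounds[c]:bounds[c + 1].
    bounds = [c * size // n for c in range(n + 1)]
    data = [list(v) for v in vectors]

    # Reduce-scatter: in each step, rank r sends chunk (r - step) % n to
    # rank (r + 1) % n, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r - step) % n
            sends.append((r, c, data[r][bounds[c]:bounds[c + 1]]))
        for r, c, payload in sends:
            dst = (r + 1) % n
            for i, x in enumerate(payload):
                data[dst][bounds[c] + i] += x

    # Rank r now holds the fully reduced chunk (r + 1) % n.
    # Allgather: in each step, rank r forwards chunk (r + 1 - step) % n to
    # rank (r + 1) % n, which overwrites its copy with the reduced values.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            sends.append((r, c, data[r][bounds[c]:bounds[c + 1]]))
        for r, c, payload in sends:
            dst = (r + 1) % n
            data[dst][bounds[c]:bounds[c + 1]] = payload

    return data


# Hypothetical gradients held by ranks 0 to 3.
grads = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]
result = ring_allreduce(grads)
expected = [sum(column) for column in zip(*grads)]
print(all(rank == expected for rank in result))
```

Each rank only ever exchanges data with its ring neighbors, which is why the traffic pattern between logically grouped servers is so regular.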
Testbed – Baseline, Server Regrouping, and Bandwidth Steering
[Figure: testbed topologies built from EPS1 to EPS7 and servers 5, 6, 9, and 10, in three panels: Baseline (label: Aggregated); Server Regrouping (one SiP OCS; label: Released); Server Regrouping + Bandwidth Steering Above the ToR (two SiP OCSs; label: Released). Legend: electronic packet switches, servers, GPU servers.]
Experimental Setup
[Figure: optical spectra measured at receivers RX1 to RX6 for Configuration 1 and Configuration 2. Axes: wavelength (nm), 1544 to 1556; power (dBm), -20 to -70.]
PC: Polarization Controller; EDFA: Erbium-Doped Fiber Amplifier; DUX: Optical Multiplexer; AMP: Electrical Amplifier; DAC: Digital to Analog Converter
λ1 = 1545.32 nm, λ2 = 1546.92 nm, λ3 = 1553.33 nm, λ4 = 1554.94 nm, λ5 = 1554.94 nm, λ6 = 1556.55 nm
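For reference, the listed wavelengths can be converted to optical frequencies with f = c / λ. The short sketch below uses the distinct values from the list above (λ4 and λ5 are identical as listed, so five values remain). The function and variable names are illustrative.

```python
C_M_PER_S = 299_792_458  # speed of light in vacuum, m/s


def wavelength_nm_to_thz(wavelength_nm):
    """Convert a vacuum wavelength in nanometers to a frequency in terahertz."""
    return C_M_PER_S / (wavelength_nm * 1e-9) / 1e12


# Distinct wavelengths as listed on the slide, in nm.
wavelengths_nm = [1545.32, 1546.92, 1553.33, 1554.94, 1556.55]
freqs_thz = [wavelength_nm_to_thz(w) for w in wavelengths_nm]

for w, f in zip(wavelengths_nm, freqs_thz):
    print(f"{w:.2f} nm -> {f:.2f} THz")
```

Adjacent listed channels such as λ1 and λ2 come out about 0.2 THz (200 GHz) apart.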
Acknowledgement:
This work was partly supported by the U.S. Department of Energy (DoE) SBIR Photonic-Storage Subsystem Input/Output (P-SSIO) Interface Project, by the Advanced Research Projects Agency-Energy (ARPA-E) under the Enlightened Project, and by the National Security Agency (NSA) Laboratory for Physical Sciences (LPS) Research Initiative.
Thank you