Why the Address Translation Scheme Matters
Jiaqing Du
Address Translation/Mapping
• Where is 0x1f344000?
• DRAM Devices
– A multi-dimensional array
– Inside a DIMM: Rank, Bank, Row, Column
– Among DIMMs: Memory Controller, Channel
Inside Memory Controller
• Accesses to Different Parts == High Parallelism == High Throughput
• Exploits the locality of accesses
– Logically adjacent addresses map to physically distant parts
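A memory controller carves a physical address into fields that select these dimensions. The exact bit layout is what this talk reverse-engineers; the sketch below uses a made-up layout purely to illustrate the idea (every bit position here is an assumption, not the real scheme):

```python
# Decode a physical address under a HYPOTHETICAL bit layout.
# Real layouts differ per chipset; discovering them is the point of the talk.
FIELDS = [               # (name, shift, width) -- all assumed, low bits first
    ("offset",  0,  6),  # byte within a 64B cache line
    ("channel", 6,  1),  # which of 2 channels
    ("rank",    7,  1),  # which of 2 ranks
    ("bank",    8,  2),  # which of 4 banks
    ("column", 10, 10),
    ("row",    20, 12),
]

def decode(addr):
    """Split addr into the assumed DRAM coordinates."""
    return {name: (addr >> shift) & ((1 << width) - 1)
            for name, shift, width in FIELDS}

print(decode(0x1f344000))  # the example address from the slides
```

Flipping a single address bit then changes exactly one coordinate, which is what the disclosure experiment later exploits.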
Agenda
• A Scalable Software Router
• Performance of a Commodity Server
• Memory Translation Disclosure
– Experiment Design
– Experiment Result
• Understanding the Imbalance
• Possible Solutions
• Conclusion
A Scalable Software Router
• A Valiant Load-Balanced Mesh
• Aggregate Throughput: N x R (bps)
[Figure: N-node full mesh; each node terminates an external line of rate R, and each internal mesh link carries 2R/N]
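The rates follow from Valiant load balancing: each of the N nodes terminates an external line of rate R, and two-hop routing spreads that traffic evenly, so every internal link only needs rate 2R/N while the mesh forwards N x R in aggregate. A quick arithmetic check (the concrete numbers below are illustrative, not from the talk):

```python
def valiant_rates(n_nodes, line_rate):
    """Aggregate throughput and required internal-link rate
    for an n-node Valiant load-balanced full mesh."""
    aggregate = n_nodes * line_rate      # N x R total forwarding capacity
    internal = 2 * line_rate / n_nodes   # each mesh link carries 2R/N
    return aggregate, internal

# e.g. 8 nodes with 10Gbps external lines (illustrative numbers)
agg, link = valiant_rates(8, 10)
print(agg, link)  # 80 Gbps aggregate, 2.5 Gbps per internal link
```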
Performance of A Commodity PC
• Experiment Environment
– 2 Xeon 1.6GHz sockets, 4 cores/socket
– Each pair of cores shares an 8MB L2 cache
– 1GHz FSB, 8GB DDR2 667MHz
– 2 memory controllers manage 4 channels
– 4 quad-port 1Gbps NICs (16 ports)
– Click 1.6.0 on Linux 2.6.19
• Simple “Point-to-Point” Forwarding
• A Chipset Monitoring Tool (Emon)
Performance of A Commodity PC
• Maximum Loss-free Forwarding Rate
– 16Gbps input
Performance of A Commodity PC
• Memory Load Distribution
• My work is to dig further
– Explain the imbalance
– But we don’t know how an address is mapped :(
[Chart: per-channel memory load for the stream benchmark and for 1024B and 64B packet forwarding]
Disclose Address Translation
• What Do We Want?
– Which bits select the channel, rank, bank, …
– What parallelism really gives us
• What Do We Have?
– Emon: tells us throughput and load distribution
• What Do We Need?
– Enough traffic to one single memory location
– Enough traffic to two memory locations,
e.g., 0x1f344000 and 0x1f34100
Disclose Address Translation
• Artificial Memory Access Patterns
– One writing flow to ADDR1
– Two writing flows: one to ADDR1,
the other to ADDR1+2^b (b = 0, 1, …, 31)
• Utilize the Cache
– Cache coherency protocol (MESI)
– Bind two threads to two cores that don’t share an L2,
and force them to keep writing to one location
– A write to an invalid cache line goes directly to memory
– Two threads generate one writing flow
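Sweeping b over every bit position yields 32 probe pairs; if flipping bit b leaves throughput unchanged, both flows hit the same channel/rank/bank, and if throughput doubles, bit b selects some parallel resource. A sketch of the pair generation (ADDR1 is the example address from the slides; the sweep is the experiment design above):

```python
ADDR1 = 0x1f344000  # example base address from the slides

def probe_pairs(base, max_bit=32):
    """(base, base + 2^b) for b = 0..max_bit-1: flipping exactly one
    address bit per experiment isolates what that bit selects."""
    return [(base, base + (1 << b)) for b in range(max_bit)]

for a, b in probe_pairs(ADDR1)[:3]:
    print(hex(a), hex(b))
```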
Disclose Address Translation
• Experiment Result
– ADDR1, ADDR1+2^b
Understand the Imbalance
• Memory Management
– Pre-allocated 2KB socket buffers
– Reclaimed & reallocated by the kernel
• A limited number of buffers serves all packets.
• A 2KB buffer spans the entire rank-bank grid.
• Large Packets (1024B)
– Cover at least half of the grid (high parallelism)
• Small Packets (64B)
– Hit only some elements w.h.p. (poor parallelism)
• In the real world, it is even worse.
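The imbalance can be sketched with a toy model: assume a 2KB buffer is striped across 8 rank-bank elements at 256B granularity (both numbers are assumptions for illustration, not measured values). Packets land at the start of a buffer, so packet length alone decides how many elements receive traffic:

```python
GRID_ELEMENTS = 8   # assumed rank x bank elements a 2KB buffer spans
STRIDE = 256        # assumed interleave granularity: 2KB / 8
CACHE_LINE = 64

def elements_touched(pkt_bytes):
    """Grid elements hit when a packet of pkt_bytes lands at offset 0."""
    return {(off // STRIDE) % GRID_ELEMENTS
            for off in range(0, pkt_bytes, CACHE_LINE)}

print(len(elements_touched(1024)))  # large packets: half the grid
print(len(elements_touched(64)))    # small packets: a single element
```

Under this model, 64B packets pile all their traffic onto one element while 1024B packets spread across four, which matches the shape of the observed imbalance.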
Understand the Imbalance
[Figure: memory pool buffers and the mapped rank-bank grid; a 1024B packet covers most grid elements, a 64B packet only the first]
What Can We Do?
• Hack the Network Adapter Driver
– Introduce a random offset
[Figure: random start offsets spread packets from the memory pool across the mapped grid]
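Under the same toy model as before, the driver hack amounts to shifting each buffer's start by a random multiple of the interleave stride, so small packets no longer all land on element 0 (GRID_ELEMENTS and STRIDE are the assumed toy values, not real hardware parameters):

```python
import random

GRID_ELEMENTS = 8   # assumed toy rank x bank grid, as before
STRIDE = 256        # assumed interleave granularity

def element_of(addr_offset):
    """Grid element a given byte offset maps to in the toy model."""
    return (addr_offset // STRIDE) % GRID_ELEMENTS

def random_offset():
    """Driver-introduced random start offset, stride-aligned."""
    return random.randrange(GRID_ELEMENTS) * STRIDE

random.seed(1)
hits = {element_of(random_offset()) for _ in range(100)}
print(sorted(hits))  # 64B packets now spread over many grid elements
```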
What Can We Do?
• Hack the Slab Allocator and kmalloc()
– Maintain a special slab
– Provide access through kmalloc()
[Figure: the special slab rotates each buffer's starting element, spreading packets across the mapped grid]
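The special slab can achieve the same spreading deterministically: each successive allocation starts one grid element further along. A sketch of that rotation idea in the toy model (pure illustration, not real slab-allocator code):

```python
GRID_ELEMENTS = 8   # assumed toy grid
STRIDE = 256        # assumed interleave granularity

class RotatingSlab:
    """Toy allocator: each alloc starts at the next grid element, so
    consecutive buffers direct small packets to different elements."""
    def __init__(self):
        self.next = 0

    def alloc_offset(self):
        off = self.next * STRIDE
        self.next = (self.next + 1) % GRID_ELEMENTS
        return off

slab = RotatingSlab()
starts = [slab.alloc_offset() // STRIDE for _ in range(8)]
print(starts)  # 0..7: every grid element used in turn
```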
What Can We Do?
• Maintain buffers of various sizes
– NIC supports multiple descriptor rings
– A hardware feature
[Figure: size-specific buffer pools served by multiple descriptor rings, spread across the mapped grid]
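With multiple descriptor rings, the NIC can draw from one buffer pool per size class, so a 64B packet consumes a small buffer instead of the front of a 2KB one. A sketch of the size-class pick (the class boundaries below are assumptions for illustration):

```python
# Assumed buffer size classes, one per NIC descriptor ring
SIZE_CLASSES = [128, 512, 2048]

def ring_for(pkt_bytes):
    """Pick the smallest buffer class (ring index) that fits the packet."""
    for i, size in enumerate(SIZE_CLASSES):
        if pkt_bytes <= size:
            return i
    raise ValueError("packet larger than the biggest buffer")

print(ring_for(64), ring_for(1024), ring_for(1514))
```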
Conclusion
• Figured out the memory address translation scheme
• Explained the memory load imbalance
• Proposed several possible solutions