Why the Address Translation Scheme Matters
Jiaqing Du
Address Translation/Mapping
• Where is 0x1f344000?
• DRAM Devices
– A multi-dimensional array
– Inside a DIMM: Rank, Bank, Row, Column
– Among DIMMs: Memory Controller, Channel
Inside Memory Controller
• Accesses to Different Parts == High Parallelism == High Throughput
• Exploits the locality of accesses
– Logically adjacent addresses map to physically distant parts
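A memory controller carves a physical address into fields that select these dimensions. The exact bit layout is what this talk reverse-engineers; the sketch below uses a made-up layout purely to illustrate the idea (every bit position here is an assumption, not the real scheme):

```python
# Decode a physical address under a HYPOTHETICAL bit layout.
# Real layouts differ per chipset; discovering them is the point of the talk.
FIELDS = [               # (name, shift, width) -- all assumed, low bits first
    ("offset",  0,  6),  # byte within a 64B cache line
    ("channel", 6,  1),  # which of 2 channels
    ("rank",    7,  1),  # which of 2 ranks
    ("bank",    8,  2),  # which of 4 banks
    ("column", 10, 10),
    ("row",    20, 12),
]

def decode(addr):
    """Split addr into the assumed DRAM coordinates."""
    return {name: (addr >> shift) & ((1 << width) - 1)
            for name, shift, width in FIELDS}

print(decode(0x1f344000))  # the example address from the slides
```

Flipping a single address bit then changes exactly one coordinate, which is what the disclosure experiment later exploits.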
Agenda
• A Scalable Software Router
• Performance of a Commodity Server
• Memory Translation Disclosure
– Experiment Design
– Experiment Result
• Understanding the Imbalance
• Possible Solutions
• Conclusion
A Scalable Software Router
• A Valiant Load-Balanced Mesh
• Aggregate Throughput: N x R (bps)
[Figure: N-node full mesh; each node terminates an external line of rate R, and each internal mesh link carries 2R/N]
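The rates follow from Valiant load balancing: each of the N nodes terminates an external line of rate R, and two-hop routing spreads that traffic evenly, so every internal link only needs rate 2R/N while the mesh forwards N x R in aggregate. A quick arithmetic check (the concrete numbers below are illustrative, not from the talk):

```python
def valiant_rates(n_nodes, line_rate):
    """Aggregate throughput and required internal-link rate
    for an n-node Valiant load-balanced full mesh."""
    aggregate = n_nodes * line_rate      # N x R total forwarding capacity
    internal = 2 * line_rate / n_nodes   # each mesh link carries 2R/N
    return aggregate, internal

# e.g. 8 nodes with 10Gbps external lines (illustrative numbers)
agg, link = valiant_rates(8, 10)
print(agg, link)  # 80 Gbps aggregate, 2.5 Gbps per internal link
```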
Performance of A Commodity PC
• Experiment Environment
– 2 Xeon 1.6GHz sockets, 4 cores/socket
– Each pair of cores shares an 8MB L2 cache
– 1GHz FSB, 8GB DDR2 667MHz
– 2 memory controllers manage 4 channels
– 4 quad-port 1Gbps NICs (16 ports)
– Click 1.6.0 on Linux 2.6.19
• Simple “Point-to-Point” Forwarding
• A Chipset Monitoring Tool (Emon)
Performance of A Commodity PC
• Maximum Loss-free Forwarding Rate
– 16Gbps input
Performance of A Commodity PC
• Memory Load Distribution
• My work is to dig further
– Explain the imbalance
– But we don’t know how an address is mapped :(
[Chart: per-channel memory load for the stream benchmark and for 1024B and 64B packet forwarding]
Disclose Address Translation
• What Do We Want?
– Which bits select the channel, rank, bank, …
– What parallelism really gives us
• What Do We Have?
– Emon: tells us throughput and load distribution
• What Do We Need?
– Enough traffic to one single memory location
– Enough traffic to two memory locations,
e.g., 0x1f344000 and 0x1f34100
Disclose Address Translation
• Artificial Memory Access Patterns
– One writing flow to ADDR1
– Two writing flows: one to ADDR1,
the other to ADDR1+2^b (b = 0, 1, …, 31)
• Utilize the Cache
– Cache coherency protocol (MESI)
– Bind two threads to two cores that don’t share an L2,
and force them to keep writing to one location
– A write to an invalid cache line goes directly to memory
– Two threads generate one writing flow
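Sweeping b over every bit position yields 32 probe pairs; if flipping bit b leaves throughput unchanged, both flows hit the same channel/rank/bank, and if throughput doubles, bit b selects some parallel resource. A sketch of the pair generation (ADDR1 is the example address from the slides; the sweep is the experiment design above):

```python
ADDR1 = 0x1f344000  # example base address from the slides

def probe_pairs(base, max_bit=32):
    """(base, base + 2^b) for b = 0..max_bit-1: flipping exactly one
    address bit per experiment isolates what that bit selects."""
    return [(base, base + (1 << b)) for b in range(max_bit)]

for a, b in probe_pairs(ADDR1)[:3]:
    print(hex(a), hex(b))
```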
Disclose Address Translation
• Experiment Result
– ADDR1, ADDR1+2^b
Understand the Imbalance
• Memory Management
– Pre-allocated 2KB socket buffers
– Reclaimed & reallocated by the kernel
• A limited number of buffers serves all packets.
• A 2KB buffer spans the entire rank-bank grid.
• Large Packets (1024B)
– Cover at least half of the grid (high parallelism)
• Small Packets (64B)
– Hit only some elements w.h.p. (poor parallelism)
• In the real world, it is even worse.
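The imbalance can be sketched with a toy model: assume a 2KB buffer is striped across 8 rank-bank elements at 256B granularity (both numbers are assumptions for illustration, not measured values). Packets land at the start of a buffer, so packet length alone decides how many elements receive traffic:

```python
GRID_ELEMENTS = 8   # assumed rank x bank elements a 2KB buffer spans
STRIDE = 256        # assumed interleave granularity: 2KB / 8
CACHE_LINE = 64

def elements_touched(pkt_bytes):
    """Grid elements hit when a packet of pkt_bytes lands at offset 0."""
    return {(off // STRIDE) % GRID_ELEMENTS
            for off in range(0, pkt_bytes, CACHE_LINE)}

print(len(elements_touched(1024)))  # large packets: half the grid
print(len(elements_touched(64)))    # small packets: a single element
```

Under this model, 64B packets pile all their traffic onto one element while 1024B packets spread across four, which matches the shape of the observed imbalance.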
Understand the Imbalance
[Figure: memory pool buffers and the mapped rank-bank grid; a 1024B packet covers most grid elements, a 64B packet only the first]
What Can We Do?
• Hack the Network Adapter Driver
– Introduce a random offset
[Figure: random start offsets spread packets from the memory pool across the mapped grid]
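Under the same toy model as before, the driver hack amounts to shifting each buffer's start by a random multiple of the interleave stride, so small packets no longer all land on element 0 (GRID_ELEMENTS and STRIDE are the assumed toy values, not real hardware parameters):

```python
import random

GRID_ELEMENTS = 8   # assumed toy rank x bank grid, as before
STRIDE = 256        # assumed interleave granularity

def element_of(addr_offset):
    """Grid element a given byte offset maps to in the toy model."""
    return (addr_offset // STRIDE) % GRID_ELEMENTS

def random_offset():
    """Driver-introduced random start offset, stride-aligned."""
    return random.randrange(GRID_ELEMENTS) * STRIDE

random.seed(1)
hits = {element_of(random_offset()) for _ in range(100)}
print(sorted(hits))  # 64B packets now spread over many grid elements
```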
What Can We Do?
• Hack the Slab Allocator and kmalloc()
– Maintain a special slab
– Provide access through kmalloc()
[Figure: the special slab rotates each buffer's starting element, spreading packets across the mapped grid]
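The special slab can achieve the same spreading deterministically: each successive allocation starts one grid element further along. A sketch of that rotation idea in the toy model (pure illustration, not real slab-allocator code):

```python
GRID_ELEMENTS = 8   # assumed toy grid
STRIDE = 256        # assumed interleave granularity

class RotatingSlab:
    """Toy allocator: each alloc starts at the next grid element, so
    consecutive buffers direct small packets to different elements."""
    def __init__(self):
        self.next = 0

    def alloc_offset(self):
        off = self.next * STRIDE
        self.next = (self.next + 1) % GRID_ELEMENTS
        return off

slab = RotatingSlab()
starts = [slab.alloc_offset() // STRIDE for _ in range(8)]
print(starts)  # 0..7: every grid element used in turn
```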
What Can We Do?
• Maintain buffers of various sizes
– NIC supports multiple descriptor rings
– A hardware feature
[Figure: size-specific buffer pools served by multiple descriptor rings, spread across the mapped grid]
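With multiple descriptor rings, the NIC can draw from one buffer pool per size class, so a 64B packet consumes a small buffer instead of the front of a 2KB one. A sketch of the size-class pick (the class boundaries below are assumptions for illustration):

```python
# Assumed buffer size classes, one per NIC descriptor ring
SIZE_CLASSES = [128, 512, 2048]

def ring_for(pkt_bytes):
    """Pick the smallest buffer class (ring index) that fits the packet."""
    for i, size in enumerate(SIZE_CLASSES):
        if pkt_bytes <= size:
            return i
    raise ValueError("packet larger than the biggest buffer")

print(ring_for(64), ring_for(1024), ring_for(1514))
```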
Conclusion
• Figured out the memory address translation scheme
• Explained the memory load imbalance
• Proposed several possible solutions