
University of Colorado at Boulder

Core Research Lab

FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue

Tipp Moseley and Manish Vachharajani

University of Colorado at Boulder

2008.02.21

John Giacomoni

Why Pipelines?

• Multicore systems are the future

• Many apps can be pipelined if the granularity is fine enough

– Stage granularity of roughly 1 µs or less

– ≈ 3.5× the cost of an interrupt handler

Fine-Grain Pipelining Examples

• Network processing:
  – Intrusion detection (NID)
  – Traffic filtering (e.g., P2P filtering)
  – Traffic shaping (e.g., packet prioritization)

Network Processing Scenarios

Link       Rate (Mbps)   Frames/s (fps)   ns/frame
T-1        1.5           2,941            340,000
T-3        45.0          90,909           11,000
OC-3       155.0         333,333          3,000
OC-12      622.0         1,219,512        820
GigE       1,000.0       1,488,095        672
OC-48      2,500.0       5,000,000        200
10 GigE    10,000.0      14,925,373       67
OC-192     9,500.0       19,697,843       51
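
The ns/frame column is the reciprocal of the frame rate: ns/frame = 10^9 / fps. As a rough sketch, the C helpers below reproduce the GigE row. They assume minimum-size Ethernet frames that occupy 84 bytes on the wire (64-byte frame plus 20 bytes of preamble and inter-frame gap); the slide does not state the frame size, so that figure is an assumption, and the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Frames per second on a link, given its bit rate and the number of
 * bytes each frame occupies on the wire (assumed, not from the slide). */
uint64_t frames_per_second(uint64_t link_bps, uint64_t wire_bytes_per_frame)
{
    assert(wire_bytes_per_frame > 0);
    return link_bps / (wire_bytes_per_frame * 8);
}

/* Per-frame time budget in nanoseconds, rounded to the nearest integer. */
uint64_t ns_per_frame(uint64_t fps)
{
    assert(fps > 0);
    return (1000000000ULL + fps / 2) / fps;
}
```

Calling `frames_per_second(1000000000ULL, 84)` and feeding the result to `ns_per_frame` gives the GigE frame rate and budget listed in the table. Every pipeline stage has to finish its work, including queue operations, within that budget.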

Core-Placements

4x4 NUMA Organization (ex: AMD Opteron Barcelona)

[Figure: placements of pipeline stages (IP, Dec, APP, Enc, OP) onto cores]

Example: 3 Stage Pipeline

Communication Overhead

Locks: 320 ns

Lamport: 160 ns

Hardware: 10 ns

FastForward: 28 ns

GigE (chart label)
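
To put these numbers in context, the sketch below expresses each overhead as a share of the 672 ns GigE frame budget from the Network Processing Scenarios table. It assumes a single queue operation is charged per frame, which is a simplification (a middle stage performs both a dequeue and an enqueue); the function name is hypothetical.

```c
#include <assert.h>

/* Share of a per-frame time budget consumed by one queue operation,
 * in parts per thousand (integer math, rounded to nearest). */
unsigned overhead_permille(unsigned op_ns, unsigned frame_budget_ns)
{
    assert(frame_budget_ns > 0);
    return (op_ns * 1000u + frame_budget_ns / 2) / frame_budget_ns;
}
```

Under that assumption, locks consume roughly 48% of the GigE budget, Lamport's queue roughly 24%, FastForward roughly 4%, and the hardware bound roughly 1.5%.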

More Fine-Grain Pipelining Examples

• Network processing:
  – Intrusion detection (NID)
  – Traffic filtering (e.g., P2P filtering)
  – Traffic shaping (e.g., packet prioritization)

• Signal Processing
  – Media transcoding/encoding/decoding
  – Software Defined Radios

• Encryption
  – Counter-Mode AES

• Other Domains
  – Fine-grain kernels extracted from sequential applications

FastForward

• Cache-optimized point-to-point CLF (concurrent lock-free) queue
  1. Fast
  2. Robust against unbalanced stages
  3. Hides die-die communication
  4. Works with strong to weak memory consistency models

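Lamport’s CLF Queue (1)

lamp_enqueue(data) {
    NH = NEXT(head);            /* index the producer will advance to */
    while (NH == tail) {}       /* spin while the queue is full */
    buf[head] = data;
    head = NH;                  /* publish the new entry */
}

lamp_dequeue(*data) {
    while (head == tail) {}     /* spin while the queue is empty */
    *data = buf[tail];
    tail = NEXT(tail);          /* hand the slot back to the producer */
}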
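
The slide pseudocode omits declarations and memory ordering. A minimal compilable sketch in C11 might look like the following. Several things here are assumptions rather than the authors' code: the struct layout, the `LAMP_SIZE` capacity, the `lamp_init` helper, the acquire/release orderings, and the non-blocking `try` variants, which return `false` where the slide versions would spin.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define LAMP_SIZE 8                        /* hypothetical capacity */
#define LAMP_NEXT(i) (((i) + 1) % LAMP_SIZE)

static_assert(LAMP_SIZE >= 2, "one slot always stays empty");

struct lamp_queue {
    _Atomic size_t head;                   /* written only by the producer */
    _Atomic size_t tail;                   /* written only by the consumer */
    void *buf[LAMP_SIZE];
};

void lamp_init(struct lamp_queue *q)
{
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side: reads the consumer's tail to detect a full queue. */
bool lamp_try_enqueue(struct lamp_queue *q, void *data)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t nh = LAMP_NEXT(head);
    if (nh == atomic_load_explicit(&q->tail, memory_order_acquire))
        return false;                      /* full */
    q->buf[head] = data;
    atomic_store_explicit(&q->head, nh, memory_order_release);
    return true;
}

/* Consumer side: reads the producer's head to detect an empty queue. */
bool lamp_try_dequeue(struct lamp_queue *q, void **data)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    if (tail == atomic_load_explicit(&q->head, memory_order_acquire))
        return false;                      /* empty */
    *data = q->buf[tail];
    atomic_store_explicit(&q->tail, LAMP_NEXT(tail), memory_order_release);
    return true;
}
```

Note that each side reads the index the other side writes. That shared access to `head` and `tail` is the source of the coherence traffic the deck goes on to measure.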

AMD Opteron Cache Example

[Figure: cache diagram; the only surviving label is "M"]

Lamport’s CLF Queue (2)

lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {}
    buf[head] = data;
    head = NH;
}

[Figure: queue layout showing head, tail, and buf[0] ... buf[n]]

Observe the mandatory cacheline ping-ponging for each enqueue and dequeue operation.

Lamport’s CLF Queue (3)

lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {}
    buf[head] = data;
    head = NH;
}

[Figure: queue layout with head and tail shown separately, and buf[0] ... buf[n]]

Observe how cachelines will still ping-pong. What if the head/tail comparison were eliminated?

FastForward CLF Queue (1)

lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {}       /* full test reads the consumer's tail */
    buf[head] = data;
    head = NH;
}

ff_enqueue(data) {
    while (0 != buf[head]);     /* full test reads only the slot; 0 means empty */
    buf[head] = data;
    head = NEXT(head);
}
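
A minimal C11 sketch of the FastForward idea follows. The slide shows only `ff_enqueue`; the dequeue side here is an assumed mirror image that clears the slot back to NULL after reading it. The struct layout, `FF_SIZE`, `ff_init`, the memory orderings, and the non-blocking `try` variants are likewise assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FF_SIZE 8                          /* hypothetical capacity */
#define FF_NEXT(i) (((i) + 1) % FF_SIZE)

struct ff_queue {
    size_t head;                           /* private to the producer */
    size_t tail;                           /* private to the consumer */
    _Atomic(void *) buf[FF_SIZE];          /* NULL marks an empty slot */
};

void ff_init(struct ff_queue *q)
{
    q->head = 0;
    q->tail = 0;
    for (size_t i = 0; i < FF_SIZE; i++)
        atomic_init(&q->buf[i], NULL);
}

/* Producer: the full test looks only at the slot, never at tail. */
bool ff_try_enqueue(struct ff_queue *q, void *data)
{
    assert(data != NULL);                  /* NULL is reserved as "empty" */
    if (atomic_load_explicit(&q->buf[q->head], memory_order_acquire) != NULL)
        return false;                      /* slot still occupied: full */
    atomic_store_explicit(&q->buf[q->head], data, memory_order_release);
    q->head = FF_NEXT(q->head);
    return true;
}

/* Consumer: the empty test looks only at the slot, never at head. */
bool ff_try_dequeue(struct ff_queue *q, void **data)
{
    void *item = atomic_load_explicit(&q->buf[q->tail], memory_order_acquire);
    if (item == NULL)
        return false;                      /* nothing published here yet */
    atomic_store_explicit(&q->buf[q->tail], NULL, memory_order_release);
    q->tail = FF_NEXT(q->tail);
    *data = item;
    return true;
}
```

Because `head` and `tail` are each touched by only one thread, they need not be atomic in this sketch; in a real deployment they would presumably also be placed on separate cachelines so that neither index is shared.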

FastForward CLF Queue (2)

ff_enqueue(data) {
    while(0 != buf[head]);
    buf[head] = data;
    head = NEXT(head);
}

[Figure: queue layout showing head, tail, and buf[0] ... buf[n]]

Observe how head/tail cachelines will NOT ping-pong. BUT, “buf” will still cause the cachelines to ping-pong.

FastForward CLF Queue (3)

ff_enqueue(data) {
    while(0 != buf[head]);
    buf[head] = data;
    head = NEXT(head);
}

[Figure: queue layout showing head, tail, and buf[0] ... buf[n]]

Solution: Temporally slip stages by a cacheline. N:1 reduction in coherence misses per stage.
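
One way to read the N:1 claim is that N is the number of queue slots sharing a cacheline: if producer and consumer always work on different lines, a line changes hands once per N entries instead of once per entry. The sketch below works through that arithmetic under an idealized model. The 64-byte cacheline and 8-byte slot used in the usage note are assumptions, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of queue slots that share one cacheline. */
size_t slots_per_cacheline(size_t cacheline_bytes, size_t slot_bytes)
{
    assert(slot_bytes > 0 && cacheline_bytes % slot_bytes == 0);
    return cacheline_bytes / slot_bytes;
}

/* Idealized count of cacheline transfers needed to move `items` entries
 * when the stages never touch the same line at the same time:
 * one transfer per line rather than one per entry. */
size_t misses_with_slip(size_t items, size_t slots_per_line)
{
    assert(slots_per_line > 0);
    return (items + slots_per_line - 1) / slots_per_line;
}
```

With 64-byte lines and 8-byte pointer slots, `slots_per_cacheline(64, 8)` is 8, so moving 1024 entries would cost about 128 line transfers in this model instead of 1024.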

Slip Timing

Slip Timing Lost

Maintaining Slip (Concepts)

• Use distance as the quality metric
  – Explicitly compare head/tail
  – Causes cache ping-ponging
  – Perform rarely

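Maintaining Slip (Method)

/* DANGER and OK are distance thresholds between producer and consumer. */
adjust_slip() {
    dist = distance(producer, consumer);
    if (dist < DANGER) {                    /* stages have drifted too close */
        dist_old = 0;
        do {
            dist_old = dist;
            spin_wait(avg_stage_time * (OK - dist));    /* stall so the gap can grow */
            dist = distance(producer, consumer);
        } while (dist < OK && dist > dist_old);         /* stop at OK or when the gap stops growing */
    }
}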
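
The two calculations inside `adjust_slip` can be sketched as standalone C helpers. This assumes that `distance` is the number of entries by which the producer's index leads the consumer's in a circular buffer, which the slide does not spell out. The function names and the threshold values in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Entries by which the producer index leads the consumer index
 * in a circular buffer of `size` slots (assumed definition). */
size_t queue_distance(size_t head, size_t tail, size_t size)
{
    assert(size > 0 && head < size && tail < size);
    return (head + size - tail) % size;
}

/* Stall time for one pass of the slip loop, in the same unit as
 * avg_stage_time; zero when the gap is already at or above `danger`. */
unsigned long slip_wait_time(size_t dist, size_t danger, size_t ok,
                             unsigned long avg_stage_time)
{
    assert(danger <= ok);
    if (dist >= danger)
        return 0;
    return avg_stage_time * (unsigned long)(ok - dist);
}
```

For example, with illustrative thresholds of 16 and 48 entries and an average stage time of 100 ns, a measured distance of 2 entries yields a stall of 100 * (48 - 2) = 4600 ns before the distance is sampled again.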

Comparative Performance

[Figure panels: Lamport, FastForward]

Thrashing and Auto-Balancing

[Figure panels: FastForward (Thrashing), FastForward (Balanced)]

Cache Verification

[Figure panels: FastForward (Thrashing), FastForward (Balanced)]

On/Off Die Communications

[Figure: cache diagram contrasting on-die communication with off-die communication]

On/Off-die Performance

[Figure panels: FastForward (On-Die), FastForward (Off-Die)]

Proven Property

• “In the program order of the consumer, the consumer dequeues values in the same order that they were enqueued in the producer's program order.”

Work in Progress

• Operating Systems
  – 27.5 ns/op
    • 3.1% cost reduction vs. reported 28.5 ns
  – Reduced jitter

• Applications
  – 128-bit AES encrypting filter
    • Ethernet layer encryption at 1.45 Mfps (million frames per second)
    • IP layer encryption at 1.51 Mfps
    • ~10 lines of code for each

Gazing into the Crystal Ball

Locks: 320 ns

Lamport: 160 ns

Hardware: 10 ns

FastForward: 28 ns

GigE (chart label)

Shared Memory Accelerated Queues

Now Available!

http://ce.colorado.edu/core
