Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms
Yuanyuan Sun, Feiteng Yang
Source: ACM Symposium on Parallel Algorithms and Architectures (Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures)
Authors: Timothy Furtak, José Nelson Amaral, Robert Niewiadomski
Outline
Introduction
Sorting network
Sorting algorithms
Experimental evaluation
Contributions
Introduction
Use SIMD resources to improve the performance of sorting algorithms for short sequences.
Initial inspiration: the need for fast sorting of short sequences in the implementation of graphics rendering for interactive video games.
SIMD machinery
Introduction
SIMD machinery:
x86-64's SSE2 (Streaming SIMD Extensions 2)
G5's AltiVec
AltiVec and SSE2 are SIMD instruction sets; both feature 128-bit vector registers.
Sorting network
A sorting network is a comparator network that produces a sorted output for any possible input sequence.
COMP(a, b): the inputs are two storage units a and b (memory locations, registers, or vector-register elements), each containing a numerical input.
Sorting network
Size: the total number of comparators in the network.
Depth: the length of the critical path in its dependence graph.
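To make the two measures concrete, here is a minimal Python sketch that computes them for a network given as an ordered list of comparators. The list-of-index-pairs representation, the function names, and the sample five-comparator network for four inputs are assumptions made for illustration, not taken from the slides.

```python
def network_size(network):
    """Size: total number of comparators in the network."""
    return len(network)


def network_depth(network, num_inputs):
    """Depth: length of the longest chain of dependent comparators."""
    ready = [0] * num_inputs  # step at which each input line was last written
    for i, j in network:
        step = max(ready[i], ready[j]) + 1  # must wait for both inputs
        ready[i] = ready[j] = step
    return max(ready, default=0)


# Hypothetical example: five comparators on input lines 0..3.
EXAMPLE = [(0, 2), (1, 3), (0, 1), (2, 3), (1, 2)]
print(network_size(EXAMPLE), network_depth(EXAMPLE, 4))
```

Comparators that share no input line get the same step number, which is why the depth can be smaller than the size.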
Sorting network
A comparator moves the larger value to the left and the smaller value to the right.
For instance, in Figure 1 (size = 5, depth = 3):
Inputs: a = 7, b = 2, c = 5, d = 9
Output: a = 9, b = 7, c = 5, d = 2
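The example can be replayed in a few lines of Python. Figure 1 did not survive in this transcript, so the comparator list below is an assumed five-comparator network for four inputs; the names `comp`, `NETWORK`, and `run_network` are likewise illustrative.

```python
def comp(values, left, right):
    """Comparator: the larger value goes to `left`, the smaller to `right`."""
    x, y = values[left], values[right]
    values[left], values[right] = max(x, y), min(x, y)


# Assumed network (the original figure is not available).
NETWORK = [("a", "c"), ("b", "d"), ("a", "b"), ("c", "d"), ("b", "c")]


def run_network(inputs):
    values = dict(inputs)  # work on a copy of the inputs
    for left, right in NETWORK:
        comp(values, left, right)
    return values


print(run_network({"a": 7, "b": 2, "c": 5, "d": 9}))
# {'a': 9, 'b': 7, 'c': 5, 'd': 2}
```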
Supporting hardware for Sorting Network
Min and max instructions:
min(a, b) = a if a ≤ b, b otherwise
max(a, b) = a if a ≥ b, b otherwise
The comparator required by a sorting network is easily constructed using these two operations, a copy instruction, and a temporary variable.
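A sketch of that construction, written in Python for readability: the three statements in `comparator` mirror a copy into a temporary, a max, and a min. The register list and the function name are assumptions for illustration; the larger value is kept in the first operand, following the left/right convention used in these slides.

```python
def comparator(regs, a, b):
    """In-place compare-exchange built from a copy, a max, and a min."""
    tmp = regs[a]                    # copy a into the temporary
    regs[a] = max(regs[a], regs[b])  # a <- max(a, b)
    regs[b] = min(tmp, regs[b])      # b <- min(old a, b)


regs = [3.0, 8.0]
comparator(regs, 0, 1)
print(regs)  # [8.0, 3.0]
```

The temporary is needed because the max overwrites `a` before the min has read its old value.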
Supporting hardware for Sorting Network
The x86-64 architecture supports the SSE2 min and max operations, which return the minimum (maximum) of packed single-precision floating-point values.
Supporting hardware for Sorting Network
Width: the number of vectors being sorted.
x86-64 has 16 XMM vector registers, and each register can hold 4 single-precision floating-point values.
Sorting the values in n XMM registers using a sorting network produces 4 sorted streams of data of length n.
1 ≤ n < 16, because one register must be reserved as temporary storage for the swap of values.
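The following Python sketch emulates this layout with plain lists: each inner list stands in for one XMM register holding four floats, and `vmin`/`vmax` stand in for the packed min and max instructions. The helper names, the four-register network, and the ascending order are assumptions for illustration; real code would use the SIMD instructions directly.

```python
def vmin(x, y):
    """Lane-wise minimum of two emulated 4-lane registers."""
    return [min(a, b) for a, b in zip(x, y)]


def vmax(x, y):
    """Lane-wise maximum of two emulated 4-lane registers."""
    return [max(a, b) for a, b in zip(x, y)]


# Assumed five-comparator network for n = 4 registers.
NETWORK_4 = [(0, 2), (1, 3), (0, 1), (2, 3), (1, 2)]


def simd_sort(regs):
    """Sort 4 emulated registers so that every lane (column) is ascending."""
    regs = [list(r) for r in regs]
    for i, j in NETWORK_4:
        tmp = regs[i]                  # the reserved temporary register
        regs[i] = vmin(regs[i], regs[j])
        regs[j] = vmax(tmp, regs[j])
    return regs


regs = simd_sort([[9, 4, 7, 1], [3, 8, 2, 6], [5, 0, 11, 10], [12, 15, 13, 14]])
streams = [list(col) for col in zip(*regs)]  # four sorted streams of length n
print(streams)
```

Each comparator handles four independent pairs at once, one per lane, which is where the four sorted streams come from.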
Three sorting methods
Two-pass sorting with insertion sort
Two-pass sorting with merge sort
One-pass sorting (register sorting)
Two-pass sorting
In the first phase, the SIMD registers and instructions are used to generate a partially sorted output.
In the second phase, a standard sorting algorithm finishes the sorting; insertion sort and mergesort are investigated in this paper.
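A minimal Python sketch of the two phases, with lists standing in for vector registers. The first phase sorts each lane across four emulated registers with an assumed five-comparator network, and the second phase runs an ordinary insertion sort over the values in memory order. All names, the 16-element input size, and the ascending order are assumptions for illustration.

```python
# Assumed five-comparator network for four registers.
NETWORK_4 = [(0, 2), (1, 3), (0, 1), (2, 3), (1, 2)]


def phase_one(values):
    """Partially sort 16 values: load 4 emulated registers, sort lane-wise."""
    regs = [values[k:k + 4] for k in range(0, 16, 4)]
    for i, j in NETWORK_4:
        lo = [min(x, y) for x, y in zip(regs[i], regs[j])]
        hi = [max(x, y) for x, y in zip(regs[i], regs[j])]
        regs[i], regs[j] = lo, hi
    return [x for reg in regs for x in reg]  # store registers back in order


def insertion_sort(values):
    """Standard insertion sort, used here as the finishing pass."""
    out = list(values)
    for k in range(1, len(out)):
        key = out[k]
        pos = k - 1
        while pos >= 0 and out[pos] > key:
            out[pos + 1] = out[pos]
            pos -= 1
        out[pos + 1] = key
    return out


def two_pass_sort(values):
    return insertion_sort(phase_one(values))


data = [9, 4, 7, 1, 3, 8, 2, 6, 5, 0, 11, 10, 12, 15, 13, 14]
print(two_pass_sort(data))
```

After the first phase every fourth element is already in order, so the finishing pass starts from an input with fewer out-of-order pairs than the original.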
First phase: SIMD sort
Vector registers:
A1 B1 C1 D1
A2 B2 C2 D2
……
An Bn Cn Dn
After SIMD sort: each of the four columns (A, B, C, D) is a sorted stream of length n.
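Second phase
Insertion sort
Register layout after the first phase:
A1 A2 A3 A4
A5 A6 A7 A8
A9 A10 A11 A12
Sorted columns: A1<A5<A9, A2<A6<A10, A3<A7<A11, A4<A8<A12
Merge sort
Register layout after the first phase:
A1 A4 A7 A10
A2 A5 A8 A11
A3 A6 A9 A12
Sorted columns: A1<A2<A3, A4<A5<A6, A7<A8<A9, A10<A11<A12
Memory layout: A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12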
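For the merge-sort variant, the sketch below assumes the first phase has already left each lane sorted, transposes the emulated registers so that every sorted stream becomes contiguous, and merges the four runs. Python's `heapq.merge` stands in for the paper's merge step, which these slides do not show; the function name and sample values are illustrative.

```python
import heapq


def merge_phase(regs):
    """Merge the sorted lanes of the emulated registers into one sorted list."""
    runs = [list(col) for col in zip(*regs)]  # one contiguous run per lane
    return list(heapq.merge(*runs))


# Layout from the slide with Ai = i: every column is already ascending.
regs = [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
print(merge_phase(regs))
```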
One-pass sorting (register sorting)
Algorithm input
Initial state
Align a set of comparators
Write values back to memory
4-element example
P1 = {comp(a,c), comp(b,d)}
P2 = {comp(a,b), comp(c,d)}
P3 = {comp(b,c)}
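The three phases can be sketched in Python as follows. Comparators within a phase touch disjoint elements, so their left inputs can be gathered into one vector and their right inputs into another, and a single lane-wise max/min pair then executes the whole phase. The gather-by-name step is a simplification assumed here; on real hardware the alignment is done with data-movement instructions inside the registers. All names are illustrative, and the larger value goes to the left operand, matching the comparator convention stated earlier in these slides.

```python
PHASES = [
    [("a", "c"), ("b", "d")],  # P1
    [("a", "b"), ("c", "d")],  # P2
    [("b", "c")],              # P3
]


def run_phase(values, phase):
    """Execute all comparators of one phase with one lane-wise max and min."""
    left = [values[l] for l, _ in phase]    # aligned left inputs
    right = [values[r] for _, r in phase]   # aligned right inputs
    hi = [max(x, y) for x, y in zip(left, right)]
    lo = [min(x, y) for x, y in zip(left, right)]
    for (l, r), h, s in zip(phase, hi, lo):
        values[l], values[r] = h, s         # scatter results back


def register_sort(a, b, c, d):
    values = {"a": a, "b": b, "c": c, "d": d}
    for phase in PHASES:
        run_phase(values, phase)
    return [values[k] for k in "abcd"]


print(register_sort(7, 2, 5, 9))  # [9, 7, 5, 2]
```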
One concrete example
SSE2 instructions used
The method is also applied to sorting key-pointer pairs and to d-heaps.
Evaluation
Contributions
Effective use of SIMD resources improves the performance of sorting short sequences through a reduction in memory references and an increase in ILP.
Contributions
1. Three algorithms that use the SIMD machinery for efficient in-register sorting of short sequences.
2. A method that uses iterative-deepening search to find fast instruction sequences to move data within the SIMD registers.
3. An extensive experimental study indicating that the elimination of loads, stores, and branches correlates well with improved performance.