SIMD Optimization in COINS Compiler Infrastructure
Mitsugu Suzuki (The University of Electro-Communications)
Nobuhisa Fujinami (Sony Computer Entertainment Inc.)
Agenda
- COINS SIMD optimization
- Two topics on SIMD optimization
    - Data Size Inference
    - SIMD Benchmark
- Current status and required improvements
SIMD optimization‥‥ Concept and decision
- Implemented as an LIR-to-LIR transformer.
- Requires no additional special extensions for source languages.
- Source-level optimizable matters are postponed.
    → HIR-level matter, e.g. vectorization (appropriate loop unrolling), if-peeling, complex if-conversion, etc.
#define AVE(x,y) (((x)>>1)+((y)>>1)+(((x)|(y))&1))
short *v1, *v2, *v3;
/* Assume that all pointers are aligned, and distances of source and
   destination pointers are longer than the size of vector register. */

for (i = 0; i < M; i++)              // case-A
    *v1++ = AVE(*v2++, *v3++);

for (i = 0; i < M; i++)              // case-B
    v1[i] = AVE(v2[i], v3[i]);

for (i = 0; i < M; i += 4) {         // case-C
    v1[i]   = AVE(v2[i],   v3[i]);
    v1[i+1] = AVE(v2[i+1], v3[i+1]);
    ...
    v1[i+3] = AVE(v2[i+3], v3[i+3]);
}

for (i = 0; i < M; i += 4) {         // case-D
    v1[0] = AVE(v2[0], v3[0]);
    v1[1] = AVE(v2[1], v3[1]);
    ...
    v1[3] = AVE(v2[3], v3[3]);
    v1+=4; v2+=4; v3+=4;
}
×
○
#define AVE(x,y) (((x)>>1)+((y)>>1)+(((x)|(y))&1))
struct { short r, g, b, a; } *u1, *u2, *u3;
/* Assume that all pointers are aligned, and distances of source and
   destination pointers are longer than the size of vector register. */

for (i = 0; i < M; i++) {            // case-E
    u1[i].r = AVE(u2[i].r, u3[i].r);
    u1[i].g = AVE(u2[i].g, u3[i].g);
    u1[i].b = AVE(u2[i].b, u3[i].b);
    u1[i].a = AVE(u2[i].a, u3[i].a);
}
○
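For orientation, cases A to D all compute the same element-wise average of two short arrays. The sketch below pairs a scalar loop in the style of case-B with a hand-written SIMD counterpart. It is not COINS output: the use of x86 SSE2 intrinsics, the function names, and the requirement that `m` be a multiple of 8 are all assumptions made for illustration. It also relies on `>>` being an arithmetic shift for negative values, which the C standard leaves implementation-defined.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target */

#define AVE(x,y) (((x)>>1)+((y)>>1)+(((x)|(y))&1))

/* Scalar reference, written like case-B. */
void ave_scalar(short *v1, const short *v2, const short *v3, int m)
{
    for (int i = 0; i < m; i++)
        v1[i] = (short)AVE(v2[i], v3[i]);
}

/* Hand-written SIMD counterpart: 8 shorts per iteration.
   Assumes m is a multiple of 8 and that the arrays do not overlap. */
void ave_sse2(short *v1, const short *v2, const short *v3, int m)
{
    const __m128i one = _mm_set1_epi16(1);
    for (int i = 0; i < m; i += 8) {
        __m128i x = _mm_loadu_si128((const __m128i *)(v2 + i));
        __m128i y = _mm_loadu_si128((const __m128i *)(v3 + i));
        /* (x>>1) + (y>>1), arithmetic shift in each 16-bit lane */
        __m128i half = _mm_add_epi16(_mm_srai_epi16(x, 1),
                                     _mm_srai_epi16(y, 1));
        /* (x|y) & 1 */
        __m128i lsb = _mm_and_si128(_mm_or_si128(x, y), one);
        _mm_storeu_si128((__m128i *)(v1 + i), _mm_add_epi16(half, lsb));
    }
}

/* Returns 1 if both versions produce identical results on sample data. */
int ave_versions_match(void)
{
    enum { M = 64 };
    short a[M], b[M], r1[M], r2[M];
    for (int i = 0; i < M; i++) {
        a[i] = (short)(i * 1021 - 32000);   /* mixes negative and positive */
        b[i] = (short)(32000 - i * 733);
    }
    ave_scalar(r1, a, b, M);
    ave_sse2(r2, a, b, M);
    for (int i = 0; i < M; i++)
        if (r1[i] != r2[i])
            return 0;
    return 1;
}
```

Because no intermediate of this AVE form exceeds the width of its inputs, each 16-bit lane can be processed without widening.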
SIMD optimization‥‥ Processing flow
1. If-conversion.
2. Decompose basic blocks into DAGs.
3. Match LIR patterns to specific SIMD operations.
4. Combine same basic operations (parallelization).
(⇒ 3rd page of hand script)
Data size inference ‥‥ Why needed?
Two styles of averaging integers (assumption: both x and y are 8-bit unsigned integers):

#define AVE(x,y) (((x) + (y) + 1) >> 1)
    (x) + (y) + 1 needs 9 bits; the shifted result fits in 8 bits.
    ⇒ max 9 bits: zero-extension is needed (normal instruction oriented coding)

#define AVE(x,y) (((x)>>1) + ((y)>>1) + (((x)|(y))&1))
    (x)>>1 and (y)>>1 are 7 bits each; their sum, (x)|(y), and the result fit in 8 bits.
    ⇒ max 8 bits: no extension is needed (SIMD instruction oriented coding)

But the compiler must extend x and y to its integral type (typically 32 bits) ← integral promotion rule
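The difference between the two styles can be checked directly. The following is a minimal sketch; the names `AVE_WIDE`, `AVE_NARROW`, and the helper functions are introduced here for illustration and do not come from the slides. It confirms that both styles agree for all 8-bit unsigned inputs, and shows what happens when each style is forced into 8-bit arithmetic, as an 8-bit SIMD lane would do.

```c
#include <stdint.h>

/* Style 1: the sum needs a 9-bit intermediate. */
#define AVE_WIDE(x,y)   (((x) + (y) + 1) >> 1)
/* Style 2: every intermediate fits in 8 bits. */
#define AVE_NARROW(x,y) (((x)>>1) + ((y)>>1) + (((x)|(y))&1))

/* Returns 1 if both styles agree for every pair of 8-bit unsigned inputs
   when evaluated in ordinary (promoted) integer arithmetic. */
int ave_styles_agree(void)
{
    for (unsigned x = 0; x <= 255; x++)
        for (unsigned y = 0; y <= 255; y++)
            if (AVE_WIDE(x, y) != AVE_NARROW(x, y))
                return 0;
    return 1;
}

/* Style 2 with every intermediate truncated to 8 bits: still exact. */
uint8_t ave_narrow_8bit(uint8_t x, uint8_t y)
{
    uint8_t hx  = (uint8_t)(x >> 1);
    uint8_t hy  = (uint8_t)(y >> 1);
    uint8_t lsb = (uint8_t)((x | y) & 1);
    return (uint8_t)(hx + hy + lsb);
}

/* Style 1 with the sum truncated to 8 bits: the 9th bit is lost. */
uint8_t ave_wide_8bit(uint8_t x, uint8_t y)
{
    uint8_t sum = (uint8_t)(x + y + 1);
    return (uint8_t)(sum >> 1);
}
```

For x = y = 255, `ave_narrow_8bit` returns 255 while `ave_wide_8bit` returns 127, which is why the first style cannot be mapped onto 8-bit lanes without extension.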
Data size inference‥‥ Method
1. Get the value range for each node.
2. Get the altering bits from the value range.
3. Get the meaningful bits for each node, given the ones from its upper node.
Value ranges and required bits are obtained from their respective inference rules.
Patterns of the meaningful bits are matched during instruction selection.
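[Figure: LIR trees for the two assignments, each node annotated with its inferred value range]

*a = (*b + *c + 1) >> 1;

    SET
      MEM:I8 (*a)
      CONVIT:I8
        RSHU                        0..255
          ADD                       0..511
            ADD                     0..510
              CONVZX (MEM:I8 *b)    0..255
              CONVZX (MEM:I8 *c)    0..255
            CONST 1                 1..1
          CONST 1                   1..1

*a = ((*b>>1) + (*c>>1) + ((*b | *c) & 1));

    SET
      MEM:I8 (*a)
      CONVIT:I8
        ADD                         0..255
          ADD                       0..254
            RSHU                    0..127
              CONVZX (MEM:I8 *b)    0..255
              CONST 1               1..1
            RSHU                    0..127
              CONVZX (MEM:I8 *c)    0..255
              CONST 1               1..1
          BAND                      0..1
            BOR                     0..255
              CONVZX (MEM:I8 *b)    (shared node)
              CONVZX (MEM:I8 *c)    (shared node)
            CONST 1                 1..1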
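The value ranges annotated on the two trees can be reproduced with a small interval calculation. The sketch below covers only step 1 (value ranges) plus a plain bit-width count; it does not model altering or meaningful bits. The slides do not spell out the COINS inference rules, so the rules, the `Range` type, and all function names here are assumptions. With these rules the lower bound of the `+ 1` node comes out as 1 rather than 0, which does not affect the bit width.

```c
/* Closed interval [lo, hi] of the unsigned values a node may take. */
typedef struct { unsigned lo, hi; } Range;

Range rng(unsigned lo, unsigned hi) { Range r = { lo, hi }; return r; }

static unsigned umin(unsigned a, unsigned b) { return a < b ? a : b; }
static unsigned umax(unsigned a, unsigned b) { return a > b ? a : b; }

/* Number of bits needed to represent v (0 needs 0 bits). */
unsigned bits_for(unsigned v)
{
    unsigned n = 0;
    while (v) { n++; v >>= 1; }
    return n;
}

/* Assumed inference rules for the operators in the two trees. */
Range r_add(Range a, Range b)     { return rng(a.lo + b.lo, a.hi + b.hi); }
Range r_rshu(Range a, unsigned n) { return rng(a.lo >> n, a.hi >> n); }
Range r_band(Range a, Range b)    { return rng(0, umin(a.hi, b.hi)); }
Range r_bor(Range a, Range b)
{
    /* OR cannot set a bit above the top bit of the wider operand. */
    unsigned w = bits_for(umax(a.hi, b.hi));
    unsigned hi = (w >= 32) ? ~0u : (1u << w) - 1u;
    return rng(umax(a.lo, b.lo), hi);
}

/* Widest node of *a = (*b + *c + 1) >> 1 */
unsigned max_bits_wide(void)
{
    Range b = rng(0, 255), c = rng(0, 255);
    Range s = r_add(b, c);            /* 0..510 */
    Range t = r_add(s, rng(1, 1));    /* 1..511 */
    Range q = r_rshu(t, 1);           /* 0..255 */
    return umax(umax(bits_for(s.hi), bits_for(t.hi)), bits_for(q.hi));
}

/* Widest node of *a = (*b>>1) + (*c>>1) + ((*b | *c) & 1) */
unsigned max_bits_narrow(void)
{
    Range b = rng(0, 255), c = rng(0, 255);
    Range hb = r_rshu(b, 1);                  /* 0..127 */
    Range hc = r_rshu(c, 1);                  /* 0..127 */
    Range s  = r_add(hb, hc);                 /* 0..254 */
    Range m  = r_band(r_bor(b, c), rng(1, 1)); /* 0..1 */
    Range t  = r_add(s, m);                   /* 0..255 */
    return umax(umax(bits_for(s.hi), bits_for(m.hi)),
                umax(bits_for(t.hi), bits_for(b.hi)));
}
```

`max_bits_wide()` yields 9 and `max_bits_narrow()` yields 8, matching the "max 9 bits" and "max 8 bits" conclusions for the two averaging styles.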
[Figure: the LIR trees of the two averaging assignments, with each node annotated with its value range and its bit width. The tree for (*b + *c + 1) >> 1 contains nodes that need 9 bits; every node of the tree for the shift-first form fits in 8 bits.]
SIMD Benchmark‥‥ Why needed?
- Existing benchmarks are not suited for tuning SIMD optimization.
    - SIMD-optimizable patterns are buried among non-SIMD-optimizable ones.
    - Existing code is far from a SIMD-optimizable form (without hole-in-one matching).
- Step-wise milestones for SIMD optimization were required.
SIMD Benchmark‥‥ Design
- SIMD-optimizable code patterns were extracted from real media processing applications.
- Multiple versions were crafted by hand for each code pattern so as to cover a wide range, from an easily SIMD-optimized level to the original.
    - Versions are classified by SIMD optimization technique.
    - Execution times are reported for each version.
Original:

    int16_t acLevel = data[i];
    if (acLevel < 0) {
        acLevel = (-acLevel) - quant_d_2;
        if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
        acLevel = (acLevel * mult) >> SCALEBITS;
        sum += acLevel;
        coeff[i] = -acLevel;
    } else {
        acLevel = acLevel - quant_d_2;
        if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
        acLevel = (acLevel * mult) >> SCALEBITS;
        sum += acLevel;
        coeff[i] = acLevel;
    }

If-peeled (and loop-unrolled / not):

    acLevel = ((data[i] < 0) ? -data[i] : data[i]) - quant_d_2;
    acLevel2 = (acLevel * mult) >> SCALEBITS;
    sum += ((acLevel < quant_m_2) ? 0 : acLevel2);
    coeff[i] = ((acLevel < quant_m_2) ? 0 : ((data[i] < 0) ? -acLevel2 : acLevel2));
If-converted (and loop-unrolled / not), from the same original code:

    acMsk1 = (int)data[i] >> 31;
    acLevel = ((data[i] & ~acMsk1) | ((-data[i]) & acMsk1)) - quant_d_2;
    acMsk2 = (acLevel < quant_m_2) ? 0 : 0xffff;
    acLevel = (acLevel * mult) >> SCALEBITS;
    sum += acMsk2 & acLevel;
    coeff[i] = acMsk2 & (((-acLevel) & acMsk1) | (acLevel & (~acMsk1)));
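The three versions can be checked against each other. The sketch below wraps each one in a function and compares results over a range of inputs. The parameter values (`QUANT_D_2`, `QUANT_M_2`, `MULT`, `SCALEBITS`), the function names, and the input range of -2048..2047 are hypothetical. The if-converted version additionally assumes a 32-bit `int`, arithmetic right shift of negative values, and wrap-around conversion to `int16_t`, all of which are implementation-defined in C.

```c
#include <stdint.h>

/* Hypothetical quantizer parameters, chosen only for illustration. */
enum { SCALEBITS = 16, QUANT_D_2 = 2, QUANT_M_2 = 8, MULT = 8192 };

/* Original: nested branches. */
int quant_original(const int16_t *data, int16_t *coeff, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int16_t acLevel = data[i];
        if (acLevel < 0) {
            acLevel = (int16_t)((-acLevel) - QUANT_D_2);
            if (acLevel < QUANT_M_2) { coeff[i] = 0; continue; }
            acLevel = (int16_t)((acLevel * MULT) >> SCALEBITS);
            sum += acLevel;
            coeff[i] = (int16_t)-acLevel;
        } else {
            acLevel = (int16_t)(acLevel - QUANT_D_2);
            if (acLevel < QUANT_M_2) { coeff[i] = 0; continue; }
            acLevel = (int16_t)((acLevel * MULT) >> SCALEBITS);
            sum += acLevel;
            coeff[i] = acLevel;
        }
    }
    return sum;
}

/* If-peeled: branches rewritten as conditional expressions. */
int quant_peeled(const int16_t *data, int16_t *coeff, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int acLevel  = ((data[i] < 0) ? -data[i] : data[i]) - QUANT_D_2;
        int acLevel2 = (acLevel * MULT) >> SCALEBITS;
        sum += (acLevel < QUANT_M_2) ? 0 : acLevel2;
        coeff[i] = (int16_t)((acLevel < QUANT_M_2) ? 0
                   : ((data[i] < 0) ? -acLevel2 : acLevel2));
    }
    return sum;
}

/* If-converted: conditions turned into bit masks. */
int quant_ifconv(const int16_t *data, int16_t *coeff, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int acMsk1  = (int)data[i] >> 31;   /* all ones if negative, else 0 */
        int acLevel = ((data[i] & ~acMsk1) | ((-data[i]) & acMsk1)) - QUANT_D_2;
        int acMsk2  = (acLevel < QUANT_M_2) ? 0 : 0xffff;
        acLevel = (acLevel * MULT) >> SCALEBITS;
        sum += acMsk2 & acLevel;
        coeff[i] = (int16_t)(acMsk2 & (((-acLevel) & acMsk1) | (acLevel & ~acMsk1)));
    }
    return sum;
}

/* Returns 1 if all three versions agree for inputs -2048..2047. */
int quant_versions_agree(void)
{
    enum { N = 4096 };
    static int16_t data[N], c1[N], c2[N], c3[N];
    for (int i = 0; i < N; i++)
        data[i] = (int16_t)(i - 2048);
    int s1 = quant_original(data, c1, N);
    int s2 = quant_peeled(data, c2, N);
    int s3 = quant_ifconv(data, c3, N);
    if (s1 != s2 || s1 != s3)
        return 0;
    for (int i = 0; i < N; i++)
        if (c1[i] != c2[i] || c1[i] != c3[i])
            return 0;
    return 1;
}
```

The if-converted loop body contains no branches apart from the comparison that builds `acMsk2`, which is the form that lends itself to matching against SIMD compare and mask operations.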
Current status and required improvements
The backbone of the SIMD optimization has been implemented.
The following are musts:
- Enrichment of templates for specific SIMD operations.
- Isolation of the machine-dependent and machine-independent parts of the SIMD optimization.
- A recovery method for failures in SIMD operation matching.
- Alignment and overlap checks for pointers.
⇒ will be solved in the next release