SIMD Optimization in COINS Compiler Infrastructure

Mitsugu Suzuki (The University of Electro-Communications)
Nobuhisa Fujinami (Sony Computer Entertainment Inc.)

Agenda

- COINS SIMD optimization
- Two topics on SIMD optimization
  - Data Size Inference
  - SIMD Benchmark
- Current status and required improvements

SIMD optimization‥‥ Concept and decision

- Implemented as an LIR-to-LIR transformer.
- Requires no additional special extensions to the source languages.
- Matters that can be optimized at the source level are postponed.
  → HIR-level matters, e.g. vectorization (appropriate loop unrolling), if-peeling, complex if-conversion, etc.

    #define AVE(x,y) (((x)>>1)+((y)>>1)+(((x)|(y))&1))
    short *v1, *v2, *v3;
    /* Assume that all pointers are aligned, and distances of source and
       destination pointers are longer than the size of vector register. */

    for (i = 0; i < M; i++)             // case-A
        *v1++ = AVE(*v2++, *v3++);

    for (i = 0; i < M; i++)             // case-B
        v1[i] = AVE(v2[i], v3[i]);

    for (i = 0; i < M; i += 4) {        // case-C
        v1[i]   = AVE(v2[i],   v3[i]);
        v1[i+1] = AVE(v2[i+1], v3[i+1]);
        ...
        v1[i+3] = AVE(v2[i+3], v3[i+3]);
    }

    for (i = 0; i < M; i += 4) {        // case-D
        v1[0] = AVE(v2[0], v3[0]);
        v1[1] = AVE(v2[1], v3[1]);
        ...
        v1[3] = AVE(v2[3], v3[3]);
        v1 += 4; v2 += 4; v3 += 4;
    }

×

○

    #define AVE(x,y) (((x)>>1)+((y)>>1)+(((x)|(y))&1))
    struct { short r, g, b, a; } *u1, *u2, *u3;
    /* Assume that all pointers are aligned, and distances of source and
       destination pointers are longer than the size of vector register. */

    for (i = 0; i < M; i++) {           // case-E
        u1[i].r = AVE(u2[i].r, u3[i].r);
        u1[i].g = AVE(u2[i].g, u3[i].g);
        u1[i].b = AVE(u2[i].b, u3[i].b);
        u1[i].a = AVE(u2[i].a, u3[i].a);
    }

○

SIMD optimization‥‥ Processing flow

1. If-conversion.
2. Decompose basic blocks into DAGs.
3. Match LIR patterns to specific SIMD operations.
4. Combine same basic operations (parallelization).

(⇒ see the 3rd page of the handout)

Data size inference ‥‥ Why needed?

Two styles of averaging integers
(assumption: both x and y are given as 8-bit unsigned integers):

    #define AVE(x,y) (((x) + (y) + 1) >> 1)
    Intermediate widths: sum 9 bits, result 8 bits
    ⇒ max 9 bits: zero-extension is needed (normal-instruction-oriented coding)

    #define AVE(x,y) (((x)>>1) + ((y)>>1) + (((x)|(y))&1))
    Intermediate widths: each shift 7 bits, OR 8 bits, sums 8 bits
    ⇒ max 8 bits: no extension is needed (SIMD-instruction-oriented coding)

But the compiler must extend x and y to its integral type (typically 32 bits)
← integral promotion rule

Data size inference‥‥ Method

1. Get the value range for each node.
2. Get the altering bits from the value range.
3. Get the meaningful bits for each node, using the meaningful bits given by its upper node.

Value ranges and required bits are obtained from their inference rules.
Patterns of the meaningful bits are matched during instruction selection.

[Figure: LIR trees for the two averaging styles, annotated step by step with the inferred value range and the meaningful bit width of each node. Only the final annotations are summarized here.]

*a = (*b + *c + 1) >> 1;

    SET, MEM:I8 (destination), CONVIT:I8
    RSHU (... >> 1)                    0..255
    ADD (... + 1)                      0..511   (9 bits)
    ADD (*b + *c)                      0..510   (9 bits)
    CONVZX of MEM:I8 (*b, *c)          0..255 each
    CONST 1                            1..1

*a = (*b>>1) + (*c>>1) + ((*b | *c) & 1);

    SET, MEM:I8 (destination), CONVIT:I8
    ADD (final sum)                    0..255
    ADD ((*b>>1) + (*c>>1))            0..254
    RSHU (*b>>1, *c>>1)                0..127 each
    BAND (... & 1)                     0..1
    BOR (*b | *c)                      0..255
    CONVZX of MEM:I8 (*b, *c)          0..255 each
    CONST 1                            1..1
    (no node needs more than 8 bits)




SIMD Benchmark‥‥ Why needed?

Existing benchmarks are not suited for tuning SIMD optimization:
- SIMD-optimizable patterns are buried among non-SIMD-optimizable ones.
- Existing code is far from SIMD-optimizable (without hole-in-one matching).

Step-wise milestones for SIMD optimization were required.

SIMD Benchmark‥‥ Design

- SIMD-optimizable code patterns were extracted from real media processing applications.
- Multiple versions of each code pattern were crafted by hand so as to cover a wide range, from an easily SIMD-optimized level to the original.
- Versions are classified by SIMD optimization technique.
- Execution times are reported for each version.

Original:

    int16_t acLevel = data[i];
    if (acLevel < 0) {
        acLevel = (-acLevel) - quant_d_2;
        if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
        acLevel = (acLevel * mult) >> SCALEBITS;
        sum += acLevel;
        coeff[i] = -acLevel;
    } else {
        acLevel = acLevel - quant_d_2;
        if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
        acLevel = (acLevel * mult) >> SCALEBITS;
        sum += acLevel;
        coeff[i] = acLevel;
    }

If-peeled (and loop-unrolled / not):

    acLevel = ((data[i] < 0) ? -data[i] : data[i]) - quant_d_2;
    acLevel2 = (acLevel * mult) >> SCALEBITS;
    sum += ((acLevel < quant_m_2) ? 0 : acLevel2);
    coeff[i] = ((acLevel < quant_m_2) ? 0 : ((data[i] < 0) ? -acLevel2 : acLevel2));

If-converted (and loop-unrolled / not), derived from the same original code:

    acMsk1 = (int)data[i] >> 31;
    acLevel = ((data[i] & ~acMsk1) | ((-data[i]) & acMsk1)) - quant_d_2;
    acMsk2 = (acLevel < quant_m_2) ? 0 : 0xffff;
    acLevel = (acLevel * mult) >> SCALEBITS;
    sum += acMsk2 & acLevel;
    coeff[i] = acMsk2 & (((-acLevel) & acMsk1) | (acLevel & (~acMsk1)));

Current status and required improvements

The skeleton of SIMD optimization has been implemented.
The following are MUSTs:
- Enrichment of templates for specific SIMD operations.
- Isolation of the machine-dependent and machine-independent parts of SIMD optimization.
- A recovery method for failures in SIMD operation matching.
- Alignment and overlap checks for pointers.

⇒ will be solved in the next release
