AISC: Approximate Instruction Set Computer

Page 1

LTAI
AISC: Approximate Instruction Set Computer
Alexandra Ferrerón, Darío Suárez-Gracia, Jesús Alastruey-Benedé (University of Zaragoza)
Ulya R. Karpuzcu (University of Minnesota, Twin Cities)
WAX'18, March 25, 2018


Page 2

Instruction Set Architecture (ISA)

• View from hardware stack:
  • Behavioral design specification
  • Determines hardware complexity

• View from software stack:
  • Definition of the machine capabilities
  • Determines functional completeness

Page 3

Approximate Instruction Set Computer (AISC)

• Single-ISA heterogeneous fabric
• Each compute engine (core or fixed function) supports a subset of the ISA
• Component (functionally-incomplete) ISAs may overlap
• An incomplete ISA can reduce microarchitectural complexity
  • thereby improving performance per Watt
• The union of incomplete ISA subsets renders a functionally complete ISA

Page 4

Approximate Instruction Set Computer (AISC)

• Code that cannot be approximated can run at full accuracy
• Improved energy efficiency if ISA-induced approximation is tolerable

Page 5

Approximate Instruction Set Computer (AISC)

• How to determine incomplete, approximate ISA subsets?
• Vertical approximation: exclude less frequently used complex instructions
• Horizontal approximation: simplify each instruction, e.g., by precision reduction
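Precision reduction of this kind can be sketched as discarding low-order mantissa bits of an IEEE 754 double while leaving sign and exponent intact. This is a minimal illustration; the helper name is ours, not from the talk:

```python
import struct

def discard_mantissa_bits(x: float, nbits: int) -> float:
    """Zero the nbits least-significant bits of the 52-bit mantissa of a
    64-bit IEEE 754 double; sign and exponent are kept intact."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]  # double -> raw bits
    bits &= ~((1 << nbits) - 1)                          # clear low mantissa bits
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

pi = 3.141592653589793
lossy = discard_mantissa_bits(pi, 48)  # keep only 4 of the 52 mantissa bits
```

Dropping 48 of the 52 mantissa bits leaves only a coarse approximation (3.125 for pi), while dropping 32 bits still agrees with the original value to about six decimal digits.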

Page 6

Proof-Of-Concept


WAX'18, March 25, 2018. Alexandra Ferrerón¹, Jesús Alastruey-Benedé¹, Darío Suárez-Gracia¹, Ulya R. Karpuzcu²
¹ Universidad de Zaragoza, Spain; ² University of Minnesota, Twin Cities


errors or simply enforced by design. The latter applies for the following discussion.

2 Proof-of-concept Implementation

Let us start with a motivating example. Fig. 1 shows how the (graphic) output of a typical noise-tolerant application, SRR¹, changes for representative Vertical, Horizontal, and Horizontal + Vertical approximation under AISC. The application is compiled with GCC 4.8.4 with -O1 on an Intel® Core™ i5 3210M machine. As we perform manual transformations on the code, high optimization levels hinder the task; we resort to -O1 for our proof-of-concept and leave a deeper exploration of compiler optimizations for future work. We focus on the main kernel where the actual computation takes place, and conservatively assume that this entire code would be mapped to a compute engine with an approximated ISA. We use ACCURAX metrics [1] to quantify the accuracy loss. We prototype basic Horizontal and Vertical ISA approximations on Pin 2.14 [6]. Fig. 1(a) captures the output for the baseline for comparison, Native execution, which excludes any approximation. We observe that the accuracy loss remains barely visible, but still varies across different approximations. Let us next take a closer look at the sources of this diversity.

(a) Native (b) Vertical (c) Horizontal (d) Horiz.+Vert.

Figure 1. Graphic output of the SRR benchmark under representative AISC approximations (b)-(d).

2.1 Vertical Approximation

The key question is how to pick the instructions to drop. A more general version of this question, namely which instructions to approximate under AISC, already applies across all dimensions, but it becomes more critical in this case. As the most aggressive technique in our bag of tricks, Vertical can incur significant loss in accuracy. The targeted recognition-mining-synthesis applications can tolerate errors in data-centric phases as opposed to control [3]. Therefore, confining instruction dropping to data-flow can help limit the incurred accuracy loss. Fig. 1(b) captures an example execution outcome, where we randomly deleted static (arithmetic) floating-point instructions. For each static instruction, we based the dropping decision on a pre-defined threshold t: we generated a random number r in the range [0, 1], and dropped the static instruction if r remained below t. We experimented with threshold values between 1% and 10%.
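The per-instruction dropping rule described above can be sketched as follows. This is a minimal sketch with names of our own choosing; the actual prototype is a Pin tool operating on x86 binaries:

```python
import random

def drop_instruction(t: float, rng: random.Random) -> bool:
    """Return True if a static instruction should be dropped:
    draw r uniformly from [0, 1] and drop when r < t."""
    return rng.uniform(0.0, 1.0) < t

# Decide once per static FP arithmetic instruction in the kernel,
# e.g. with the paper's threshold range t = 1%..10%.
rng = random.Random(2018)  # fixed seed: a repeatable selection
dropped = [drop_instruction(0.10, rng) for _ in range(100)]
```

Because the decision is made per static instruction, every dynamic instance of a dropped instruction disappears from the execution, which is what makes Vertical the most aggressive of the three dimensions.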

2.2 Horizontal Approximation

Without loss of generality, we experimented with three approximations to reduce operand widths: DPtoSP, DP(SP)toHP,

¹ Super Resolution Reconstruction, a computer vision application from the Cortex suite [12]. We use the (64×64) "EIA" input data set of 16 frames. The output is the (256×256) reconstructed image.

and DP(SP)toINT. Under the IEEE 754 standard, 32 (64) bits express a single (double) precision floating-point number: one bit specifies the sign; 8 (11) bits, the exponent; and 23 (52) bits the mantissa, i.e., the fraction. For example, (−1)^sign × 2^(exponent−127) × 1.mantissa represents a single-precision floating-point number. DPtoSP is a bit-discarding variant, which omits the 32 least-significant bits of the mantissa of each double-precision operand of an instruction, and keeps the exponent intact. DP(SP)toHP comes in two flavors. DPtoHP omits the 48 least-significant bits of the mantissa of each double-precision operand of an instruction, and keeps the exponent intact; SPtoHP, the 16 least-significant bits of the mantissa of each single-precision operand. Fig. 1(c) captures an example execution outcome under DPtoHP. DP(SP)toINT also comes in two flavors. DPtoINT (SPtoINT) replaces double (single) precision instructions with their integer counterparts, by rounding the floating-point operand values to the closest integer.

2.3 Horizontal + Vertical Approximation

Without loss of generality, we experimented with two representatives in this case: MULtoADD and DIVtoMUL. MULtoADD converts multiplication instructions to a sequence of additions. We picked the smaller of the factors as the multiplier (which determines the number of additions), and rounded floating-point multipliers to the closest integer. DIVtoMUL converts division instructions to multiplications. We first calculated the reciprocal of the divisor, which next gets multiplied by the dividend to render the end result. In our proof-of-concept implementation based on the x86 ISA, the reciprocal instruction has 12-bit precision. DIVtoMUL12 uses this instruction. DIVtoMUL.NR, on the other hand, relies on one iteration of the Newton-Raphson method [4] to increase the precision of the reciprocal to 23 bits.
DIVtoMUL12 can be regarded as an approximate version of DIVtoMUL.NR: it eliminates the Newton-Raphson iteration, and hence settles for a less accurate estimate of the reciprocal (of only 12-bit precision). Fig. 1(d) captures an example execution outcome under DIVtoMUL.NR.
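The Newton-Raphson refinement can be sketched numerically: given a coarse estimate x0 ≈ 1/d, one iteration x1 = x0 × (2 − d·x0) squares the relative error, turning a roughly 12-bit reciprocal into a roughly 24-bit one. In the sketch below, the 12-bit hardware estimate (e.g. x86 rcpss) is modeled by injecting a 2⁻¹² relative error; this modeling choice is ours:

```python
def div_to_mul(n: float, d: float) -> float:
    """Approximate n / d as n * (1/d), DIVtoMUL.NR-style:
    refine a ~12-bit reciprocal estimate with one Newton-Raphson step."""
    x0 = (1.0 / d) * (1.0 + 2.0**-12)  # stand-in for a 12-bit estimate
    x1 = x0 * (2.0 - d * x0)           # one NR step: relative error ~ 2**-24
    return n * x1

coarse = 10.0 * (1.0 / 3.0) * (1.0 + 2.0**-12)  # DIVtoMUL12-like accuracy
refined = div_to_mul(10.0, 3.0)                 # much closer to 10/3
```

With x0 = (1/d)(1 + e), the step yields x1 = (1/d)(1 − e²), so an initial relative error of 2⁻¹² shrinks to about 2⁻²⁴ after a single iteration.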

3 Conclusion & Discussion

Our proof-of-concept analysis revealed that, in its restricted form, where the region of interest of an application is mapped in its entirety to an incomplete-ISA compute engine, irrespective of potential changes in noise tolerance within the course of its execution, AISC can cut energy up to 37% at around 10% accuracy loss.

The most critical design aspect is how instruction sequences should be mapped to restricted-ISA compute engines, and how such sequences should be migrated from one engine to another within the course of execution, to track potential temporal changes in algorithmic noise tolerance. While fast code migration is not impossible, if not orchestrated carefully, the energy overhead of fine-grain migration can easily become prohibitive. Therefore, a break-even point in terms of migration frequency and granularity exists, past which AISC may degrade energy efficiency.



DPtoHP: double precision → half precision

Page 10: AISC - Electrical and Computer Engineeringpeople.ece.umn.edu/groups/altai/Publications_files/wax18_1.pdf · sentative AISC approximations (b)-(d). 2.1 Vertical Approximation The key

AISC

Proof-Of-Concept

10

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

WAX’18, March 25, 2018Alexandra Ferrerón1, Jesús Alastruey-Benedé1, Darío Suárez-Gracia1, Ulya R. Karpuzcu2

1 Universidad de Zaragoza, Spain 2 University of Minnesota, Twin Cities

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

errors or simply enforced by design. The latter applies forthe following discussion.

2 Proof-of-concept ImplementationLet us start with a motivating example. Fig. 1 shows how the(graphic) output of a typical noise-tolerant application, SRR 1,changes for representative Vertical, Horizontal, and Horizon-tal + Vertical approximation under AISC. The application iscompiled with GCC 4.8.4 with -O1 on an Intel® Core™ i53210M machine. As we perform manual transformations onthe code, high optimization levels hinder the task; we resortto -O1 for our proof-of-concept and leave for future workmore exploration on compiler optimizations. We focus onthe main kernel where the actual computation takes place,and conservatively assume that this entire code would bemapped to a compute engine with approximated ISA. Weuse ACCURAX metrics [1] to quantify the accuracy-loss. Weprototype basic Horizontal and Vertical ISA approximationson Pin 2.14 [6]. Fig. 1(a) captures the output for the base-line for comparison, Native execution, which excludes anyapproximation. We observe that the accuracy loss remainsbarely visible, but still varies across di�erent approximations.Let us next take a closer look at the sources of this diversity.

(a) Native (b) Vertical (c) Horizontal (d) Horiz.+Vert.

Figure 1. Graphic output of SRR benchmark under repre-sentative AISC approximations (b)-(d).2.1 Vertical ApproximationThe key question is how to pick the instructions to drop. Amore general version of this question, which instructions toapproximate under AISC, already applies across all dimen-sions, but the question becomes more critical in this case. Asthe most aggressive in our bag of tricks, Vertical can incursigni�cant loss in accuracy. The targeted recognition-mining-synthesis applications can tolerate errors in data-centricphases as opposed to control [3]. Therefore, con�ning in-struction dropping to data-�ow can help limit the incurredaccuracy loss. Fig. 1(b) captures an example execution out-come, where we randomly deleted static (arithmetic) �oatingpoint instructions. For each static instruction, we based thedropping decision on a pre-de�ned threshold t. We gener-ated a random number r in the range [0, 1], and droppedthe static instruction if r remains below t. We experimentedwith threshold values between 1% and 10%.

2.2 Horizontal Approximation

Without loss of generality, we experimented with three approximations to reduce operand widths: DPtoSP, DP(SP)toHP, and DP(SP)toINT. Under the IEEE 754 standard, 32 (64) bits express a single (double) precision floating point number: one bit specifies the sign; 8 (11) bits, the exponent; and 23 (52) bits, the mantissa, i.e., the fraction. For example, (−1)^sign × 2^(exponent−127) × 1.mantissa represents a single-precision floating point number. DPtoSP is a bit discarding variant, which omits the 32 least-significant bits of the mantissa of each double-precision operand of an instruction, and keeps the exponent intact. DP(SP)toHP comes in two flavors. DPtoHP omits the 48 least-significant bits of the mantissa of each double-precision operand of an instruction, and keeps the exponent intact; SPtoHP, the 16 least-significant bits of the mantissa of each single-precision operand of an instruction. Fig. 1(c) captures an example execution outcome under DPtoHP. DP(SP)toINT also comes in two flavors. DPtoINT (SPtoINT) replaces double (single) precision instructions with their integer counterparts, by rounding the floating point operand values to the closest integer.

2.3 Horizontal + Vertical Approximation

Without loss of generality, we experimented with two representatives in this case: MULtoADD and DIVtoMUL. MULtoADD converts multiplication instructions to a sequence of additions. We picked the smaller of the factors as the multiplier (which determines the number of additions), and rounded floating point multipliers to the closest integer number. DIVtoMUL converts division instructions to multiplications. We first calculated the reciprocal of the divisor, which next gets multiplied by the dividend to render the end result. In our proof-of-concept implementation based on the x86 ISA, the reciprocal instruction has 12-bit precision. DIVtoMUL12 uses this instruction. DIVtoMUL.NR, on the other hand, relies on one iteration of the Newton-Raphson method [4] to increase the precision of the reciprocal to 23 bits. DIVtoMUL12 can be regarded as an approximate version of DIVtoMUL.NR, eliminating the Newton-Raphson iteration, and hence enforcing a less accurate estimate of the reciprocal (of only 12-bit precision). Fig. 1(d) captures an example execution outcome under DIVtoMUL.NR.

¹Super Resolution Reconstruction, a computer vision application from the Cortex suite [12]. We use the (64×64) "EIA" input data set of 16 frames. The output is the (256×256) reconstructed image.
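The bit-discarding variants above amount to masking low mantissa bits while leaving sign and exponent alone. The sketch below illustrates DPtoHP on a single value (the function name is ours; the paper applies the transform per instruction operand inside Pin): clear the 48 least-significant bits of the 52-bit mantissa of an IEEE 754 double.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of DPtoHP bit discarding: zero the 48 least-significant
 * mantissa bits of an IEEE 754 double, keeping sign (bit 63) and
 * the 11 exponent bits (bits 52..62) intact. Only 4 mantissa bits
 * survive, so values snap to a coarse grid within each binade. */
static double dp_to_hp(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);      /* reinterpret as raw bits  */
    bits &= ~((UINT64_C(1) << 48) - 1);  /* clear mantissa bits 0..47 */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For example, pi (3.141592653589793) becomes 3.125 under this mask: the top four mantissa bits are kept and everything below is truncated toward zero in magnitude.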

3 Conclusion & Discussion

Our proof-of-concept analysis revealed that, in its restricted form – where the region of interest of an application is mapped in its entirety to an incomplete-ISA compute engine, irrespective of potential changes in noise tolerance within the course of its execution – AISC can cut energy up to 37% at around 10% accuracy loss.

The most critical design aspect is how instruction sequences should be mapped to restricted-ISA compute engines, and how such sequences should be migrated from one engine to another within the course of execution, to track potential temporal changes in algorithmic noise tolerance. While fast code migration is not impossible, if not orchestrated carefully, the energy overhead of fine-grain migration can easily become prohibitive. Therefore, a break-even point in terms of migration frequency and granularity exists, past which AISC may degrade energy efficiency.


DIV ➔ MUL
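The DIV ➔ MUL transform can be sketched as follows. This is a minimal illustration of the DIVtoMUL.NR idea from the paper, with names of our own choosing: y0 stands in for the ~12-bit reciprocal estimate the x86 reciprocal instruction would return, and one Newton-Raphson step y1 = y0·(2 − d·y0) roughly doubles its bits of precision before the division a/d is replaced by the multiplication a·y1.

```c
/* Sketch of DIVtoMUL.NR: refine a low-precision reciprocal estimate
 * y0 of 1/d with one Newton-Raphson iteration, then approximate the
 * division a/d as a multiplication by the refined reciprocal. */
static float div_to_mul_nr(float a, float d, float y0)
{
    float y1 = y0 * (2.0f - d * y0);  /* one Newton-Raphson step */
    return a * y1;                    /* a / d ≈ a * (1/d)        */
}
```

DIVtoMUL12 corresponds to skipping the refinement and multiplying by y0 directly, which is why it trades further accuracy for fewer operations.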

Page 11: AISC - Electrical and Computer Engineering (people.ece.umn.edu/groups/altai/Publications_files/wax18_1.pdf)

AISC

Proof-Of-Concept

11


WAX’18, March 25, 2018. Alexandra Ferrerón¹, Jesús Alastruey-Benedé¹, Darío Suárez-Gracia¹, Ulya R. Karpuzcu²

1 Universidad de Zaragoza, Spain 2 University of Minnesota, Twin Cities


errors or simply enforced by design. The latter applies for the following discussion.


Up to 37% energy cut at around 10% accuracy loss

Page 12: AISC - Electrical and Computer Engineering

AISC

Design Aspects

12

• Which subset of the ISA should each compute engine support?
• How to map instruction sequences to compute engines?
• How to keep the potential accuracy loss bounded?
• How to orchestrate migration of code sequences
  • from one compute engine to another within the course of computation
  • tolerance to noise may vary for different application phases

Page 13: AISC - Electrical and Computer Engineering

AISC

Design Aspects

13


• Most critical design aspect: migration granularity
• A break-even point exists for migration granularity (and frequency)

Page 14: AISC - Electrical and Computer Engineering

LTAI

AISC: Approximate Instruction Set Computer

Alexandra Ferrerón, Darío Suárez-Gracia, Jesús Alastruey-Benedé, Ulya R. Karpuzcu

[email protected]

WAX’18, March 25, 2018

Page 15: AISC - Electrical and Computer Engineering

AISC

Instruction Set Architecture (ISA)

15

Hardware

Software

Architecture

int main(void) { ... }

Fetch

Decode

DIV.D F0, F2, F4

ADD.D F10, F0, F8

SUB.D F8, F8, F14