Efficient VLSI architectures for baseband signal processing in wireless base-station receivers
Sridhar Rajagopal, Srikrishna Bhashyam, Joseph R. Cavallaro, and Behnaam Aazhang
This work is supported by Nokia, TI, TATP and NSF
Motivation
Computationally complex algorithms for base-stations
– multiple users, high data rates
– matrix inversions, floating point accuracy needed
– DSP solutions infeasible for real-time [S.Das’99]
Real-time implementations for baseband receiver?
– multiuser channel estimation
*S.Das et al., “Arithmetic Acceleration Techniques for Wireless Base-station Receivers”, Asilomar 1999
Contributions
New estimation scheme
– designed from an implementation perspective
– bit-streaming, fixed-point architecture
– reduced complexity, same error rate performance
Real-time architecture design
– exploit bit-level parallelism
– area-constrained, time-constrained
– real-time with minimum area
Baseband signal processing
[Figure: base-station receiver block diagram. Labels: Multiple Users, Antenna, Multiuser Channel Estimation (Training, Tracking), Multiuser Detection, Decoding, Information Bits.]
Channel estimation
[Figure: User 1 and User 2 transmitting to the Base Station over a Direct Path and a Reflected Path, with Noise + MAI.]
Estimates unknown fading amplitudes and asynchronous delays.
Need for multiuser channel estimation
Detector performance depends on estimation accuracy
Best estimator : Maximum Likelihood
=> jointly estimate parameters for all users
=> Multiuser channel estimation
Single-user sliding correlator used for implementation
Multiuser channel estimation algorithm
$b_i \in \{-1,+1\}^{2K}$ - Training/Tracking bits
$r_i \in \mathbb{C}^{N}$ - Received signal
N - Spreading gain (typically fixed, e.g. 32)
K - Number of users (variable, <= N)
$A \in \mathbb{C}^{2K \times N}$ - Maximum Likelihood channel estimate
$R_{bb} = \sum_{i=1}^{L} b_i b_i^T, \quad R_{bb} \in \mathbb{R}^{2K \times 2K}$
$R_{br} = \sum_{i=1}^{L} b_i r_i^H, \quad R_{br} \in \mathbb{C}^{2K \times N}$
$R_{bb} A = R_{br}$
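The direct estimate on this slide can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the noise-free signal model `r_i^H = b_i^T A_true`, the small sizes `K`, `N`, `L`, and all variable names are hypothetical and not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, L = 4, 6, 64  # hypothetical small sizes, chosen only for illustration

# Assumed noise-free signal model (not from the slides): r_i^H = b_i^T A_true.
A_true = rng.standard_normal((2 * K, N)) + 1j * rng.standard_normal((2 * K, N))
B = rng.choice([-1.0, 1.0], size=(L, 2 * K))  # row i holds b_i^T
RH = B @ A_true                                # row i holds r_i^H

R_bb = B.T @ B   # sum over i of b_i b_i^T, real 2K x 2K
R_br = B.T @ RH  # sum over i of b_i r_i^H, complex 2K x N

# Direct ML estimate: solve R_bb A = R_br for A.
A_ml = np.linalg.solve(R_bb, R_br)
print(np.allclose(A_ml, A_true))
```

Solving the linear system is the step that the iterative scheme later in the talk is designed to avoid.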
Outline
Background
Channel Estimation - An implementation perspective
VLSI architectures
– Area-constrained, Time-constrained, Area-Time efficient
DSP Comparisons and Conclusions
Iterative scheme for channel estimation
Bit-streaming, method of gradient descent
Stable convergence behavior with µ
Simple fixed-point architecture
$R_{bb}^{(i)} = R_{bb}^{(i-1)} + b_L b_L^T - b_0 b_0^T$
$R_{br}^{(i)} = R_{br}^{(i-1)} + b_L r_L^H - b_0 r_0^H$
$A^{(i)} = A^{(i-1)} - \mu \left( R_{bb}^{(i)} A^{(i-1)} - R_{br}^{(i)} \right)$
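A minimal floating-point sketch of the bit-streaming update, assuming a noise-free model `r^H = b^T A_true` so that convergence is easy to check. The step size `mu = 1/(4L)`, the problem sizes, and the function name `iterate` are hypothetical choices for illustration; the slides do not give a value for µ.

```python
from collections import deque
import numpy as np

def iterate(R_bb, R_br, A, b_new, r_new, b_old, r_old, mu):
    """One bit-streaming step: slide the window, then take one gradient step."""
    R_bb = R_bb + np.outer(b_new, b_new) - np.outer(b_old, b_old)
    R_br = R_br + np.outer(b_new, r_new.conj()) - np.outer(b_old, r_old.conj())
    A = A - mu * (R_bb @ A - R_br)
    return R_bb, R_br, A

rng = np.random.default_rng(1)
K, N, L = 4, 6, 64   # hypothetical sizes
mu = 1.0 / (4 * L)   # hypothetical step size, small enough for this example

A_true = rng.standard_normal((2 * K, N)) + 1j * rng.standard_normal((2 * K, N))
R_bb = np.zeros((2 * K, 2 * K))
R_br = np.zeros((2 * K, N), dtype=complex)
A = np.zeros((2 * K, N), dtype=complex)

window = deque()  # the last L (bits, received vector) pairs
for _ in range(2000):
    b_new = rng.choice([-1.0, 1.0], size=2 * K)
    r_new = (b_new @ A_true).conj()  # assumed model: r^H = b^T A_true
    if len(window) == L:
        b_old, r_old = window.popleft()
    else:  # window still filling: nothing to remove yet
        b_old, r_old = np.zeros(2 * K), np.zeros(N, dtype=complex)
    window.append((b_new, r_new))
    R_bb, R_br, A = iterate(R_bb, R_br, A, b_new, r_new, b_old, r_old, mu)

err = np.max(np.abs(A - A_true))
print(err < 1e-6)
```

Each step uses only outer products and one matrix product, with no matrix inversion.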
Simulations - Static multipath channel
[Figure: Comparison of Bit Error Rates (BER). x-axis: Signal to Noise Ratio (SNR), 4 to 12; y-axis: BER, 10^-3 to 10^-1. Curves: Iterative Channel Est., O(K^2 N); Original Channel Est., O(K^3 + K^2 N).]
SINR = 0 dB
Paths = 3
Training = 150 bits
Spreading N = 31
Users K = 15
Design specifications
32 Users (K)
32 spreading code length (N)
Target = 128 Kbps
– 4000 cycles available at 500 MHz
Single cycle addition/multiplication
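The cycle budget follows from dividing the clock rate by the target bit rate. A quick check, with hypothetical variable names:

```python
CLOCK_HZ = 500e6    # clock rate from the design specification
TARGET_BPS = 128e3  # target data rate from the design specification

# Cycles available to process one bit at the target rate.
cycles_per_bit = CLOCK_HZ / TARGET_BPS
print(cycles_per_bit)  # 3906.25
```

The slide rounds this to 4000 cycles.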
Task decomposition
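Correlation Matrices (Per Bit)
– inputs from the Tracking Window (length L): b0 (2K,1), r0 (N,8), bL (2K,1), rL (N,8)
– Rbb : O(2K^2, 8)
– Rbr : O(2KN, 8)
Iterate
– A : O(4K^2 N, 8)
Channel Estimate to Detector
[Figure: the tasks above laid out along a TIME axis.]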
Architecture design
XNOR gates, UP/DOWN counters
$R_{bb}^{(i)} = R_{bb}^{(i-1)} + b_L b_L^T - b_0 b_0^T$
8-bit adders
$R_{br}^{(i)} = R_{br}^{(i-1)} + b_L r_L^H - b_0 r_0^H$
8-bit multipliers [Schulte’93]
$A^{(i)} = A^{(i-1)} - \mu \left( R_{bb}^{(i)} A^{(i-1)} - R_{br}^{(i)} \right)$
* Schulte, Swartzlander, “Truncated Multiplication with Correction Constant”, Workshop on VLSI Signal Processing, 1993
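Why XNOR gates and up/down counters suffice for the Rbb update can be seen with a small pure-Python sketch. It assumes the logic encoding 0 for -1 and 1 for +1, which the slides do not state; the function names are hypothetical. Under that encoding, the product of two ±1 bits is +1 exactly when their XNOR is 1, so each matrix entry only ever counts up or down.

```python
def xnor(x, y):
    """XNOR of two logic bits (each 0 or 1)."""
    return 1 - (x ^ y)

def update_rbb(R_bb, new_bits, old_bits):
    """Add b_L b_L^T and subtract b_0 b_0^T using only XNOR and +/-1 counting."""
    n = len(new_bits)
    for j in range(n):
        for k in range(n):
            up = 1 if xnor(new_bits[j], new_bits[k]) else -1    # entry of b_L b_L^T
            down = 1 if xnor(old_bits[j], old_bits[k]) else -1  # entry of b_0 b_0^T
            R_bb[j][k] += up - down
    return R_bb

new_bits = [1, 0, 1, 1]  # assumed encoding: 0 means -1, 1 means +1
old_bits = [0, 0, 1, 0]
R_bb = update_rbb([[0] * 4 for _ in range(4)], new_bits, old_bits)

# Reference computed with signed arithmetic for comparison.
s_new = [2 * b - 1 for b in new_bits]
s_old = [2 * b - 1 for b in old_bits]
reference = [[s_new[j] * s_new[k] - s_old[j] * s_old[k] for k in range(4)]
             for j in range(4)]
print(R_bb == reference)
```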
Area-constrained : Min. area, not real-time
[Figure: area-constrained datapath. Blocks shown: MUXes selecting b0/bL and r0/rL, a U/D Counter for Rbb, Add/Sub units for Rbr, a MAC, two Subtract units, a >>8 shift, and Load/Store with MUX/DEMUX between A(i-1) and A(i), indexed by i and j. Labelled widths: 1, 8 and 16 bits. Output: Channel Estimate.]
Area-constrained : Hardware used
Blocks       Quantity       Full Adder Cells   Complex   Total
Counter      1*8            8                  -         8
Multiplier   1*8            64                 *2        128
Adders       3*8 + 2*16     56                 *2        112
Total Area: 248 FA cells
Total Time (N=K=32): 4K^2 N = 128,000 cycles
Time-constrained : Real time, large area
[Figure: time-constrained datapath. Blocks shown: MUXes selecting b0/bL and r0/rL, the products b0*b0T and bL*bLT, Rbb, Rbr, Mult, two Subtract units, a >> shift, and A. Labelled widths: 2K*1, K(2K-1)*1, 2K^2*8, N*8, 2KN*8, 2KN*16. Output: Channel Estimate.]
Time-constrained : Hardware used
Blocks       Quantity                          Full Adder Cells    Complex   Total
Counter      2K^2*8                            16K^2               -         16K^2
Multiplier   4K^2 N*8                          256K^2 N            *2        512K^2 N
Adders       2KN*16 + 2KN*8 + 4K^2 N*16        48KN + 64K^2 N      *2        96KN + 128K^2 N
Total Area (N=K=32): 20,000,000 FA cells
Total Time: log2(2K) = 6 cycles
Area-Time efficient architecture design
Area-constrained
– single 8-bit multiplier
– 4K^2 N cycles (128,000) [3.81 Kbps, 248 FA Cells]
Time-constrained
– 4K^2 N 8-bit multipliers
– log2(2K) cycles (6) [83.33 Mbps, 20,000,000 FA Cells]
Goal : real-time with minimum area
Different parallelism levels for multipliers
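The quoted data rates can be reproduced from the cycle counts, assuming the 500 MHz clock from the design specification and one bit processed per cycle count. The function name is hypothetical. Note that 4K^2 N evaluates to 131,072 for N = K = 32, which the slide quotes as 128,000.

```python
import math

CLOCK_HZ = 500e6  # clock rate from the design specification
K = N = 32

def data_rate_bps(cycles_per_bit):
    """Bits per second when each bit takes the given number of cycles."""
    return CLOCK_HZ / cycles_per_bit

area_cycles = 4 * K * K * N      # area-constrained: one multiplier, used serially
time_cycles = math.log2(2 * K)   # time-constrained: fully parallel

area_kbps = round(data_rate_bps(area_cycles) / 1e3, 2)
time_mbps = round(data_rate_bps(time_cycles) / 1e6, 2)
print(area_cycles, area_kbps, time_mbps)
```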
Area-Time efficient : Real-time, min. area
[Figure: area-time efficient datapath. Blocks shown: MUXes selecting b0/bL and r0/rL, the products b0*b0T and bL*bLT, Counters, Rbb, Rbr, Adder, Mult, two Subtract units, a >> shift, and Load/Store with MUX/DEMUX between A(i-1) and A(i). Labelled widths: 2K*1, 2K*8, N*8, 1*1, 1*8, 1*16. Output: Channel Estimate.]
Area-Time efficient : Hardware used
Blocks       Quantity                  Full Adder Cells   Complex   Total
Counter      2K*8                      16K                -         16K
Multiplier   2K*8                      128K               *2        256K
Adders       2K*16 + 2*8 + 1*16        32K + 32           *2        64K + 64
Total Area (N=K=32): 10,000 FA cells
Total Time: 2KN = 2,000 cycles
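The three hardware tables can be checked with a short calculation. The sketch below takes the per-block costs from the tables (8 full adder cells per 8-bit counter, 64 per 8-bit multiplier, one per adder bit, and a factor of 2 for complex multipliers and adders); the function name and argument layout are hypothetical.

```python
FA_PER_MULT8 = 64  # full adder cells per 8-bit multiplier, as in the tables

def fa_cells(counters8, mults8, adder_bits):
    """Total full adder cells: counters are real, multipliers and adders are complex (*2)."""
    return counters8 * 8 + 2 * mults8 * FA_PER_MULT8 + 2 * adder_bits

K = N = 32

area_constrained = fa_cells(1, 1, 3 * 8 + 2 * 16)
time_constrained = fa_cells(
    2 * K * K,
    4 * K * K * N,
    2 * K * N * 16 + 2 * K * N * 8 + 4 * K * K * N * 16,
)
area_time = fa_cells(2 * K, 2 * K, 2 * K * 16 + 2 * 8 + 1 * 16)

print(area_constrained, time_constrained, area_time)
```

The exact totals are 248, 21,086,208 and 10,816 cells, which the slides quote as 248, 20,000,000 and 10,000.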
DSP comparisons
Implementation   Clock Rate   Full Adder Cells   Data Rates
C67 DSP          166 MHz      -                  1.02 Kbps
Area             500 MHz      248                3.81 Kbps
:                :            :                  :
Area-Time        500 MHz      10^4               256 Kbps
:                :            :                  :
Time             500 MHz      2x10^7             83.33 Mbps
DSPs unable to exploit bit-level parallelism
– Inefficient storage of bits
– Unable to replace bit-multiplications by add/sub.
Scalability of architectures
Design for maximum number of users in the system
Fewer users
– turn off functional units to reduce power
– reconfigure hardware for higher data rates (FPGA)
Investigating K-user design using K/2-user designs.
Investigating DSP extensions
Conclusions
New estimation scheme
– designed from an implementation perspective
– bit-streaming, fixed-point architecture
– reduced complexity, same error rate performance
Real-time architecture designs
– exploit bit-level parallelism
– area-constrained, time-constrained
– real-time with minimum area
=> Real-time architectures for base-band signal processing