LPC Speech Coder on the TI C6x DSP
Mark Anderson, Jeff Burke
EE213A / EE298-2
Prof. Ingrid Verbauwhede
Summary
  Implementation platform: Texas Instruments TMS320C6000
  Low-quantity cost: US $35 (‘C6211)
  Architecture clock frequency: 150 MHz (‘C6211)
  Throughput: 75-80 channels @ 8000 samples/sec
Summary
  Total energy per sample: 1.8 uJ/sample
  ‘Area’
    1.2% of cycle budget per channel per frame
    8.5% of unified memory per channel
    25% of unified memory for algorithm
Summary
  Flexibility of implementation: high; programmable processor with C compiler, GUI debugger & simulator
SegSNR_A: ?
SegSNR_Q: 26 dB (voiced segments)
Architecture overview
  256-bit VLIW
  Two “clustered” data paths
  Four functional units in each data path
    16x16 multiplier
    Two ALUs
    Data addressing unit
  32-bit instruction for each functional unit
    (256-bit “instruction” for 8 functional units)
Data path diagram
Architecture overview
  Split register file
    Only two cross-paths exist
    Each cluster is limited to one source read from the opposite register file per cycle
  Data types
    8-, 16-, and 32-bit, with 40-bit accumulate
    40-bit = register pair
Memory architecture
  ‘C6211 (US $35) has a cache!
    4 kB L1 instruction cache (L1P)
    4 kB L1 data cache (L1D)
    64 kB L2 unified memory and/or cache
  Extra DMA channels
Memory architecture
Design Tools
  Command-line
    Compiler, debugger, simulator
  Code Composer Studio
    Same tools
    Windows NT GUI
    30-day “evaluation” license
    Draconian copy protection; pulls the rug out from under you
Design Flow
  Consolidate Matlab reference into a single function
  Matlab rewritten C-style
  Verified C-style Matlab
  C prototype created
  Imported into Code Composer, optimized & simulated
Fixed-point quantization
  Input samples
    16-bit, normalized to [-1,1)
    <1.15> format used
  Coefficient quantization
    Hamming window, pre-emphasis, FIR
    <1.15> format used
    No noticeable change in characteristics
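The <1.15> format above can be illustrated with a minimal C sketch. The function names `q15_from_double` and `q15_mul` are hypothetical, and the round-to-nearest and saturation choices are assumptions; the slides do not say how the original code rounded or handled overflow.

```c
#include <stdint.h>

/* Convert a real value in [-1, 1) to <1.15>, rounding to nearest and
   saturating at the ends of the 16-bit range (assumed behaviour). */
int16_t q15_from_double(double x)
{
    double scaled = x * 32768.0;
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;
    if (scaled > 32767.0)  return INT16_MAX;
    if (scaled < -32768.0) return INT16_MIN;
    return (int16_t)scaled;
}

/* Multiply two <1.15> values: the 16x16 product is <2.30>, which is
   rounded and shifted back to <1.15>. Relies on arithmetic right shift
   of negative values, which common compilers provide. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* <2.30> product */
    p = (p + (1 << 14)) >> 15;             /* round, rescale to <1.15> */
    if (p > INT16_MAX) p = INT16_MAX;      /* only (-1) * (-1) overflows */
    return (int16_t)p;
}
```

For example, 0.5 maps to 16384, and 0.5 x 0.5 comes back as 8192 (0.25).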
Fixed-point quantization
  Most values 16-bit
    Take advantage of fast 16x16 multipliers
    Remain close to other class implementations
  Add metric for overpowered LPC engine
    Use # of channels as performance metric
Fixed-point quantization
  Energy stored in <5.27>
    Prevent overflow, provide precision for low-energy segments
  Temporary values stored in <10.30>
    Take advantage of extended precision
  Modified autocorrelation uses <16.0>
    All whole numbers
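One way the <10.30> temporaries and the <5.27> energy could fit together is sketched below. This is an assumption, not the original code: `frame_energy_q27` is a hypothetical name, a 64-bit integer stands in for the 40-bit register pair, and saturation on overflow is a guess at how values of 16 or more were handled.

```c
#include <stdint.h>
#include <stddef.h>

/* Sum of squares of <1.15> samples, returned in <5.27>. */
int32_t frame_energy_q27(const int16_t *x, size_t n)
{
    int64_t acc = 0;   /* stands in for a 40-bit <10.30> accumulator */
    for (size_t i = 0; i < n; i++) {
        /* <1.15> * <1.15> gives a <2.30> product */
        acc += (int32_t)x[i] * (int32_t)x[i];
    }
    acc >>= 3;                              /* 30 fractional bits -> 27 */
    if (acc > INT32_MAX) acc = INT32_MAX;   /* assumed: saturate below 16.0 */
    return (int32_t)acc;
}
```

With 160 samples per 20 ms frame at 8000 samples/sec, the raw sum stays below 160, which fits the 10 integer bits of the temporary format; the 27 fractional bits of the result keep resolution for quiet frames.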
Fixed-Point SNR
  Matlab simulation of magnitude truncation
    Tools again.
  SegSNR_A = ?
  SegSNR_Q = 26 dB
    Voiced segments only
    Sent_female test data
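The slides do not define SegSNR. The usual segmental SNR definition is given here as an assumption, with M frames of N samples each, x_m the reference signal in frame m, and x-hat_m the signal under test (for SegSNR_Q, presumably the fixed-point result compared against the floating-point reference):

```latex
\mathrm{SegSNR} = \frac{10}{M} \sum_{m=0}^{M-1} \log_{10}
  \frac{\sum_{n=0}^{N-1} x_m^2(n)}
       {\sum_{n=0}^{N-1} \bigl(x_m(n) - \hat{x}_m(n)\bigr)^2}
  \quad \text{[dB]}
```

Under this reading, "voiced segments only" would mean the outer sum runs over voiced frames only.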
Performance results
  Initial version: 80,000 CPU cycles/frame
  Optimization
    Take advantage of VLIW, pipelining: observe assembly, modify C loops
    Use TI’s DSP Library: assembly advantage without assembly
  Optimized version: 30,182 cycles/frame
    Had to stop early; at least 5K cycles still wasted
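The slides do not show which C loops were modified. As a hedged illustration of the kind of change that helps a VLIW compiler, the sketch below rewrites a dot product with `restrict`-qualified pointers and two independent accumulators, so that the multiplier in each data path could be kept busy. Both function names are hypothetical, and 64-bit accumulators stand in for the 40-bit type.

```c
#include <stdint.h>
#include <stddef.h>

/* Straightforward version: one accumulator, one multiply per iteration. */
int64_t dot_q15_simple(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Unrolled by two with independent accumulators; restrict promises the
   compiler that the inputs do not alias, which eases loop scheduling. */
int64_t dot_q15_unrolled(const int16_t *restrict a,
                         const int16_t *restrict b, size_t n)
{
    int64_t even = 0, odd = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        even += (int32_t)a[i]     * (int32_t)b[i];
        odd  += (int32_t)a[i + 1] * (int32_t)b[i + 1];
    }
    if (i < n)                               /* leftover element, odd n */
        even += (int32_t)a[i] * (int32_t)b[i];
    return even + odd;
}
```

Both versions return the same sum; whether the second actually schedules better has to be checked in the generated assembly, as the slide suggests.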
Performance
  Then the tool license expired.
  The tool would not install on other machines.
  TI responded, but wasn’t too helpful.
  Moral #1: Avoid the evaluation version.
  Moral #2: Give tools away to sell hardware.
Cycle count details
  Routine                                % of total   Cycles/frame
  Windowing, pre-emphasis                   4.3           1285
  Energy calc                               0.8            254
  Autocorrelation in Levinson-Durbin        8.0           2421
  Autocorrelation in pitch detection       51            15334
  Algorithm total                          95            28561
  Total w/ housekeeping                                  30182
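Autocorrelation accounts for most of the cycles in the table above. A plain reference version of the kernel is sketched here; `autocorr_q15` is a hypothetical name, and this is the textbook form rather than the "modified autocorrelation" or the DSPLIB routine the project used.

```c
#include <stdint.h>
#include <stddef.h>

/* r[k] = sum over i = k .. n-1 of x[i] * x[i-k], for k = 0 .. max_lag.
   r must have room for max_lag + 1 values. */
void autocorr_q15(const int16_t *x, size_t n, int64_t *r, size_t max_lag)
{
    for (size_t k = 0; k <= max_lag; k++) {
        int64_t acc = 0;                 /* wide accumulator, no overflow */
        for (size_t i = k; i < n; i++)
            acc += (int32_t)x[i] * (int32_t)x[i - k];
        r[k] = acc;
    }
}
```

The work grows with the number of lags times the frame length. Levinson-Durbin needs only one lag more than the predictor order, while pitch detection searches a much wider lag range, which is consistent with its larger share of the cycle count.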
Additional optimizations
  Use more DSPLIB routines
    Autocorrelation
  Assembly-level optimization
  Code size reduction?
  Reduce number of buffers to reduce L1D usage per frame
Energy per sample
  ‘C6211 consumes 1.24 W
    75% high activity / 25% low activity
  1.24 W / 80 channels = 15.5 mW/channel
  15.5 mJ/sec/channel * 1/8000 = 1.8 uJ/sample
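The energy estimate is power divided by channel count and sample rate. A small helper makes the arithmetic easy to rerun; `energy_per_sample_uj` is a hypothetical name. Evaluated with the inputs on this slide (1.24 W, 80 channels, 8000 samples/sec) it gives about 1.94 uJ; the quoted 1.8 uJ corresponds to roughly 86 channels.

```c
/* Energy per sample per channel, in microjoules.
   power_w: total chip power in watts
   channels: number of channels sharing the chip
   sample_rate_hz: samples per second per channel */
double energy_per_sample_uj(double power_w, unsigned channels,
                            double sample_rate_hz)
{
    double watts_per_channel = power_w / channels;
    return watts_per_channel / sample_rate_hz * 1e6;   /* J -> uJ */
}
```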
Number of channels
  150 x 10^6 cycles/sec x 0.02 sec/frame = 3.0 x 10^6 cycles/frame
  3.0 x 10^6 cycles/frame / 30,182 cycles = 99 channels
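The channel count is the per-frame cycle budget divided by the cycles one channel needs. A minimal sketch in integer arithmetic follows; the function names are hypothetical.

```c
#include <stdint.h>

/* Cycles available in one frame: clock rate times frame duration. */
uint64_t frame_cycle_budget(uint64_t clock_hz, uint32_t frame_ms)
{
    return clock_hz * frame_ms / 1000;
}

/* Whole channels that fit in the budget (rounded down). */
uint32_t max_channels(uint64_t budget_cycles, uint64_t cycles_per_frame)
{
    if (cycles_per_frame == 0) return 0;   /* avoid division by zero */
    return (uint32_t)(budget_cycles / cycles_per_frame);
}
```

With a 150 MHz clock and 20 ms frames the budget is 3,000,000 cycles, and 30,182 cycles per channel yields 99 channels, matching the slide.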
Memory
  ‘C6211 cache complicates estimates
  Performance is 85-99% of optimal for typical applications
  30,182 cycles becomes 35,508 cycles/frame at 85% efficiency
    => now support only 86 channels
Memory
  Try to account for off-chip memory transfers
    ~220,000 cycles for 150 ns fetches for 80 channels
    => support 75-80 channels
  Unable to verify/simulate because of unexpected tool expiration
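The cache and off-chip adjustments can be folded into one estimate. The sketch below assumes that the ~220,000 cycles of off-chip transfers are a fixed per-frame overhead and that cache efficiency simply scales the per-channel cycle count; `channels_derated` is a hypothetical name.

```c
#include <stdint.h>

/* Channels that fit once cache efficiency (percent of optimal) and a
   fixed per-frame overhead are taken out of the cycle budget. */
uint32_t channels_derated(uint64_t budget_cycles, uint64_t overhead_cycles,
                          uint64_t cycles_per_frame, uint32_t efficiency_pct)
{
    if (efficiency_pct == 0 || cycles_per_frame == 0 ||
        overhead_cycles >= budget_cycles)
        return 0;
    /* Effective cycles per channel, rounded up: cycles / (pct / 100). */
    uint64_t effective =
        (cycles_per_frame * 100 + efficiency_pct - 1) / efficiency_pct;
    return (uint32_t)((budget_cycles - overhead_cycles) / effective);
}
```

With a 3,000,000-cycle budget, 30,182 cycles per channel and 85% efficiency, this gives 84 channels before off-chip overhead (slightly below the 86 quoted) and 78 channels once 220,000 cycles are subtracted, which falls inside the 75-80 range.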
Memory
  L2 usage
    ~16 kB code size, thanks to VLIW
      512 32-byte instruction clusters
      More suited for ‘C6201 & larger processors
    Remaining used by data for channels
      480 bytes each (8.5% of remaining memory)
  L1 usage
    L1P: can’t tell because of cache
    L1D: 2.2 kB (~56%)
Tool comments
  Powerful, easy-to-use IDE… when it worked.
  Licensing problems for eval version
  Debugging support a bit odd
    puts/printf
C6x Conclusions
  Easily supports 75-80 channels of coding
  26 dB fixed-point SNR, 16-bit types
  VLIW = large code size
  Cache on a low-end DSP!
  Good tools, but draconian copy protection