VLSI 2014 IEEE TITLES

Zuara Technologies Battle with bugs

No.82, Station road, Radha nagar, Chrompet, Chennai-44. Mobile: 09095188016. Mail.id: [email protected]

Web site: www.zuaratech.com

82, Station road, Radha nagar,Chrompet Chennai-44

Mob.: 9095188016/9677465689

1. ASIC and FPGA Implementation of the Gaussian Mixture Model Algorithm for

Real-Time Segmentation of High Definition Video

Background identification is a common feature in many video processing systems. This paper

proposes two hardware implementations of the OpenCV version of

the Gaussian mixture model (GMM), a background identification algorithm. The implemented

version of the algorithm allows a fast initialization of the background model while an innovative,

hardware-oriented, formulation of the GMM equations makes the proposed circuits able to

perform real-time background identification on high-definition (HD) video sequences with frame

size 1920 × 1080. The first of the two circuits is designed with commercial field-programmable

gate-array (FPGA) devices as target. When implemented on Virtex6 vlx75t, the proposed circuit

processes 91 HD fps (frames per second) and uses 3% of FPGA logic resources. The second circuit

is oriented to the implementation in UMC-90 nm CMOS standard cell technology, and is

proposed in two versions. Both versions can process at a frame rate higher than 60 HD fps. The

first version uses the constant voltage scaling technique to provide a low power implementation.

It provides silicon area occupation of 28847 µm² and energy dissipation per pixel of 15.3

pJ/pixel. The second version is designed to reduce silicon area utilization and occupies 21847

µm² with an energy dissipation of 49.4 pJ/pixel.
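
The figures quoted in this abstract can be turned into a rough throughput and energy budget. The sketch below uses only the frame size, frame rates, and per-pixel energies stated above; the function names are illustrative and not taken from the paper.

```python
# Back-of-the-envelope figures derived only from numbers quoted in the abstract.
WIDTH, HEIGHT = 1920, 1080            # HD frame size
PIXELS_PER_FRAME = WIDTH * HEIGHT     # pixels in one HD frame


def pixel_rate(fps):
    """Pixels that must be processed per second at a given frame rate."""
    return PIXELS_PER_FRAME * fps


def energy_per_frame_uj(pj_per_pixel):
    """Energy per frame in microjoules, given energy per pixel in picojoules."""
    return PIXELS_PER_FRAME * pj_per_pixel * 1e-6   # pJ -> uJ


fpga_rate = pixel_rate(91)                   # FPGA version: 91 HD fps
asic_rate = pixel_rate(60)                   # ASIC versions: lower bound of 60 HD fps
low_power_uj = energy_per_frame_uj(15.3)     # low-power ASIC version
low_area_uj = energy_per_frame_uj(49.4)      # reduced-area ASIC version
```

Under these numbers the FPGA circuit handles roughly 189 million pixels per second, and the low-power ASIC version spends about 32 µJ per HD frame against about 102 µJ for the reduced-area version.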

2. Design and FPGA Implementation of High-Speed, Fixed-Latency Serial

Transceivers

Fixed-latency serial links are important components of the distributed measurement and control

systems. However, most high-speed Serializer-Deserializer (SerDes) chips do not keep the same

link latency after each power-up or reset. In this paper, we propose a fixed-latency

serial transceiver based on dynamic clock phase shifting and changeable delay tuning

technologies. Our solution can process all possible phase offsets between the transmitted and

received clocks, so it relaxes the requirement of fanning in the same reference clock both to the

transmitter and to the receiver. It also eliminates the reset-relock process in the roulette approach.

We present a specific example of implementation based on the serial transceiver in Xilinx Virtex

5 FPGA. The experimental results indicate that our transceiver can achieve a

deterministic latency with sub-nanosecond precision.

3. DART: A Programmable Architecture for NoC Simulation on FPGAs

The increased demand for on-chip communication bandwidth as a result of the multicore trend

has made packet-switched networks-on-chip (NoCs) a more compelling choice for the

communication backbone in next-generation systems. However, NoC designs have many power,

area, and performance tradeoffs in topology, buffer sizes, routing algorithms, and flow control

mechanisms; hence, the study of new NoC designs can be very time intensive. To address these

challenges, we propose DART, a fast and flexible FPGA-based NoC simulation architecture.

Rather than laying the NoC out in hardware on the FPGA like previous approaches, our design

virtualizes the NoC by mapping its components to a generic NoC simulation engine, composed

of a fully connected collection of fundamental components (e.g., routers and flit queues). This

approach has two main advantages: 1) since it is virtualized it can simulate any NoC, and 2)

any NoC can be mapped to the engine without rebuilding it, which can take significant time for a

large FPGA design. We demonstrate 1) that an implementation of DART on a Virtex-II Pro

FPGA can achieve over 100× speedup over the cycle-based software simulator Booksim,

while maintaining the same level of simulation accuracy, and 2) that a more modern Virtex-6

FPGA can accommodate a 49-node DART implementation.





4. Defense Against Primary User Emulation Attacks in Cognitive Radio Networks

Using Advanced Encryption Standard

This paper considers primary user emulation attacks in cognitive radio networks operating in the

white spaces of the digital TV (DTV) band. We propose a reliable AES-assisted DTV scheme, in

which an AES-encrypted reference signal is generated at the TV transmitter and used as the sync

bits of the DTV data frames. By allowing a shared secret between the transmitter and the

receiver, the reference signal can be regenerated at the receiver and used to achieve accurate

identification of the authorized primary users. In addition, when combined with the analysis on

the autocorrelation of the received signal, the presence of the malicious user can be detected

accurately whether or not the primary user is present. We analyze the effectiveness of the

proposed approach through both theoretical analysis and simulation examples. It is shown that

with the AES-assisted DTV scheme, the primary user, as well as the malicious user, can be detected

with high accuracy under primary user emulation attacks. It should be emphasized that the

proposed scheme requires no changes in hardware or system structure except for a plug-in AES

chip. Potentially, it can be applied directly to today's DTV system

under primary user emulation attacks for more efficient spectrum sharing.

5. Energy-Efficient Resource Allocation in OFDM Systems With Distributed Antennas

In this paper, we develop an energy-efficient resource-allocation scheme with proportional

fairness for downlink multiuser orthogonal frequency-division multiplexing

(OFDM) systems with distributed antennas. Our aim is to maximize energy efficiency (EE) under

the constraints of the overall transmit power of each remote access unit (RAU), proportional





fairness data rates, and bit error rates (BERs). Because of the nonconvex nature of the

optimization problem, obtaining the optimal solution is extremely computationally complex.

Therefore, we develop a low-complexity suboptimal algorithm, which separates

subcarrier allocation and power allocation. For the low-complexity algorithm, we first allocate

subcarriers by assuming equal power distribution. Then, by exploiting the properties of fractional

programming, we transform the nonconvex optimization problem in fractional form into an

equivalent optimization problem in subtractive form, which includes a tractable solution. Next,

an optimal energy-efficient power-allocation algorithm is developed to maximize EE while

maintaining proportional fairness. Through computer simulation, we demonstrate the

effectiveness of the proposed low-complexity algorithm and illustrate the fundamental tradeoff

between energy- and spectral-efficient transmission designs.

6. Design Flow for Flip-Flop Grouping in Data-Driven Clock Gating

Clock gating is a predominant technique used for power saving. It is observed that the commonly

used synthesis-based gating still leaves a large amount of redundant clock pulses. Data-

driven gating aims to disable these. To reduce the hardware overhead involved, flip-flops (FFs)

are grouped so that they share a common clock enabling signal. The question of what is

the group size maximizing the power savings is answered in a previous paper. Here we answer

the question of which FFs should be placed in a group to maximize the power reduction. We

propose a practical solution based on the toggling activity correlations of FFs and their physical

position proximity constraints in the layout. Our data-driven clock gating is integrated into an

Electronic Design Automation (EDA) commercial backend design flow, achieving total power

reduction of 15%-20% for various types of large-scale state-of-the-art industrial and





academic designs in 40 and 65 nanometer process technologies. These savings are achieved on

top of the savings obtained by clock gating synthesis performed by commercial EDA tools,

and gating manually inserted into the register transfer level design.
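
To illustrate why the choice of which FFs share an enable signal matters, the following minimal sketch counts the clock pulses delivered to flip-flops that did not need them. It assumes a group is clocked whenever any member changes state (an OR of the per-FF D/Q comparisons); the toggle traces and FF names are hypothetical, and the sketch ignores the layout proximity constraints that the paper also takes into account.

```python
def wasted_pulses(traces, groups):
    """Count clock pulses delivered to flip-flops that did not need them.

    traces[f][t] is 1 when flip-flop f changes state in cycle t (D != Q).
    A group is clocked in cycle t when any of its members toggles.
    """
    cycles = len(next(iter(traces.values())))
    wasted = 0
    for group in groups:
        for t in range(cycles):
            enable = any(traces[f][t] for f in group)
            if enable:
                # Every member receives the pulse; count those that stay idle.
                wasted += sum(1 for f in group if not traces[f][t])
    return wasted


# Hypothetical toggle traces for four flip-flops over six cycles.
traces = {
    "ff_a": [1, 0, 0, 1, 0, 0],
    "ff_b": [1, 0, 0, 1, 0, 0],
    "ff_c": [0, 1, 0, 0, 1, 0],
    "ff_d": [0, 1, 0, 0, 1, 0],
}
correlated = [("ff_a", "ff_b"), ("ff_c", "ff_d")]     # group FFs that toggle together
uncorrelated = [("ff_a", "ff_c"), ("ff_b", "ff_d")]   # group FFs that toggle apart

wasted_correlated = wasted_pulses(traces, correlated)
wasted_uncorrelated = wasted_pulses(traces, uncorrelated)
```

With these traces the correlated grouping wastes no pulses, while the uncorrelated grouping wastes eight, which is the effect that grouping by toggling correlation is meant to exploit.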

7. Effect of Image Downsampling on Steganographic Security

The accuracy of steganalysis in digital images primarily depends on the statistical properties of

neighboring pixels, which are strongly affected by the image acquisition pipeline as well as any

processing applied to the image. In this paper, we study how the detectability of embedding

changes is affected when the cover image is downsampled prior to embedding. This topic is

important for practitioners because the vast majority of images posted on





websites, image sharing portals, or attached to e-mails are downsampled. It is also relevant to

researchers as the security of steganographic algorithms is commonly evaluated on databases of

downsampled images. In the first part of this paper, we investigate empirically how the

steganalysis results depend on the parameters of the resizing algorithm: the choice of the

interpolation kernel, the scaling factor (resize ratio), antialiasing, and the downsampled pixel

grid alignment. We report on several novel phenomena that appear valid universally across the

tested cover sources, steganographic methods, and steganalysis features. This paper continues

with a theoretical analysis of the simplest interpolation kernel - the box kernel. By fitting a

Markov chain model to pixel rows, we analytically compute the Fisher information rate for any

mutually independent embedding operation and derive the proper scaling of the secure payload

with resizing. For least significant bit (LSB) matching and a limited range of downscaling, the

theory fits experiments rather well, which indicates the existence of a new scaling law expressing

the length of the secure payload when the cover size is modified by subsampling.

8. An FPGA-Based Fully Synchronized Design of a Bilateral Filter for Real-Time

Image Denoising

In this paper, a detailed description of a synchronous field-programmable gate array

implementation of a bilateral filter for image processing is given. The bilateral filter is chosen for

one unique reason: It reduces noise while preserving details. The design is described on register-

transfer level. The distinctive feature of our design concept consists of changing the clock

domain in a manner that kernel-based processing is possible, which means the processing of the

entire filter window at one pixel clock cycle. This feature of the kernel-based design is supported

by the arrangement of the input data into groups so that the internal clock of the design is a





multiple of the pixel clock given by a targeted system. Additionally, by the exploitation of the

separability and the symmetry of one filter component, the complexity of the design is widely

reduced. Combining these features, the bilateral filter is implemented as a highly parallelized

pipeline structure with very economical and effective utilization of dedicated resources. Due to

the modularity of the filter design, kernels of different sizes can be implemented with low effort

using our design and given instructions for scaling. As the original form of the bilateral filter with

no approximations or modifications is implemented, the resulting image quality depends on the

chosen filter parameters only. Due to the quantization of the filter coefficients, only negligible

quality loss is introduced.

9. Subjective evaluation of HEVC and AVC/H.264 in mobile environments

This paper compares the quality of AVC/H.264 and HEVC encoded video in low bandwidth

mobile environments. In this study, the focus within the mobile environment is smart phones.

The key characteristics of a smart phone are smaller screen size, which is usually 3.5 inches

diagonal to 5.0 inches diagonal for high end smart phones and typical cellular network

bandwidth, which is 3G or faster. Subjective evaluations were conducted to evaluate the user

experience on a mobile device with a small screen size and video coded at 200 and 400 Kbps.

The studies showed compelling evidence that a user's experience in low bandwidth mobile

environments is very similar between HEVC and AVC/H.264. The results suggest the benefits of

HEVC over AVC/H.264 in a mobile environment with lower video bitrates and resolutions are

not as clear.





10. Improved Method to Select the Lagrange Multiplier for Rate-Distortion Based

Motion Estimation in Video Coding

The motion estimation (ME) process used in the H.264/AVC reference software is based on

minimizing a cost function that involves two terms (distortion and rate) that are properly

balanced through a Lagrangian parameter, usually denoted as λ_motion. In this paper we propose

an algorithm to improve the conventional way of estimating λ_motion and, consequently, the ME

process. First, we show that the conventional estimation of λ_motion turns out to be significantly

less accurate when ME-compromising events, which cause the ME process to perform poorly,

happen. Second, with the aim of improving the coding efficiency in these cases, an efficient

algorithm is proposed that allows the encoder to choose between three different values of

λ_motion for the Inter 16x16 partition size. To be more precise, for this partition size, the

proposed algorithm allows the encoder to additionally test λ_motion = 0 and λ_motion arbitrarily

large, which corresponds to minimum distortion and minimum rate solutions, respectively. By

testing these two extreme values, the algorithm avoids making large ME errors. The

experimental results on video segments exhibiting this type of ME-compromising events reveal

an average rate reduction of 2.20% for the same coding quality with respect to the JM15.1

reference software of H.264/AVC. The algorithm has also been tested in comparison with a

state-of-the-art algorithm called context adaptive Lagrange multiplier. Additionally, two

illustrative examples of the subjective performance improvement are provided.
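
The role of the Lagrangian parameter can be sketched with a toy cost function J = D + λ·R. The candidate names, distortions, and rates below are hypothetical, and this is only an illustration of why λ = 0 yields the minimum-distortion choice and a very large λ yields the minimum-rate choice, not the selection algorithm of the paper.

```python
def best_candidate(candidates, lam):
    """Return the candidate minimizing J = D + lam * R.

    Each candidate is a tuple (name, distortion, rate_bits).
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])


# Hypothetical motion vector candidates for one 16x16 block.
candidates = [
    ("mv_accurate", 100, 40),   # low distortion, expensive to code
    ("mv_balanced", 150, 12),
    ("mv_cheap", 400, 2),       # high distortion, cheap to code
]

min_distortion = best_candidate(candidates, 0)       # lambda = 0
conventional = best_candidate(candidates, 4)         # a moderate lambda
min_rate = best_candidate(candidates, 1e6)           # lambda arbitrarily large
```

Testing the two extremes alongside the conventional value gives the encoder three outcomes to compare, which is the kind of safeguard the abstract describes.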

11. An Overview of Information Hiding in H.264/AVC Compressed Video

Information hiding refers to the process of inserting information into a host to serve specific

purpose(s). In this paper, information hiding methods in the H.264/AVC compressed video

domain are surveyed. First, the general framework of information hiding is conceptualized by

relating the state of an entity to a meaning (i.e., sequences of bits). This concept is illustrated by





using various data representation schemes such as bit plane replacement, spread spectrum,

histogram manipulation, divisibility, mapping rules, and matrix encoding. Venues at which

information hiding takes place are then identified, including prediction process, transformation,

quantization, and entropy coding. Related information hiding methods at each venue are briefly

reviewed, along with the presentation of the targeted applications, appropriate diagrams, and

references. A timeline diagram is constructed to chronologically summarize the invention of

information hiding methods in the compressed still image and video domains since 1992. A

comparison among the considered information hiding methods is also conducted in terms of

venue, payload, bitstream size overhead, video quality, computational complexity, and video

criteria. Further perspectives and recommendations are presented to provide a better

understanding of the current trend of information hiding and to identify new opportunities for

information hiding in compressed video.

12. VLSI Architecture Design of Guided Filter for 30 Frames/s Full-HD

Video

Filtering is widely used in image and video processing for various applications. Recently, the

guided filter has been proposed and has become one of the popular filtering methods. In this paper, to

meet the computational demand of guided filtering in full-HD video, a double integral image

architecture for guided filter ASIC design is proposed. In addition, a reformulation of the guided

filter formula is proposed, which can prevent the error resulting from truncation in the fractional

part and modify the regularization parameter on the user's demand. The hardware architecture of

the guided image filter is then proposed and can be embedded in mobile devices to achieve real-

time HD applications. To the best of our knowledge, this paper is also the first ASIC design for

guided image filter. With a TSMC 90-nm cell library, the design can operate at 100 MHz and





support Full-HD (1920 × 1080) video at 30 frames/s with 92.9K gate counts and 3.2 KB on-chip

memory. Moreover, for the hardware efficiency, our architecture is also the best compared to

other previous works with bilateral filter.
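
The integral image mentioned in this abstract is a general technique for computing box sums in constant time per window. The sketch below shows the plain software form of that idea, assuming a small list-of-lists image; it is not the double integral image hardware architecture of the paper.

```python
def integral_image(img):
    """Build a summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum of img[top..bottom][left..right] (inclusive) from four lookups."""
    return (sat[bottom + 1][right + 1] - sat[top][right + 1]
            - sat[bottom + 1][left] + sat[top][left])


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
sat = integral_image(img)
whole = box_sum(sat, 0, 0, 2, 2)          # sum of every pixel
lower_right = box_sum(sat, 1, 1, 2, 2)    # sum of 5, 6, 8, 9
```

Because each window sum needs only four table reads regardless of window size, the mean filters inside a guided filter become independent of the kernel radius.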

13. Property Analysis of XOR-Based Visual Cryptography

A (k,n) visual cryptographic scheme (VCS) encodes a secret image into n shadow images

(printed on transparencies) distributed among n participants. When any k participants

superimpose their transparencies on an overhead projector (OR operation), the secret image can

be visually revealed by a human visual system without computation. However, the monotone

property of OR operation degrades the visual quality of reconstructed image for OR-based VCS

(OVCS). Accordingly, XOR-based VCS (XVCS), which uses XOR operation for decoding, was

proposed to enhance the contrast. In this paper, we investigate the relation between OVCS and

XVCS. Our main contribution is to theoretically prove that the basis matrices of (k,n)-OVCS can

be used in (k,n)-XVCS. Meanwhile, the contrast is enhanced 2^(k-1) times.
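
For the smallest case, a (2,2) scheme, the contrast gain can be checked by hand. The sketch below uses the commonly cited (2,2) basis matrices and assumes contrast is measured as the difference in Hamming weight between a stacked black pixel and a stacked white pixel, divided by the pixel expansion; the random column permutation used in a real scheme is omitted.

```python
# Basis matrices of a (2,2) scheme: one row per share, one column per subpixel.
S_WHITE = [[1, 0], [1, 0]]   # both shares carry the same pattern
S_BLACK = [[1, 0], [0, 1]]   # shares carry complementary patterns


def stack(rows, op):
    """Combine share rows column by column with the given bit operation."""
    out = rows[0]
    for row in rows[1:]:
        out = [op(a, b) for a, b in zip(out, row)]
    return out


def contrast(op):
    """(weight of stacked black - weight of stacked white) / pixel expansion."""
    m = len(S_WHITE[0])
    return (sum(stack(S_BLACK, op)) - sum(stack(S_WHITE, op))) / m


or_contrast = contrast(lambda a, b: a | b)    # stacking transparencies
xor_contrast = contrast(lambda a, b: a ^ b)   # XOR-based decoding
```

Here the OR contrast is 0.5 and the XOR contrast is 1.0, a factor of 2 = 2^(k-1) for k = 2, using the same basis matrices in both cases.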

14. Effectiveness of Leakage Power Analysis Attacks on DPA-Resistant Logic Styles

Under Process Variations

This paper extends the analysis of the effectiveness of Leakage Power Analysis (LPA) attacks to

cryptographic VLSI circuits on which circuit level countermeasures against Differential Power

Analysis (DPA) are adopted. Security metrics used for assessing the DPA-resistance of crypto

core implementations, such as the minimum number to disclosure (MTD) and the asymptotic

correlation coefficient, have been extended to the case of LPA. The LPA-resistance has been

evaluated in terms of MTD as a function of the on-chip noise. Noise variances up to 10000 times

greater than the signal variance have been taken into account and LPA attacks have been





successfully executed for all the logic styles under analysis using fewer than 100000

measurements. Moreover, the role of process variations has been investigated through extensive

Monte Carlo simulations in order to evaluate their impact on the leakage model for the logic

styles under analysis. Results show that LPA attacks can be successfully carried out on the

different anti-DPA logic styles even in the presence of process variations. To the best of our

knowledge, this work proves for the first time the effectiveness of LPA attacks in a real scenario

where on-chip noise and process variations are taken into account.

15. Data Hiding in Encrypted H.264/AVC Video Streams by Codeword Substitution

Digital video sometimes needs to be stored and processed in an encrypted format to maintain

security and privacy. For the purpose of content notation and/or tampering detection, it is

necessary to perform data hiding in these encrypted videos. In this way, data hiding in encrypted

domain without decryption preserves the confidentiality of the content. In addition, it is more

efficient without decryption followed by data hiding and re-encryption. In this paper, a novel

scheme of data hiding directly in the encrypted version of H.264/AVC video stream is proposed,

which includes the following three parts, i.e., H.264/AVC video encryption, data embedding, and

data extraction. By analyzing the property of H.264/AVC codec, the codewords of

intraprediction modes, the codewords of motion vector differences, and the codewords of

residual coefficients are encrypted with stream ciphers. Then, a data hider may embed additional

data in the encrypted domain by using a codeword substitution technique, without knowing the

original video content. In order to adapt to different application scenarios, data extraction can be

done either in the encrypted domain or in the decrypted domain. Furthermore, video file size is

strictly preserved even after encryption and data embedding. Experimental results have

demonstrated the feasibility and efficiency of the proposed scheme.





16. Optimal Transport for Secure Spread-Spectrum Watermarking of Still Images

This paper studies the impact of secure watermark embedding in digital images by proposing a

practical implementation of secure spread-spectrum watermarking using distortion optimization.

Because strong security properties (key-security and subspace-security) can be achieved using

natural watermarking (NW) since this particular embedding leaves the distribution of the host and

watermarked signals unchanged, we use elements of transportation theory to minimize the global

distortion. Next, we apply this new modulation, called transportation NW (TNW), to design a

secure watermarking scheme for grayscale images. The TNW uses a multiresolution image

decomposition combined with a multiplicative embedding which is taken into account at the

distribution level. We show that the distortion solely relies on the variance of the wavelet

subbands used during the embedding. In order to maximize a target robustness after JPEG

compression, we select different combinations of subbands offering the lowest Bit Error Rates

for a target PSNR ranging from 35 to 55 dB and we propose an algorithm to select them. The use

of transportation theory also provides an average PSNR gain of 3.6 dB with respect to

the previous embedding for a set of 2000 images.

17. Impulse Noise Estimation and Removal for OFDM Systems

Orthogonal Frequency Division Multiplexing (OFDM) is a modulation scheme that is widely

used in wired and wireless communication systems. While OFDM is ideally suited to deal with

frequency selective channels and AWGN, its performance may be dramatically impacted by the

presence of impulse noise. In fact, very strong noise impulses in the time domain might result in

the erasure of whole OFDM blocks of symbols at the receiver. Impulse noise can be mitigated by

considering it as a sparse signal in time, and using recently developed algorithms for sparse

signal reconstruction. We propose an algorithm that utilizes the guard band null subcarriers for

the impulse noise estimation and cancellation. Instead of relying on ℓ1 minimization as done





in some popular general-purpose compressive sensing schemes, the proposed method jointly

exploits the specific structure of this problem and the available a priori information for sparse

signal recovery. The computational complexity of the proposed algorithm is very competitive

with respect to sparse signal reconstruction schemes based on ℓ1 minimization. The proposed

method is compared with other state-of-the-art methods in terms of achievable rates

for an OFDM system with impulse noise and AWGN.

18. Bit-Level Optimization of Adder-Trees for Multiple Constant Multiplications

for Efficient FIR Filter Implementation

Multiple constant multiplication (MCM) scheme is widely used for implementing transposed

direct-form FIR filters. While the research focus of MCM has been on more effective common

subexpression elimination, the optimization of adder-trees, which sum up the computed sub-

expressions for each coefficient, is largely omitted. In this paper, we have identified the resource

minimization problem in the scheduling of adder-tree operations for the MCM block, and

presented a mixed integer programming (MIP) based algorithm for more efficient MCM-based

implementation of FIR filters. Experimental results show that up to 15% reduction of area and

11.6% reduction of power (with an average of 8.46% and 5.96% respectively) can be achieved

on top of the already optimized adder/subtractor network of the MCM block.
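
As background for the MCM block itself, the following sketch multiplies one input by several constants using only shifts, additions, and subtractions, sharing the intermediate term 7x. The constants 7, 105, and 119 are hypothetical coefficients chosen for illustration; the adder-tree scheduling and the MIP formulation of the paper are not modelled.

```python
def mcm_shift_add(x):
    """Multiply x by the constants 7, 105 and 119 with shifts and adds only."""
    x7 = (x << 3) - x          # 7x   = 8x - x
    x105 = (x7 << 4) - x7      # 105x = 16 * 7x - 7x  (reuses 7x)
    x119 = (x7 << 4) + x7      # 119x = 16 * 7x + 7x  (reuses 7x)
    return {7: x7, 105: x105, 119: x119}


products = mcm_shift_add(5)
```

Three adder/subtractor operations produce all three products here, whereas computing each constant independently would repeat the work done for the shared term.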

19. Frequency Estimation of Distorted and Noisy Signals in Power Systems by FFT-

Based Approach





This paper focuses on the accurate frequency estimation of power signals corrupted by a

stationary white noise. The noneven item interpolation FFT based on the triangular self-

convolution window is described. A simple analytical expression for the variance of noise

contribution on the frequency estimation is derived, which shows the variances of frequency

estimation are proportional to the energy of the adopted window. Based on the proposed method,

the noise level of the measurement channel can be estimated, and optimal parameters (e.g.,

sampling frequency and window length) of the interpolation FFT algorithm that minimize the

variances of frequency estimation can thus be determined. The application in a power quality

analyzer verified the usefulness of the proposed method.

20. Accurate and Efficient On-Chip Spectral Analysis for Built-In Testing and

Calibration Approaches

The fast Fourier transform (FFT) algorithm is widely used as a standard tool to carry out spectral

analysis because of its computational efficiency. However, the presence of multiple tones

frequently requires a fine frequency resolution to achieve sufficient accuracy, which imposes the

use of a large number of FFT points that results in large area and power overheads. In this paper,

an FFT method is proposed for on-chip spectral analysis of multi-tone signals with particular

harmonic and intermodulation components. This accurate FFT analysis approach is based on

coherent sampling, but it requires a significantly smaller number of points to make

the FFT realization more suitable for on-chip built-in testing and calibration applications that

require area and power efficiency. The technique was assessed by comparing the simulation

results from the proposed method of single and multiple tones with the simulation results

obtained from the FFT of coherently sampled tones. The results indicate that the proper selection





of test tone frequencies can avoid spectral leakage even with multiple narrowly spaced tones.

When low-frequency signals are captured with an analog-to-digital converter (ADC) for on-chip

analysis, the overall accuracy is limited by the ADC's resolution, linearity, noise, and bandwidth

limitations. Post-layout simulations of a 16-point FFT showed that third-order intermodulation

(IM3) testing with two tones can be performed with 1.5-dB accuracy for IM3 levels of up to 50

dB below the fundamental tones that are quantized with a 10-bit resolution. In a 45-nm CMOS

technology, the layout area of the 16-point FFT for on-chip built-in testing is 0.073 mm², and its

estimated power consumption is 6.47 mW.
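
The link between coherent sampling and spectral leakage can be seen with a direct 16-point DFT. The sketch below is a generic illustration that assumes ideal, unquantized sine tones; it does not reproduce the reduced-point FFT method of the paper.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each DFT bin, normalized by the number of samples."""
    n = len(samples)
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(samples))) / n
            for k in range(n)]


def tone(cycles, n):
    """n samples of a unit sine completing `cycles` periods in the window."""
    return [math.sin(2 * math.pi * cycles * i / n) for i in range(n)]


N = 16
coherent = dft_magnitudes(tone(3, N))     # integer number of cycles in the window
leaky = dft_magnitudes(tone(3.5, N))      # non-integer number of cycles

occupied_bins = [k for k, m in enumerate(coherent) if m > 1e-9]
```

With three full cycles in the window, all energy lands in bin 3 and its mirror bin 13. With 3.5 cycles the energy spreads into neighbouring bins, including the DC bin.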

21. Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low

Adaptation-Delay

In this paper, we present an efficient architecture for the implementation of a delayed least mean

square adaptive filter. For achieving lower adaptation-delay and area-delay-power efficient

implementation, we use a novel partial product generator and propose a strategy for optimized

balanced pipelining across the time-consuming combinational blocks of the structure. From

synthesis results, we find that the proposed design offers nearly 17% less area-delay product

(ADP) and nearly 14% less energy-delay product (EDP) than the best of the existing systolic

structures, on average, for filter lengths N=8, 16, and 32. We propose an efficient fixed-point

implementation scheme of the proposed architecture, and derive the expression for steady-state

error. We show that the steady-state mean squared error obtained from the analytical result

matches with the simulation result. Moreover, we have proposed a bit-level pruning of the

proposed architecture, which provides nearly 20% saving in ADP and 9% saving in EDP over

the proposed structure before pruning without noticeable degradation of steady-state-error

performance.
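
A delayed LMS filter updates its weights with an error that is a few samples old, which is what allows the hardware to be pipelined. The following floating-point sketch identifies a hypothetical 4-tap system with the update w(n+1) = w(n) + mu · e(n − m) · x(n − m); the step size, delay, and test system are assumptions, and the fixed-point and pruning aspects of the paper are not modelled.

```python
import random


def delayed_lms(x, d, taps, mu, delay):
    """Adapt FIR weights using an error and input vector that are `delay` steps old."""
    w = [0.0] * taps
    history = []                         # (error, input vector) for every step
    for n in range(len(x)):
        x_vec = [x[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * xk for wk, xk in zip(w, x_vec))
        history.append((d[n] - y, x_vec))
        if n >= delay:
            e_old, x_old = history[n - delay]
            w = [wk + mu * e_old * xk for wk, xk in zip(w, x_old)]
    return w


random.seed(0)
true_w = [0.5, -0.3, 0.2, 0.1]           # hypothetical unknown system
x = [random.uniform(-1, 1) for _ in range(5000)]
d = [sum(true_w[k] * x[n - k] for k in range(len(true_w)) if n - k >= 0)
     for n in range(len(x))]

w_est = delayed_lms(x, d, taps=4, mu=0.05, delay=3)
```

With a small step size the estimate still converges to the true weights despite the stale error. A larger adaptation delay generally forces a smaller step size, which is one reason a low adaptation delay is desirable.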





22. Efficient Integer DCT Architectures for High Efficiency Video CODEC standard

In this paper, we present area- and power-efficient architectures for the implementation of

integer discrete cosine transform (DCT) of different lengths to be used in High Efficiency Video

Coding (HEVC). We show that an efficient constant matrix-multiplication scheme can be used to

derive parallel architectures for 1-D integer DCT of different lengths. We also show that the

proposed structure could be reusable for DCT of lengths 4, 8, 16, and 32 with a throughput of 32

DCT coefficients per cycle irrespective of the transform size. Moreover, the proposed

architecture could be pruned to reduce the complexity of implementation substantially with only

a marginal effect on the coding performance. We propose power-efficient structures for folded

and full-parallel implementations of 2-D DCT. From the synthesis result, it is found that the

proposed architecture involves nearly 14% less area-delay product (ADP) and 19% less energy

per sample (EPS) compared to the direct implementation of the reference algorithm, on average,

for integer DCT of lengths 4, 8, 16, and 32. Also, an additional 19% saving in ADP and 20%

saving in EPS can be achieved by the proposed pruning algorithm with nearly the same

throughput rate. The proposed architecture is found to support ultrahigh definition 7680 × 4320

at 60 frames/s video, which is one of the applications of HEVC.
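
As a small illustration of the constant matrix multiplication involved, the sketch below applies the 4-point HEVC core transform matrix to an input vector, first directly and then through the standard even-odd (butterfly) decomposition. Outputs are left unscaled, with no rounding or shift stages, and the decomposition shown is the textbook one rather than the specific architecture proposed in the paper.

```python
# 4-point integer core transform matrix used in HEVC.
HEVC_DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]


def dct4_direct(x):
    """Plain matrix-vector product: 16 multiplications."""
    return [sum(c * v for c, v in zip(row, x)) for row in HEVC_DCT4]


def dct4_butterfly(x):
    """Even-odd decomposition: 6 multiplications."""
    e0, e1 = x[0] + x[3], x[1] + x[2]    # even part
    o0, o1 = x[0] - x[3], x[1] - x[2]    # odd part
    return [64 * (e0 + e1),
            83 * o0 + 36 * o1,
            64 * (e0 - e1),
            36 * o0 - 83 * o1]


sample = [10, -3, 7, 2]
direct_out = dct4_direct(sample)
butterfly_out = dct4_butterfly(sample)
```

Both routes give identical coefficients. The same symmetry extends to the longer transform lengths, which is part of what makes one structure reusable across sizes.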

23. Low-Cost Low-Power ASIC Solution for Both DAB+ and DAB Audio Decoding

DAB+ is the upgraded version of digital audio broadcasting (DAB). DAB and DAB+ coexist in

many countries, so receivers are required to be compatible with both standards. In this paper, a

solution integrating an MPEG1-LayerII (MP2) decoder and an advanced audio coding

(AAC) low-complexity (AAC LC) decoder is proposed to provide basic audio decoding for both

DAB and DAB+. It also utilizes simple methods to improve high frequencies and stereo quality

instead of complicated spectrum band replication and parametric stereo. A highly integrated low-

power audio decoder design compatible with DAB/DAB+ and using a purely ASIC approach is





presented. As a result of the system structure optimization and hardware sharing, the audio

decoder is fabricated in 1P4M 0.18-µm CMOS technology using only 3.2 mm² silicon area

(including 147 456 bits RAM and 170 496 bits ROM). The power consumption of the audio

decoder is 10.4 mW for DAB audio decoding and 8.5 mW for DAB+ audio decoding.

Laboratory and field tests show that the function is correct and the audio quality is good for

receiving both DAB and DAB+. The audio decoder is thus proven to be a low-cost low-

power solution for the two existing DAB standards.

24. Low-Power Digital Signal Processor Architecture for Wireless Sensor Nodes

Radio communication exhibits the highest energy consumption in wireless sensor nodes. Given

their limited energy supply from batteries or scavenging, these nodes must trade data

communication for on-the-node computation. Currently, they are designed around off-the-

shelf low-power microcontrollers. But by employing a more appropriate processing element, the

energy consumption can be significantly reduced. This paper describes the design and

implementation of the newly proposed folded-tree architecture for on-the-node data processing

in wireless sensor networks, using parallel prefix operations and data locality in hardware.

Measurements of the silicon implementation show an improvement of 10-20× in terms of energy

as compared to traditional modern micro-controllers found in sensor nodes.
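
A parallel prefix operation, as mentioned above, can be sketched in Python as follows. This is a generic Hillis-Steele style inclusive scan, chosen here only for illustration; it is not the paper's folded-tree mapping, and the function name `parallel_prefix` is an assumption.

```python
import operator


def parallel_prefix(values, op=operator.add):
    """Inclusive scan in about log2(n) steps (Hillis-Steele style).

    Within one step every element update is independent of the others,
    so hardware could perform them all at once; this sketch simply runs
    them one after another.
    """
    result = list(values)
    stride = 1
    while stride < len(result):
        previous = result[:]  # snapshot: all updates in a step read old values
        for i in range(stride, len(result)):
            result[i] = op(previous[i - stride], previous[i])
        stride *= 2
    return result


print(parallel_prefix([3, 1, 4, 1, 5, 9, 2, 6]))   # running sums
print(parallel_prefix([3, 1, 4, 1, 5], max))       # running maximum
```

Any associative operator can be passed as `op`, which is why one prefix structure can serve several sensor-data reductions such as sums, maxima, or threshold counts.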

25. Memory Footprint Reduction for Power-Efficient Realization of 2-D Finite

Impulse Response Filters

We have analyzed memory footprint and combinational complexity to arrive at a systematic

design strategy to derive area-delay-power-efficient architectures for two-dimensional (2-D)

finite impulse response (FIR) filter. We have presented novel block-based structures for

separable and non-separable filters with less memory footprint by memory sharing and memory-

reuse along with appropriate scheduling of computations and design of storage architecture. The

proposed structures involve L times less storage per output (SPO), and nearly L times less energy

consumption per output (EPO) compared with the existing structures, where L is the input block-

size. They involve L times more arithmetic resources than the best of the corresponding existing

structures, and produce L times more throughput with less memory band-width (MBW) than

others. We have also proposed separate generic structures for separable and non-separable filter-

banks, and a unified structure of filter-bank constituting symmetric and general filters. The

proposed unified structure for 6 parallel filters involves nearly 3.6L times more multipliers, 3L

times more adders, (N²-N+2) fewer registers than a similar existing unified structure, and computes

6L times more filter outputs per cycle with 6L times less MBW than the existing design, where

N is the FIR filter size in each dimension. ASIC synthesis results show that for filter size (4 × 4),

input-block size L=4, and image-size (512 × 512), the proposed block-based non-separable and

generic non-separable structures, respectively, involve 5.95 times and 11.25 times less area-

delay-product (ADP), and 5.81 times and 15.63 times less EPO than the corresponding existing

structures. The proposed unified structure involves 4.64 times less ADP and 9.78 times less EPO

than the corresponding existing structure.

26. Ultra-High Throughput Low-Power Packet Classification

Packet classification is used by networking equipment to sort packets into flows by comparing

their headers to a list of rules, with packets placed in the flow determined by the matched rule. A

flow is used to decide a packet's priority and the manner in which it is processed. Packet

classification is a difficult task due to the fact that all packets must be processed at wire speed

and rulesets can contain tens of thousands of rules. The contribution of this paper is a hardware

accelerator that can classify up to 433 million packets per second when using rule sets containing

tens of thousands of rules with a peak power consumption of only 9.03 W when using a Stratix

III field-programmable gate array (FPGA). The hardware accelerator uses a modified version of

the HyperCuts packet classification algorithm, with a new pre-cutting process used to reduce the

amount of memory needed to save the search structure for large rulesets so that it is small

enough to fit in the on-chip memory of an FPGA. The modified algorithm also removes the need

for floating point division to be performed when classifying a packet, allowing higher clock

speeds and thus obtaining higher throughputs.

27. A Configurable and Low-Power Mixed Signal SoC for Portable ECG Monitoring

Applications

This paper describes a mixed-signal ECG System-on-chip (SoC) that is capable of implementing

configurable functionality with low-power consumption

for portable ECG monitoring applications. A low-voltage and high performance analog front-end

extracts 3-channel ECG signals and single channel impedance measurement with

high signal quality. A custom digital signal processor provides the configurability and advanced

functionality like motion artifact removal and R peak detection. The SoC is implemented in

0.18 µm CMOS process and consumes a minimum of 31.1 µW from a 1.2 V supply.

28. Partial Access Mode: New Method for Reducing Power Consumption of Dynamic

Random Access Memory

Demands have been placed on a dynamic random access memory (DRAM) to not only have

increased memory capacity and data transfer speed, but also have reduced operating and standby

currents. When a system uses a DRAM, a refresh operation is necessary because of its data

retention time restriction: each bit of the DRAM is stored as an amount of electrical charge in a

storage capacitor that is discharged by the leakage current. Power consumption for the refresh

operation increases in proportion to the memory capacity. We propose

a new method to reduce the refresh power consumption by effectively extending the memory cell

retention time. Conversion from 1 cell/bit to $2^{N}$ cells/bit reduces the variation in the

retention time among memory cells. Although active power increases by a factor of $2^{N}$,

the refresh time increases by more than $2^{N}$ as a consequence of the fact that the majority

decision does better than averaging for the tail distribution of retention time. The conversion can

be realized very simply from the structure of the DRAM array circuit, and it reduces the

frequency of disturbance and power consumption by two orders of magnitude. On the basis of

this conversion method, we propose

a partial access mode to reduce power consumption dynamically when the full memory capacity

is not required.

29. Reliability-Oriented Placement and Routing Algorithm for SRAM-Based FPGAs

As the feature size shrinks to the nanometer scale, SRAM-based FPGAs will become

increasingly vulnerable to soft errors. Existing reliability-

oriented placement and routing approaches primarily focus on reducing the fault occurrence

probability (node error rate) of soft errors. However, our analysis shows that, besides the fault

occurrence probability, the propagation probability (error propagation probability) plays an

important role and should be taken into consideration. In this paper, we first propose a cube-

based analysis algorithm to efficiently and accurately estimate the error propagation

probability. Based on such a model, we propose a novel reliability-

oriented placement and routing algorithm that combines both the fault occurrence probability and

the error propagation probability together to enhance system-level robustness against soft errors.

Experimental results show that, compared with the baseline versatile place and route technique,

the proposed scheme can reduce the failure rate by 20.73%, and increase the mean time between

failures by 39.44%.

30. Time-Based All-Digital Technique for Analog Built-in Self-Test

A scheme for built-in self-test of analog signals with minimal area overhead for measuring on-

chip voltages in an all-digital manner is presented. The method is well suited for a distributed

architecture, where the routing of analog signals over long paths is minimized. A clock is routed

serially to the sampling heads placed at the nodes of analog test voltages. This sampling head

present at each test node, which consists of a pair of delay cells and a pair of flip-flops, locally

converts the test voltage to a skew between a pair of subsampled signals, thus giving rise to as

many subsampled signal pairs as the number of nodes. To measure a certain analog voltage, the

corresponding subsampled signal pair is fed to a delay measurement unit to measure the skew

between this pair. The concept is validated by designing a test chip in a UMC 130-nm CMOS

process. Sub-millivolt accuracy for static signals is demonstrated for a measurement time of a

few seconds, and an effective number of bits of 5.29 is demonstrated for low-bandwidth signals

in the absence of sample-and-hold circuitry.

31. Improved 8-Point Approximate DCT for Image and Video Compression Requiring

Only 14 Additions

The demand for video processing systems such as HEVC with low energy consumption in the

multimedia market has led to extensive development of fast algorithms for the efficient

approximation of 2-D DCT transforms. The DCT is employed in a multitude of compression

standards due to its remarkable energy compaction properties. Multiplier-free approximate DCT

transforms have been proposed that offer superior compression performance at very low circuit

complexity. Such approximations can be realized in digital VLSI hardware using additions and

subtractions only, leading to significant reductions in chip area and power consumption

compared to conventional DCTs and integer transforms. In this paper, we introduce a novel 8-

point DCT approximation that requires only 14 addition operations and no multiplications. The

proposed transform possesses low computational complexity and is compared to state-of-the-art

DCT approximations in terms of both algorithm complexity and peak signal-to-noise ratio. The

proposed DCT approximation is a candidate for reconfigurable video standards such as HEVC.

The proposed transform and several other DCT approximations are mapped to systolic-array

digital architectures and physically realized as digital prototype circuits using FPGA technology

and mapped to 45 nm CMOS technology.

32. Reconfigurable CORDIC-Based Low-Power DCT Architecture Based on Data

Priority

This paper presents a low-power coordinate rotation digital computer (CORDIC)-based

reconfigurable discrete cosine transform (DCT) architecture. The main idea of this paper is based

on the interesting fact that all the computations in DCT are not equally important in generating

the frequency domain outputs. Considering the importance difference in the DCT coefficients,

the number of CORDIC iterations can be dynamically changed to efficiently tradeoff image

quality for power consumption. Thus, the computational energy can be significantly reduced

without seriously compromising the image quality. The proposed CORDIC-based 2-D DCT

architecture is implemented using a 0.13 µm CMOS process, and the experimental results show

that our reconfigurable DCT achieves power savings ranging from 22.9% to 52.2% over the

CORDIC-based Loeffler DCT at the cost of minor image quality degradations.
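
The trade-off between the number of CORDIC iterations and accuracy described above can be illustrated with a minimal Python sketch. It uses a textbook rotation-mode CORDIC in floating point; the function name `cordic_rotate`, the test angle, and the iteration counts are assumptions for illustration, and the paper's data-priority scheme for choosing the iteration count is not reproduced here.

```python
import math


def cordic_rotate(x, y, angle, iterations):
    """Rotate (x, y) by `angle` radians using `iterations` micro-rotations.

    Textbook rotation-mode CORDIC, valid for |angle| up to roughly 1.74 rad.
    Floats are used for clarity; hardware would use fixed-point shifts and
    adds, with the gain correction folded into a constant.
    """
    z = angle
    gain = 1.0
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate toward zero residual angle
        factor = 2.0 ** -i                   # a right shift by i bits in hardware
        x, y = x - d * y * factor, y + d * x * factor
        z -= d * math.atan(factor)
        gain *= math.sqrt(1.0 + factor * factor)
    return x / gain, y / gain


# Fewer iterations cost less work but leave a larger residual rotation error.
target = math.pi / 6
for n in (4, 8, 16):
    cx, sy = cordic_rotate(1.0, 0.0, target, n)
    error = math.hypot(cx - math.cos(target), sy - math.sin(target))
    print(n, error)
```

Each extra iteration roughly halves the worst-case residual angle, so lowering the iteration count for less important coefficients saves computation at a bounded loss of precision.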

33. Data Encoding Techniques for Reducing Energy Consumption in Network-on-Chip

As technology shrinks, the power dissipated by the links of a network-on-chip (NoC) starts to

compete with the power dissipated by the other elements of the communication subsystem,

namely, the routers and the network interfaces (NIs). In this paper, we present a set of data

encoding schemes aimed at reducing the power dissipated by the links of an NoC. The proposed

schemes are general and transparent with respect to the underlying NoC fabric (i.e., their

application does not require any modification of the routers and link architecture). Experiments

carried out on both synthetic and real traffic scenarios show the effectiveness of the proposed

schemes, which save up to 51% of power dissipation and 14% of energy consumption

without any significant performance degradation and with less than 15% area overhead in the NI.
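
The abstract does not spell out the encoding schemes themselves. As a stand-in, the Python sketch below shows classic bus-invert coding, a well-known way to cut the number of wire transitions on a link; it is not the paper's method, and the function names, the 8-bit width, and the sample words are all assumptions for illustration.

```python
def transitions(a, b):
    """Number of wires that toggle between consecutive link words a and b."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width):
    """Classic bus-invert coding (a stand-in, not the paper's scheme).

    A word is sent complemented, with an extra invert wire raised, whenever
    that toggles fewer wires than sending it unchanged.
    """
    mask = (1 << width) - 1
    prev_data, prev_inv = 0, 0               # assumed idle state of the link
    encoded = []
    for w in words:
        plain_cost = transitions(prev_data, w) + (prev_inv != 0)
        inv_cost = transitions(prev_data, ~w & mask) + (prev_inv != 1)
        if inv_cost < plain_cost:
            prev_data, prev_inv = ~w & mask, 1
        else:
            prev_data, prev_inv = w, 0
        encoded.append((prev_data, prev_inv))
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (data, invert) pairs."""
    mask = (1 << width) - 1
    return [(~d & mask) if inv else d for d, inv in encoded]


def count_toggles(pairs):
    """Total wire toggles, invert wire included, starting from the idle state."""
    total, prev_d, prev_i = 0, 0, 0
    for d, i in pairs:
        total += transitions(prev_d, d) + (prev_i != i)
        prev_d, prev_i = d, i
    return total


words = [0xFF, 0x00, 0xFE, 0x01]
coded = bus_invert_encode(words, 8)
print(count_toggles([(w, 0) for w in words]), count_toggles(coded))
```

Since link power grows with the number of toggling wires, fewer transitions translate into lower dynamic power, and the encoder and decoder can sit in the network interface without touching the routers.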

34. Achieving High-Performance On-Chip Networks With Shared-Buffer Routers

On-chip routers typically have buffers dedicated to their input or output ports for temporarily

storing packets in case contention occurs on output physical channels. Buffers, unfortunately,

consume significant portions of router area and power budgets. While running a traffic trace,

however, not all input ports of routers have incoming packets needed to be transferred

simultaneously. Therefore, a large number of buffer queues in the network are empty and other

queues are mostly busy. This observation motivates us to design a router architecture with shared

queues (RoShaQ), a router architecture that maximizes buffer utilization by allowing the sharing

of multiple buffer queues among input ports. Sharing queues, in fact, makes using buffers more

efficient hence is able to achieve higher throughput when the network load becomes heavy. On

the other side, at light traffic load, our router achieves low latency by allowing packets to

effectively bypass these shared queues. Experimental results on a 65-nm CMOS standard-cell

process show that over synthetic traffics RoShaQ has 17% less latency and 18% higher

saturation throughput than a typical virtual-channel (VC) router. Because of its higher

performance, RoShaQ consumes 9% less energy per transferred packet than a VC router given the

same buffer space capacity. Over real multitask applications and E3S embedded benchmarks

using the near-optimal NMAP mapping algorithm, RoShaQ has 32% lower latency than a VC router

and, targeting the same application throughput, 30% lower energy per packet.

35. Energy Efficiency Optimization Through Codesign of the Transmitter and Receiver

in High-Speed On-Chip Interconnects

A novel equalized global link architecture and driver-receiver codesign flow are proposed for

high-speed and low-energy on-chip communication by utilizing a continuous-time linear

equalizer (CTLE). The proposed global link is analyzed using a linear system method, and the

formula of CTLE eye opening is derived to provide high-level design guidelines and insights.

Compared with the separate driver-receiver design flow, over 50% energy reduction is observed.

The final optimal solution achieves 20-Gb/s signaling over 10 mm, 2.6-µm pitch on-chip

transmission line with 15.5-ps/mm latency and 0.196-pJ/b energy using 45-nm technology.

Monte Carlo simulation also shows that 3σ/µ for power and delay variation in the proposed

global link are 13.1% and 4.6%, respectively.
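
As a quick sanity check on the figures quoted above, the end-to-end latency and the link power at the full data rate follow from simple products. The symbols $t_{\text{link}}$ and $P_{\text{link}}$ are introduced here for illustration only and do not appear in the abstract; $P_{\text{link}}$ assumes the link signals continuously at 20 Gb/s.

```latex
\begin{align*}
t_{\text{link}} &= 15.5\ \text{ps/mm} \times 10\ \text{mm} = 155\ \text{ps} \\
P_{\text{link}} &= 0.196\ \text{pJ/b} \times 20\ \text{Gb/s} = 3.92\ \text{mW}
\end{align*}
```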
