

UNIVERSITÀ DEGLI STUDI DI BOLOGNA

FACOLTÀ DI SCIENZE MATEMATICHE FISICHE E NATURALI

DOTTORATO DI RICERCA IN FISICA, XIV ciclo

HARDWARE IMPLEMENTATION OF
DATA COMPRESSION ALGORITHMS
IN THE ALICE EXPERIMENT

Doctoral thesis by:
Dott. Davide Falchieri

Supervisors:
Prof. Maurizio Basile
Prof. Enzo Gandolfi

Coordinator:
Prof. Giovanni Venturi

Keywords: ALICE, data compression, CARLOS, wavelets, VHDL

Academic Year 2000/2001

Contents

Introduction

1 The ALICE experiment
1.1 The Inner Tracking System
1.1.1 Tracking in ALICE
1.1.2 Physics of the ITS
1.1.3 Layout of the ITS
1.2 Design of the drift layers
1.3 The SDDs (Silicon Drift Detectors)
1.4 SDD readout system
1.4.1 Front-end module
1.4.2 Event-buffer strategy
1.4.3 End-ladder module
1.4.4 Choice of the technology

2 Data compression techniques
2.1 Applications of data compression
2.2 Remarks on information theory
2.3 Compression techniques
2.3.1 Lossless compression
2.3.2 Lossy compression
2.3.3 Measures of performance
2.3.4 Modelling and coding
2.4 Lossless compression techniques
2.4.1 Huffman coding
2.4.2 Run Length encoding
2.4.3 Differential encoding
2.4.4 Dictionary techniques
2.4.5 Selective readout
2.5 Lossy compression techniques
2.5.1 Zero suppression
2.5.2 Transform coding
2.5.3 Subband coding
2.5.4 Wavelets
2.6 Implementation of compression algorithms

3 1D compression algorithm and implementations
3.1 Compression algorithms for SDD
3.2 1D compression algorithm
3.3 1D algorithm performance
3.3.1 Compression coefficient
3.3.2 Reconstruction error
3.4 CARLOS v1
3.4.1 Board description
3.4.2 CARLOS v1 design flow
3.4.3 Functions performed by CARLOS v1
3.4.4 Tests performed on CARLOS v1
3.5 CARLOS v2
3.5.1 The firstcheck block
3.5.2 The barrel shifter block
3.5.3 The fifo block
3.5.4 The event-counter block
3.5.5 The outmux block
3.5.6 The feesiu (toplevel) block
3.5.7 CARLOS-SIU interface
3.6 CARLOS v2 design flow
3.7 Tests performed on CARLOS v2

4 2D compression algorithm and implementation
4.1 2D compression algorithm
4.1.1 Introduction
4.1.2 How the 2D algorithm works
4.1.3 Compression coefficient
4.1.4 Reconstruction error
4.2 CARLOS v3 vs. the previous prototypes
4.3 The final readout architecture
4.4 CARLOS v3
4.5 CARLOS v3 building blocks
4.5.1 The channel block
4.5.2 The encoder block
4.5.3 The barrel15 block
4.5.4 The fifonew32x15 block
4.5.5 The channel-trigger block
4.5.6 The ttc-rx-interface block
4.5.7 The fifo-trigger block
4.5.8 The event-counter block
4.5.9 The outmux block
4.5.10 The trigger-interface block
4.5.11 The cmcu block
4.5.12 The pattern-generator block
4.5.13 The signature-maker block
4.6 Digital design flow for CARLOS v3
4.7 CARLOS layout features

5 Wavelet based compression algorithm
5.1 Wavelet based compression algorithm
5.1.1 Configuration parameters of the multiresolution algorithm
5.2 Multiresolution algorithm optimization
5.2.1 The Wavelet Toolbox from Matlab
5.2.2 Choice of the filters
5.2.3 Choice of the dimensionality, number of levels and threshold value
5.3 Choice of the architecture
5.3.1 Simulink and the Fixed-Point Blockset
5.3.2 Choice of the architecture
5.4 Multiresolution algorithm performance
5.5 Hardware implementation

Conclusions

Bibliography

Introduction

This thesis work has been aimed at the hardware implementation of data compression algorithms to be applied to High Energy Physics experiments. The amount of data that will be produced by the LHC experiments at CERN is of the order of 1 GByte/s. Cost constraints on magnetic tapes and on the data acquisition systems (optical fibres, readout boards) make it necessary to apply on-line data compression in the front-end electronics of the various detectors. This motivates the search for compression algorithms that achieve a high compression ratio while keeping the reconstruction error low: in fact, a high compression coefficient can only be achieved at the expense of some loss on the physical data.

The thesis describes the hardware implementation of compression algorithms applied to the ALICE experiment, in particular to the SDD (Silicon Drift Detector) readout chain. The total amount of data produced by the SDDs is 32.5 MBytes per event, while the space reserved on magnetic tapes for permanent storage is 1.5 MBytes. This means that the compression coefficient has to be at least 22. Besides that, since the p-p interaction rate is 1000 Hz, the data compression hardware has to complete its job within 1 ms. This leads to the search for compression algorithms with high performance in terms of both compression ratio and execution speed.

The thesis describes the design and implementation of 3 prototypes of the ASIC CARLOS (Compression And Run Length encOding Subsystem), which deals with the on-line data compression, packing and transmission to the standard ALICE data acquisition system. CARLOS v1 and v2 contain a one-dimensional compression algorithm based on threshold, run length encoding, differential encoding and Huffman coding techniques. CARLOS v3 was meant to contain a two-dimensional compression algorithm that obtains a better compression ratio than the 1D one, with a lower physical data loss. Nevertheless, for time reasons, the design of CARLOS v3 sent to the foundry contains a simple 1D look-up table based compression algorithm. The 2D algorithm is going to be implemented in the next prototype, which should be the final version of CARLOS. The first two prototypes have been tested with good results; the third one is currently being fabricated and its tests will begin in February 2002.

Besides that, the thesis contains a detailed study of a wavelet-based compression algorithm, which obtains encouraging results in terms of both compression ratio and reconstruction error. The algorithm may find a suitable application as a second-level compressor on SDD data, in case it becomes necessary to switch off the compression algorithm implemented on CARLOS.

The thesis is structured in the following way:

• Chapter 1 contains a description of the ALICE experiment, with emphasis on the SDD readout architecture.

• Chapter 2 contains an introduction to standard compression algorithms.

• Chapter 3 contains a description of the 1D algorithm developed at the INFN Section of Torino and of the two prototypes CARLOS v1 and v2.

• Chapter 4 focuses on the 2D compression algorithm and on the design and implementation of the prototype CARLOS v3.

• Chapter 5 contains a description of a wavelet-based compression algorithm especially tuned to reach high performance on SDD data, and its possible application as a second-level compressor in the counting room.

Chapter 1

The ALICE experiment

ALICE (A Large Ion Collider Experiment) [1] is an experiment at the Large Hadron Collider (LHC) [2] optimized for the study of heavy-ion collisions at a centre-of-mass energy of 5.5 TeV per nucleon. The main aim of the experiment is to study in detail the behaviour of nuclear matter at high densities and temperatures, in view of probing deconfinement and chiral symmetry restoration.

The detector [1, 3] consists essentially of two main components: the central part, composed of detectors mainly devoted to the study of hadronic signals and dielectrons, and the forward muon spectrometer, devoted to the study of quarkonia behaviour in dense matter. The layout of the ALICE set-up is shown in Fig. 1.1.

A major technical challenge is imposed by the large number of particles created in the collisions of lead ions. There is a considerable spread in the currently available predictions for the multiplicity of charged particles produced in a central Pb-Pb collision. The design of the experiment has been based on the highest value, 8000 charged particles per unit of rapidity at midrapidity. This multiplicity dictates the granularity of the detectors and their optimal distance from the colliding beams. The central part, which covers ±45° (|η| ≤ 0.9) over the full azimuth, is embedded in a large magnet with a weak solenoidal field. Outside the Inner Tracking System (ITS), there are a cylindrical TPC (Time Projection Chamber) and a large-area PID array of time-of-flight (TOF) counters. In addition, there are two small-area single-arm detectors: an electromagnetic calorimeter (Photon Spectrometer, PHOS) and an array of RICH counters optimized for high-momentum inclusive particle identification (HMPID).

Figure 1.1: Longitudinal section of the ALICE detector

My thesis work has been focused on data coming from one of the three detectors forming the ITS, the Silicon Drift Detector (SDD).

1.1 The Inner Tracking System

The basic functions of the ITS [4] are:

• determination of the primary vertex and of the secondary vertices necessary for the reconstruction of charm and hyperon decays;

• particle identification and tracking of low-momentum particles;

• improvement of the momentum and angle measurements of the TPC.

1.1.1 Tracking in ALICE

Track finding in heavy-ion collisions at the LHC presents a big challenge, because of the extremely high track density. In order to achieve a high granularity and a good two-track separation, ALICE uses three-dimensional hit information wherever feasible, with many points on each track and a weak magnetic field. The ionization density of each track is measured for particle identification. The need for a large number of points on each track has led to the choice of a TPC as the main tracking system. In spite of its drawbacks concerning speed and data volume, only this device can provide reliable performance for a large volume at up to 8000 charged particles per unit of rapidity. The minimum possible inner radius of the TPC (r_in = 90 cm) is given by the maximum acceptable hit density. The outer radius (r_out = 250 cm) is determined by the minimum length required for a dE/dx resolution better than 10%. At smaller radii, and hence larger track densities, tracking is taken over by the ITS.

The ITS consists of six cylindrical layers of silicon detectors. The number and position of the layers are optimized for efficient track finding and impact parameter resolution. In particular, the outer radius is determined by the track matching with the TPC, and the inner one is the minimum compatible with the radius of the beam pipe (3 cm). The silicon detectors feature the high granularity and excellent spatial precision required.

Because of the high particle density, up to 90 cm⁻², the four innermost layers (r ≤ 24 cm) must be truly two-dimensional devices. For this task, silicon pixel and silicon drift detectors were chosen. The outer two layers at r = 45 cm, where the track densities are below 1 cm⁻², are equipped with double-sided silicon micro-strip detectors. With the exception of the two innermost pixel planes, all layers have analog readout for particle identification via a dE/dx measurement in the non-relativistic region. This gives the inner tracking system a stand-alone capability as a low-pt particle spectrometer.

1.1.2 Physics of the ITS

The ITS will contribute to the track reconstruction by improving the momentum resolution obtained by the TPC. This will be beneficial for practically all the physics topics addressed by the ALICE experiment. The global event features will be studied by measuring the multiplicity distributions and the inclusive particle spectra. For the study of resonance production (ρ, ω and φ) and, more important, of the behaviour of the mass and width of these mesons in the dense medium, the momentum resolution is even more important: we have to achieve a mass precision comparable to, or better than, the natural width of the resonances in order to observe changes of their parameters caused by chiral symmetry restoration. Also the mass resolution for heavy states, like D mesons, J/ψ and Υ, will be better, thus improving the signal-to-background ratio in the measurement of open charm production and in the study of heavy-quarkonia suppression. Improved momentum resolution will also enhance the performance in the observation of another hard phenomenon, jet production and the predicted jet quenching, i.e. the energy loss of partons in strongly interacting dense matter.

The low-momentum particles (below 100 MeV/c) will be detectable only by the ITS. This is of interest in itself, because it widens the momentum range for the measurement of particle spectra, which allows collective effects associated with large length scales to be studied. In addition, a low-pt cut-off is essential to suppress the soft gamma conversions and the background in the electron-pair spectrum due to Dalitz pairs. Also the PID capabilities of the ITS in the non-relativistic (1/β²) region will therefore be of great help.

In addition to the improved momentum resolution, which is necessary for identical-particle interferometry, especially at low momenta, the ITS will contribute to this study through an excellent double-hit resolution enabling the separation of tracks with close momenta. In order to be able to study particle correlations in the three components of their relative momenta, and hence to get information about the space-time evolution of the system produced in heavy-ion collisions at the LHC, we need sufficient angular resolution in the measurement of the particle's direction. Two of the three components of the relative momentum (the side and longitudinal ones) depend crucially on the precision with which the particle direction is known. The angular resolution is determined by the precise ITS measurements of the primary vertex position and of the first points on the tracks. The particle identification at low momenta will enhance the physics capability by allowing the interferometry of individual particle species as well as the study of non-identical particle correlations, the latter giving access to the emission time of different particles.

The study of strangeness production is an essential part of the ALICE physics program. It will allow the level of chemical equilibration and the density of strange quarks in the system to be established. The measurement will be performed by charged kaon identification and hyperon detection, based on the ITS capability to recognize secondary vertices. The observation of multi-strange hyperons (Ξ− and Ω−) is of particular interest, because they are unlikely to be produced during the hadronic rescattering, due to the high energy threshold for their production. In this way we can obtain information about the strangeness density of the earlier stage of the collision.

Open charm production in heavy-ion collisions is of great physics interest. Charmed quarks can be produced in the initial hard parton scattering and then only at the very early stages of the collision, while the energy in parton rescattering is above the charm production threshold. The charm yield is not altered later. The excellent performance of the ITS in finding the secondary vertices close to the interaction point gives us the possibility to detect D mesons by reconstructing the full decay topology.

1.1.3 Layout of the ITS

A general view of the ITS is shown in Fig. 1.2. The system consists of six cylindrical layers of coordinate-sensitive detectors, covering the central rapidity region (|η| ≤ 0.9) for vertices located within the length of the interaction diamond (2σ), i.e. 10.6 cm along the beam direction (z). The detectors and front-end electronics are held by lightweight carbon-fibre structures. The geometrical dimensions and the main features of the various layers of the ITS are summarized in Table 1.1.

Figure 1.2: ITS layers

The granularity required for the innermost planes is achieved with silicon micro-pattern detectors with true two-dimensional readout: Silicon Pixel Detectors (SPD) and Silicon Drift Detectors (SDD). At larger radii the requirements in terms of granularity are less stringent, therefore double-sided Silicon Strip Detectors (SSD) with a small stereo angle are used. Double-sided microstrips have been selected rather than single-sided ones because they introduce less material in the active volume. In addition, they offer the possibility to correlate the pulse heights read out from the two sides, thus helping to resolve the ambiguities inherent in the use of detectors with projective readout. The main parameters of the three detector types (spatial precision, two-track resolution, cell size, number of channels of an individual detector, total number of electronic channels) are shown in Table 1.1.

Parameter                              Pixel       Drift        Strip
Spatial precision rφ (µm)              12          38           20
Spatial precision z (µm)               70          28           830
Two-track resolution rφ (µm)           100         200          300
Two-track resolution z (µm)            600         600          2400
Cell size (µm²)                        50 × 300    150 × 300    95 × 40000
Active area (mm²)                      13.8 × 82   72.5 × 75.3  73 × 40
Readout channels per module            65536       2 × 256      2 × 768
Total number of modules                240         260          1770
Total number of readout channels (k)   15729       133          2719
Total number of cells (M)              15.7        34           2.7
Average occupancy, inner layer (%)     1.5         2.5          4
Average occupancy, outer layer (%)     0.4         1.0          3.3

Table 1.1: Main features of the ITS detectors

The large number of channels in the layers of the ITS requires a large number of connections from the front-end electronics to the detector and to the data acquisition system. The requirement for a minimum of material within the acceptance does not allow the use of conventional copper cables near the active surfaces of the detection system; therefore Tape Automated Bonding (TAB) aluminium multilayer microcables are used.

The detectors and their front-end electronics produce a large amount of heat which has to be removed while keeping a very high degree of temperature stability. In particular, the SDDs are sensitive to temperature variations at the 0.1 °C level. For these reasons, particular care was taken in the design of the cooling system and of the temperature monitoring. A water cooling system at room temperature is the chosen solution for all the ITS layers, but the use of other liquid coolants is still being considered. For the temperature monitoring, dedicated integrated circuits are mounted on the readout boards and specific calibration devices are integrated in the SDDs.

The outer four layers of the ITS detectors are assembled onto a mechanical structure made of two end-cap cones connected by a cylinder placed between the SSD and the SDD layers. Both the cones and the cylinder are made of lightweight sandwiches of carbon-fibre plies and Rohacell™. The carbon-fibre structure also includes the appropriate mechanical links to the TPC and to the SPD layers. The latter are assembled in two half-cylinder structures, specifically designed for safe installation around the beam pipe. The end-cap cones provide the cabling and cooling connections of the six ITS layers to the outside services.

Figure 1.3: SDD prototype: 1) active area, 2) guard area.

1.2 Design of the drift layers

SDDs (a picture is shown in Fig. 1.3) have been selected to equip the two intermediate layers of the ITS, since they couple a very good multi-track capability with dE/dx information. At least three measured samples per track, and therefore at least four layers carrying dE/dx information, are needed. The SDDs, with a 7.25 × 7.53 cm² active area each, will be mounted on linear structures called ladders, each holding six detectors for layer 3 and eight detectors for layer 4 (see Fig. 1.4).

Figure 1.4: Longitudinal section of ITS layer 3 and layer 4

The layers will sit at average radii of 14.9 and 23.8 cm from the beam pipe and will be composed of 14 and 22 ladders respectively. The front-end electronics will be mounted on rigid heat-exchanging hybrids, which in turn will be connected to cooling pipes running along the ladder structure. The connections between the detectors and the front-end electronics, and between both and the ends of the ladders, will be assured by flexible TAB-bonded Al microcables, which will carry both the data and the power supply lines. Each detector will first be assembled together with its front-end electronics and high-voltage connections as a unit, hereafter called a module, which will be fully tested before it is mounted on the ladder.

Figure 1.5: Working mode of an SDD detector (p+ field strips on both wafer surfaces, n+ collection anodes, x-y-z axes)

1.3 The SDDs (Silicon Drift Detectors)

SDDs, like gaseous drift detectors, exploit the measurement of the transport time of the charge deposited by a traversing particle to localize the impact point in two dimensions, thus enhancing resolution and multi-track capability at the expense of speed. They are therefore well suited to this experiment, in which very high particle multiplicities are coupled with relatively low event rates (up to some kHz). A linear SDD, shown schematically in Fig. 1.5, has a series of parallel implanted p+ field strips, connected to a voltage divider, on both surfaces of the high-resistivity n-type silicon wafer. The voltage divider is integrated on the detector substrate itself. The field strips provide the bias voltage to fully deplete the volume of the detector, and they generate an electrostatic field parallel to the wafer surface, thus creating a drift region (see Fig. 1.6).

Figure 1.6: Potential energy of electrons (negative electric potential) on the y-z plane of the device

Electron-hole pairs are created by the charged particles crossing the detector. The holes are collected by the nearest p+ electrode, while the electrons are focused into the middle plane of the detector and driven by the drift field towards the edge of the detector, where they are collected by an array of anodes composed of n+ pads. Thus an electronic charge cloud drifts from the impact point to the anode region: the cloud shows a bell-shaped Gaussian distribution that, owing to diffusion and mutual repulsion, becomes broader and lower during the drift [5] (see Fig. 1.7). In this way a charge cloud can be collected by one or more anodes, depending on the charge released by the ionizing particle and on the impact position with respect to the anode region. The small size of the anodes, and hence their small capacitance (50 fF), implies low noise and good energy resolution.

The coordinate perpendicular to the drift direction is given by the centroid of the collected charge. The coordinate along the drift direction is measured by the centroid of the signal in the time domain, taking into account the amplifier response. A space precision, averaged over the full detector surface, better than 40 µm in both coordinates has been obtained during beam tests of full-size prototype detectors. Each SDD module is divided in two half-detectors: each half-detector contains on the external side 256 anodes at a distance of 300 µm from one another. Thus each SDD detector contains 2 × 256 readout channels; taking into account that layers 3 and 4 contain 260 SDD modules, the total number of SDD readout channels is around 133k.
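As an illustration of this centroid reconstruction, the sketch below computes the two coordinates from a toy 256 × 256 charge cluster. This is plain Python/NumPy; the 300 µm anode pitch and the 25 ns sampling period come from the text, while the Gaussian test cloud and its position are purely illustrative (this is not the ALICE reconstruction code):

    import numpy as np

    # Toy half-detector event: 256 anodes x 256 time samples (ADC counts).
    a, t = np.meshgrid(np.arange(256), np.arange(256), indexing="ij")
    # Hypothetical Gaussian charge cloud centred at anode 120.3, time bin 80.7.
    cluster = 200.0 * np.exp(-((a - 120.3) ** 2 + (t - 80.7) ** 2) / (2 * 2.0 ** 2))

    total = cluster.sum()
    anode_centroid = (a * cluster).sum() / total   # coordinate across the drift
    time_centroid = (t * cluster).sum() / total    # coordinate along the drift

    print(anode_centroid * 300)   # anode pitch 300 um -> position in um
    print(time_centroid * 25)     # 40 MHz sampling -> drift time in ns

In the real detector the drift-time centroid is then converted into a position through the drift velocity, after taking the amplifier response into account.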

Figure 1.7: Charge distribution evolution scheme (charge cloud along the drift, anode axis vs. time axis)

1.4 SDD readout system

The system requirements for the SDD readout system derive both from the features of the detector and from the ALICE experiment in general. The following points are crucial in the definition of the final readout system:

– The signal generated by the SDD is a Gaussian-shaped current signal, with variable sigma and charge (5-30 ns and 4 to 32 fC), and can be collected by one or more anodes. Therefore the front-end electronics should be able to handle analog signals in a wide dynamic range; the system noise should be very low while the system remains able to handle large signals.

– The amount of data generated by the SDD is very large: each half-detector has 256 anodes, and for each anode 256 time samples have to be taken in order to cover the full drift length.

– The small space available on the ladder and the constraints on material impose an architecture which minimizes cabling.

– The radiation environment in which the front-end electronics has to work imposes the choice of a radiation-tolerant technology library for the implementation of the electronics.

Figure 1.8: SDD ladder electronics (SDD detectors and PASCAL-AMBRA front-end modules on the ladder; end-ladder module with CARLOS and SIU; test and slow control lines)

The chosen SDD readout electronics, shown in Fig. 1.8, consists of front-end modules and end-ladder modules. The front-end module performs analog data acquisition, A/D conversion and buffering, while the end-ladder module contains high-voltage and low-voltage regulators and a chip for data compression and for interfacing the ALICE DAQ system.

Figure 1.9: The front-end readout unit

1.4.1 Front-end module

The front-end modules, one per half-detector, are distributed along the ladders together with the SDD modules. Each front-end module contains 4 PASCAL (Preamplifier, Analog Storage and Conversion from Analog to digitaL) - AMBRA (A Multievent Buffer Readout Architecture) chip pairs, as shown in Fig. 1.9. The PASCAL chips are TAB-bonded directly on the SDD output anodes, while the AMBRA chips are connected to CARLOS (Compression And Run Length encOding Subsystem) via an 8-bit bus.

Each PASCAL chip contains three functional blocks (see Fig. 1.10):

– low-noise preamplifiers (64 of them, one per anode);

– an analog memory working at a 40 MHz clock frequency (64 × 256 cells);

– 10-bit analog-to-digital converters (64 ADCs, one per channel).

Figure 1.10: PASCAL chip architecture (preamplifiers on data_in[0..63], analog memory with its control unit, the ADCs with buffering and multiplexing, and an interface control unit handling start_op, end_op, write_req, write_ack, jtag_bus, pa_cal, clock, reset and data_out)

During the write phase, i.e. when no trigger signal has been received, the preamplifiers continuously write the samples into the analog memory cells at 40 MHz, while the ADCs are in stand-by mode. When PASCAL receives a trigger signal from CARLOS (which receives it from the Central Trigger Processor, CTP), a control logic module on the PASCAL chip stops the analog memory write phase, freezes its contents and starts the read phase, performed in two steps: first the ADCs are set to sample mode and the analog memory reads out the first sample of each anode row; then, after the memory settling time, the ADCs switch to conversion mode and the analog data are converted to digital with a successive approximation technique. When the conversion is finished, the control logic module on PASCAL starts the readout of the next sample from the analog memory and, at the same time, sends the 64 digital words to the AMBRA chip over a 40-bit wide bus. The read phase goes on until the whole analog memory content has been converted to digital values, or until an abort signal comes from CARLOS (again receiving it from the CTP), meaning that the event has to be discarded.

Input range   Output codes      Code mapping   Bits lost
0-127         from 128 to 128   0xxxxxxx       0
128-255       from 128 to 32    100xxxxx       2
256-511       from 256 to 32    101xxxxx       3
512-1023      from 512 to 64    11xxxxxx       3

Table 1.2: Digital compression from 10 to 8 bits

The AMBRA chip has mainly two functions: first, AMBRA has to compress the data from 10 to 8 bits per sample; then it has to store the input data stream into a digital buffer. The principle used for compression is to decrease the resolution of the larger signals with a logarithmic or square-root law, using the mapping shown in Table 1.2. Since the larger signals have a better signal-to-noise ratio than the smaller ones, the accuracy of the measurement is not affected.

The 4 AMBRA chips are static RAMs able to contain 256 KBytes, and can therefore temporarily store 4 complete half-SDD events (one event corresponds to 256 × 256 Bytes = 64 KBytes). Simultaneous read and write accesses are allowed, so while the PASCAL chips are transferring data to the AMBRA chips, the AMBRA chips can send data belonging to another event to the CARLOS chip. Since four AMBRA chips have to transmit data over a single 8-bit bus, an arbitration mechanism has been implemented.
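The mapping of Table 1.2 is simple enough to state exactly. Below is a minimal sketch of it, together with an approximate inverse that re-centres each code on its bin; it is written in plain Python and reconstructed from the table, not taken from the AMBRA specification:

    def compress_10to8(v):
        """Map a 10-bit sample (0-1023) to 8 bits as in Table 1.2."""
        if v < 128:
            return v                                # 0xxxxxxx, lossless
        if v < 256:
            return 0b10000000 | ((v >> 2) & 0x1F)   # 100xxxxx, 2 LSBs lost
        if v < 512:
            return 0b10100000 | ((v >> 3) & 0x1F)   # 101xxxxx, 3 LSBs lost
        return 0b11000000 | ((v >> 3) & 0x3F)       # 11xxxxxx, 3 LSBs lost

    def expand_8to10(c):
        """Approximate inverse: put the code back near the centre of its bin."""
        if c < 0b10000000:
            return c
        if c < 0b10100000:
            return (((c & 0x1F) + 32) << 2) + 2
        if c < 0b11000000:
            return (((c & 0x1F) + 32) << 3) + 4
        return (((c & 0x3F) + 64) << 3) + 4

For example, compress_10to8(300) gives 0xA5, and expand_8to10(0xA5) returns 300, i.e. the centre of the 8-count bin [296, 303] that all maps to the same code.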

1.4.2 Event-buffer strategy

The dead time due to the SDD readout system is around 358.4 µs: this is, in fact, the time needed for reading a cell of the analog memory and converting it into a digital word, 1.4 µs, multiplied by the number of cells, 256. This means that a new trigger signal will not be accepted before 358.4 µs have passed after the previous event. Every 1.4 µs each detector produces 512 bytes of data, so at least ten 8-bit buses per detector working at 40 MHz would be required for the data transfer. Unfortunately the space on the ladder is very limited, and managing 80 data lines for each detector (for a total of 320 for the half-ladder) is a very serious problem, especially for the input connections to the end-ladder readout units.

The adopted solution, inserting a digital multi-event buffer on the front-end readout unit between PASCAL and CARLOS, allows data to be sent towards the end-ladder unit at a lower speed: if another event arrives while data are being transmitted from AMBRA to CARLOS, another digital buffer on AMBRA is ready to accept the data coming from PASCAL. Data are transferred from AMBRA to CARLOS over an 8-bit bus in 1.65 ms (25 ns × 64 Kwords) while other events are processed by PASCAL and sent to AMBRA. For an average Pb-Pb event rate of 40 Hz and using a double-event digital buffer, our simulations indicate that the dead time due to buffer overrun is only 0.1% of the total time. This is the fraction of time during which AMBRA is transferring data to CARLOS while the other buffer in AMBRA is full: in this situation a BUSY signal is asserted towards the CTP, meaning that no further trigger can be accepted. In order to reach a much smaller amount of dead time even at higher event rates, a decision was taken to have a 4-buffer-deep AMBRA device.
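These timing figures follow directly from the numbers quoted above; a quick check in plain Python:

    cells = 256          # analog-memory cells per anode
    t_cell = 1.4e-6      # read + conversion time per cell (s)
    print(cells * t_cell * 1e6)    # 358.4 us of dead time per trigger

    words = 64 * 1024    # one half-SDD event in 8-bit words (64 KBytes)
    t_word = 25e-9       # one word per 40 MHz clock cycle on the 8-bit bus
    print(words * t_word * 1e3)    # ~1.64 ms AMBRA -> CARLOS (quoted as 1.65 ms)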

In order to allow the full testability of the readout electronics at the board and system levels, the ASICs embody a JTAG standard interface. In this way it is possible to test each chip after the various assembly stages, as well as during the run phase, in order to check correct functionality. The same interface is used to download control information into the chips.

Radiation-tolerant deep-submicron processes (0.25 µm) have been used for the final versions of the ASICs. These technologies are now available and allow us to reduce size and power consumption with no degradation of the signal processing speed. Moreover, it has been shown that, when specific layout techniques are used, they have a better resistance to radiation than commercially available technologies.

Layer   Ladders   Detectors/ladder   Data/ladder   Total data
3       14        6                  768 KBytes    10.5 MBytes
4       22        8                  1 MByte       22 MBytes
Both                                               32.5 MBytes

Table 1.3: Total amount of data produced by SDDs

1.4.3 End-ladder module

The end-ladder modules are located at both ends of each ladder (2 per ladder); they receive data from the front-end modules, perform data compression with the CARLOS chip and send the data to the DAQ through an optical fibre link. Besides that, the end-ladder board will host the TTCrx device, a chip receiving the global clock and trigger signals from the CTP and distributing them to PASCAL, AMBRA and CARLOS, as well as the power regulators for the complete ladder system.

CARLOS receives 8 data streams coming from 8 half-detectors, i.e. from one half-ladder, for a total data volume of 64 KBytes × 8 = 512 KBytes, at an input rate of 320 MByte/s. Taking into account the number of ladders and of detectors per ladder (see Table 1.3), the total volume of data produced by all the SDD modules amounts to around 32.5 MBytes per event, while the space reserved on disk for permanent storage is 1.5 MBytes. This implies using a compression algorithm with a compression coefficient of at least 22 and a reconstruction error as low as possible, in order to minimize the loss of physical information. Moreover, since the trigger rate in proton-proton interactions amounts to 1 kHz, each event should be compressed and sent to the DAQ system within 1 ms. Actually, thanks to the buffering provided by the AMBRA chips, this processing time doubles to 2 ms, thus relaxing the timing constraint on the CARLOS chip.

These constraints led us to the design and implementation of a first prototype of CARLOS. Then the desire to obtain better compression performance, together with changes in the readout architecture due to the presence of radiation, led us to the design and implementation of two more CARLOS prototypes. We are now going to design CARLOS v4, which is intended to be the final version of the compression ASIC. The first 3 prototypes of CARLOS are explained in detail in chapters 3 and 4, while chapter 2 contains a review of existing compression techniques.

1.4.4 Choice of the technology

The effects of radiation on electronic circuits can be divided into total dose effects and single event effects [6]. Total dose modifies the thresholds of MOS transistors and increases leakage currents. This is of particular concern in leakage-sensitive analog circuits, like analog memories. For instance, assuming for the storage capacitors in the memory a value of 1 pF, a leakage current as small as 1 nA would change the value of the stored information by 0.2 V in 200 µs. This is of course unacceptable.
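The 0.2 V figure is simply ΔV = I·t/C; a one-line check in plain Python:

    I_leak = 1e-9    # leakage current: 1 nA
    C = 1e-12        # storage capacitor: 1 pF
    t = 200e-6       # hold time: 200 us
    print(I_leak * t / C)   # 0.2 V droop of the stored voltage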

Radiation-tolerant layout practices prevent this risk, and their use in analog circuits is therefore recommended. These design techniques become extremely effective in deep-submicron CMOS technologies. Single event effects can trigger latch-up phenomena or can change the value of digital bits (Single Event Upset, SEU). Latch-up can be prevented with the systematic use of guard rings in the layout. Single event upsets can be a problem especially when they occur in the digital control logic, and can be prevented by layout techniques or by redundancy in the system.

Radiation-tolerant layouts of course carry area penalties. It can be estimated that, in a given technology, a minimum-size inverter with a radiation-tolerant layout is 70% bigger than the corresponding inverter with a standard layout. Nevertheless, a radiation-tolerant inverter in a quarter-micron technology is about eight times smaller than a standard inverter in a 0.8 µm technology. The radiation dose which will be received by the readout electronics will be quite low, below 100 Krad in 10 years. This value is probably below the limit of what a standard technology can afford; however, conservative considerations suggested the use of radiation-tolerant techniques for the critical parts of the circuit. These techniques have been proven to work up to 30 MRad, and they carry a lower area penalty and a lower cost compared with radiation-hard processes. Thus the library chosen for the implementation of the PASCAL, AMBRA and CARLOS chips is the 0.25 µm IBM technology, with standard cells designed at CERN to be radiation tolerant.

Chapter 2

Data compression techniques

Data compression [7] is the art or science of representing information in a compact form. These compact representations are created by identifying and exploiting structures that exist in the data. Data can be characters in a text file, numbers that are samples of speech or image waveforms, or sequences of numbers generated by physical processes.

Data compression plays an important role in many fields, for example in the transmission of digital television signals. If we wanted to transmit an HDTV (High Definition TeleVision) signal without any compression, we would need to transmit about 884 Mbits/s. Using data compression, we need to transmit less than 20 Mbits/s along with the audio information. Compression is now very much a part of everyday life. If you use computers, you are probably using a variety of products that make use of compression. Most modems now have compression capabilities that allow them to transmit data many times faster than otherwise possible. File compression utilities, which allow us to store more on our disks, are now commonplace.

This chapter contains an introduction to data compression with a description of the most commonly used compression algorithms, with the aim of finding the most suitable compression technique for the physical data coming out of the SDD.

2.1 Applications of data compression

An early example of data compression is the Morse code, developed by Samuel Morse in the mid-19th century. Letters sent by telegraph are encoded with dots and dashes. Morse noticed that certain letters occurred more often than others. In order to reduce the average time required to send a message, he assigned shorter sequences to letters that occur more frequently, such as a (· −) and e (·), and longer sequences to letters that occur less frequently, such as q (− − · −) or j (· − − −). What is being used to provide compression in the Morse code is the statistical structure of the message to compress, i.e. the fact that the message contains some letters with a higher probability of occurrence than others. Most compression techniques exploit the statistical structure of the input to provide compression, but this is not the only kind of structure that exists in the data.

There are many other kinds of structure, in data of different types, that can be exploited for compression. Let us take speech as an example. When we speak, the physical construction of our voice box dictates the kinds of sounds that we can produce; that is, the mechanics of speech production impose a structure on speech. Therefore, instead of transmitting the sampled speech itself, we could send information about the conformation of the voice box, which could be used by the receiver to synthesize the speech. An adequate amount of information about the conformation of the voice box can be represented much more compactly than the sampled values of the speech. This compression approach is currently being used in a number of applications, including the transmission of speech over mobile radios and the synthetic voice in toys that speak.

Data compression can also take advantage of redundant structure in the input signal, that is, structure carrying more information than needed. For example, if a sound has to be transmitted in order to be heard by a human being, all frequencies below 20 Hz and above 20 kHz can be eliminated (thus providing compression), since these frequencies cannot be perceived by humans.

2.2 Remarks on information theory

Without going into details, we just want to recall Shannon's theorem [8]. He defines the information content of a message in the following way: given a message made up of N characters in total and containing n different symbols, the information content of the message, measured in bits, is

I = N \sum_{i=1}^{n} \left( -p_i \log_2 p_i \right)    (2.1)

where p_i is the occurrence probability of symbol i. What is regarded as a symbol depends on the application: it might be an ASCII code, a 16- or 32-bit word, a word in a text and so on.

A practical illustration of Shannon's theorem is the following. Let us assume we measure a charge, or any other physical quantity, using an 8-bit digitizer. Very often the measured quantities will be distributed approximately exponentially. Let us assume that the mean value of the statistical distribution is one tenth of the dynamic range, i.e. 25.6. Each value between 0 and 255 is regarded as a symbol. Applying Shannon's formula with n = 256 and

p_i = \frac{1}{25.6} \, e^{-(i+0.5)/25.6}

we obtain a mean information content I/N of 6.11 bits per measured value, which is almost 25% less than the 8 bits we need when saving the data as a sequence of bytes. Even if we had increased the dynamic range by a factor of 4 using a 10-bit ADC, it turns out that the mean information content expressed as a number of bits per measurement would have been virtually the same, and hence the possible compression gain even higher (39%). This might be surprising, but considering that an exponential distribution delivers a value beyond ten times the mean only once every e^{10} ≈ 22026 samples, it is clear that even using a quite long code for such rare measurements cannot have an appreciable influence on the compression rate. Considering that in all likelihood, in a realistic architecture, we would have had to expand the 10 bits to 16, the gain is an impressive 62% in the latter case.

The exponential distribution is a good approximation of the raw data in many cases, and in particular of the data coming out of the SDD. Comparing various probability distributions with the same RMS, it seems that the exponential distribution is particularly hard to compress. For instance, a discrete spectrum distributed according to a Gaussian with the same RMS as the above exponential has an information content of only 4.75 bits.
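The 6.11-bit figure can be reproduced directly from Eq. (2.1); a minimal numerical check in plain Python, with the truncated exponential renormalized over the 256 codes:

    import math

    mean = 25.6
    p = [math.exp(-(i + 0.5) / mean) / mean for i in range(256)]
    s = sum(p)                         # ~1 - e^-10: tail lost to truncation
    p = [x / s for x in p]
    H = -sum(x * math.log2(x) for x in p)
    print(H)                           # ~6.1 bits, cf. the 6.11 quoted above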

2.3 Compression techniques

When we speak of a compression technique or a compression algorithm, we actually refer to two algorithms: the first one takes an input X and generates a representation X_C that requires fewer bits; the second one is a reconstruction algorithm that operates on the compressed representation X_C to generate the reconstruction Y. Based upon the requirements of reconstruction, data compression schemes can be divided into two broad classes:

– lossless compression schemes, in which Y is identical to X;

– lossy compression schemes, which generally provide much higher compression than lossless ones, but force Y to be different from X.

In fact, Shannon showed that the best performance achievable by a lossless compression algorithm is to encode a stream with an average number of bits equal to the I/N value. On the contrary, lossy algorithms have no upper bound on the compression ratio.

2.3.1 Lossless compression

Lossless compression techniques involve no loss of information. If data have been losslessly compressed, the original data can be recovered exactly from the compressed data. Lossless compression is generally used for discrete data, such as text, computer-generated data and some kinds of image and video information. There are many situations that require compression where we want the reconstruction to be identical to the original. There are also a number of situations in which it is possible to relax this requirement in order to get more compression: in these cases lossy compression techniques have to be used.

2.3.2 Lossy compression

Lossy compression techniques involve some loss of information, and data that have been compressed using lossy techniques generally cannot be recovered or reconstructed exactly. In return for accepting distortion in the reconstruction, we can generally obtain much higher compression ratios than is possible with lossless compression. Whether the distortion introduced is acceptable or not depends on the specific application: for instance, if the input source X contains a physical signal plus noise, while the output Y contains only the physical signal, the distortion introduced is completely acceptable.

2.3.3 Measures of performance

A compression algorithm can be evaluated in a number of different ways. We could measure the relative complexity of the algorithm, the memory required to implement it, how fast it performs on a given machine or on dedicated hardware, the amount of compression and how closely the reconstruction resembles the original. The last two features are the most important ones for our application to SDD data.

A very logical way of measuring how well a compression algorithm compresses a given set of data is to look at the ratio of the number of bits required to represent the data before compression to the number of bits required after compression. This ratio is called compression ratio. Suppose we store an image made up of a square array of 256 × 256 8-bit pixels (exactly like a half SDD): it requires 64 KBytes. If the compressed image requires only 16 KBytes, we would say that the compression ratio is 4.

Another way of reporting compression performance is to provide the average number of bits required to represent a single sample. This is generally referred to as the rate. For instance, for the same image described above, the average number of bits per pixel in the compressed representation is 2: thus the rate is 2 bits/pixel.

In lossy compression the reconstruction differs from the original data.

Therefore, in order to determine the efficiency of a compression algo-

rithm, we have to find some way to quantify the difference. The dif-

ference between the original data and the reconstructed ones is often

called distortion. This value is usually calculated as an absolute or

percentage difference between the data before and after compression.

2.3.4 Modelling and coding

The development of data compression algorithms for a variety of data

can be divided into two steps. The first phase is usually referred to

as modelling. In this phase we try to extract information about any

redundancy that exists in the data and describe the redundancy in the

form of a model. The second phase is called coding. The description of

the model and a description of how the data differ from the model are

encoded, generally using a binary alphabet.


2.4 Lossless compression techniques

This section contains an explanation of the most widely used lossless

compression techniques. In particular the following items are covered:

– Huffman coding;

– run length encoding;

– differential encoding;

– dictionary techniques;

– selective readout.

Some of these algorithms have been chosen for direct application in the

1D compression algorithm implemented in the prototypes CARLOS v1

and v2.

2.4.1 Huffman coding

The Huffman-based compression algorithm [7] encodes data samples in this

way: symbols that occur more frequently (i.e. symbols having a higher

probability of occurrence) are given shorter codewords than symbols

that occur less frequently. This leads to a variable-length coding

scheme, in which each symbol can be encoded with a different number

of bits. The choice of the code to assign to each symbol or, in other

words, the design of the Huffman look-up table is carried out with stan-

dard criteria.

An example can better explain this procedure. Suppose we have 5 symbols,

a1, a2, a3, a4 and a5, each one with a probability of occurrence, P (a1) =

0.2, P (a2) = 0.4, P (a3) = 0.2, P (a4) = 0.1, P (a5) = 0.1; at first, in

order to write down the encoding c(ai) of each symbol ai, it is necessary

to order the symbols from the most probable to the least probable, as

shown in Tab. 2.1.


Data Probability Code

a2 0.4 c(a2)

a1 0.2 c(a1)

a3 0.2 c(a3)

a4 0.1 c(a4)

a5 0.1 c(a5)

Table 2.1: Sample data and probability of occurrence

The least probable symbols are a4 and a5; they are assigned the following

codes:

c(a4) = α1 ∗ 0 (2.2)

c(a5) = α1 ∗ 1 (2.3)

where α1 is a generic binary string and ∗ represents the concatenation

between two strings.

If a′4 is a symbol for which the relationship P (a′4) =

P (a4) + P (a5) = 0.2 holds, then the data in Tab. 2.1 can be reordered

from the most probable to the least probable, as shown in Tab. 2.2.

Data Probability Code

a2 0.4 c(a2)

a1 0.2 c(a1)

a3 0.2 c(a3)

a′4 0.2 α1

Table 2.2: Introduction of data a′4

In this table the lowest probability entries are a3 and a′4, so they can be

encoded in the following way:

c(a3) = α2 ∗ 0 (2.4)

c(a′4) = α2 ∗ 1 (2.5)

Nevertheless, since c(a′4) = α1 from Tab. 2.2, it follows from (2.5)

that α1 = α2 ∗ 1, and then (2.2) and (2.3) become:

c(a4) = α2 ∗ 10 (2.6)

c(a5) = α2 ∗ 11 (2.7)

Defining a′3 as the symbol for which P (a′3) = P (a3) + P (a′4) = 0.4, the

data from Tab. 2.2 can be reordered from the most probable to the least

probable, as shown in Tab. 2.3.

Data Probability Code

a2 0.4 c(a2)

a′3 0.4 α2

a1 0.2 c(a1)

Table 2.3: Introduction of data a′3

In Tab. 2.3 the lowest probability entries are a′3 and a1, so they can be

encoded in the following way:

c(a′3) = α3 ∗ 0 (2.8)

c(a1) = α3 ∗ 1 (2.9)

Since c(a′3) = α2, from Tab. 2.3, it follows from (2.8) that α2 = α3 ∗ 0,

so (2.4), (2.6) and (2.7) become:

c(a3) = α3 ∗ 00 (2.10)

c(a4) = α3 ∗ 010 (2.11)

c(a5) = α3 ∗ 011 (2.12)

Finally, by defining a′′3 as the symbol for which the relationship

P (a′′3) = P (a′3) + P (a1) = 0.6 holds, the data from Tab. 2.3 can be

reordered from the most probable to the least probable, as shown in

Tab. 2.4.


Data Probability Code

a′′3 0.6 α3

a2 0.4 c(a2)

Table 2.4: Introduction of data a′′3

Only two entries being left, the encoding is immediate:

c(a′′3) = 0 (2.13)

c(a2) = 1 (2.14)

Besides, since c(a′′3) = α3, as shown in Tab. 2.4, it follows from (2.13)

that α3 = 0, i.e. (2.9), (2.10), (2.11) and

(2.12) can be written as:

c(a1) = 01 (2.15)

c(a3) = 000 (2.16)

c(a4) = 0010 (2.17)

c(a5) = 0011 (2.18)

Tab. 2.5 contains a complete view of the Huffman table so far generated.

The method used for building the Huffman table in this example can

be applied as it is to every data stream, whatever its statistical

structure. The Huffman codes c(ai) generated in this way can be uniquely

decoded: this means that from a sequence of variable-length codes

c(ai) created using Huffman coding, only one data sequence ai can

be reconstructed.

Besides, as shown in the example in Tab. 2.5, none of the codes

c(ai) is contained as a prefix of the remaining codes; codes with

this property are named prefix codes. In particular prefix codes are

also uniquely decodable, while the converse

does not always hold true.

Finally, a Huffman code is an optimum code since, among all

the prefix codes, it is the one that minimizes the average code length.


Data Probability Code

a2 0.4 1

a1 0.2 01

a3 0.2 000

a4 0.1 0010

a5 0.1 0011

Table 2.5: Huffman table
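
The bottom-up construction just carried out by hand can be sketched in
a few lines of Python (an illustrative sketch, not the CARLOS
implementation; the probabilities are those of the example):

import heapq

def huffman_codes(probs):
    """Repeatedly merge the two least probable entries, prepending
    '0' to the codes of the first and '1' to the codes of the second,
    as in the construction of Tabs. 2.1-2.4."""
    # Heap entries: (probability, tie_breaker, {symbol: code_so_far}).
    # The tie breaker makes newly merged nodes rank below older entries
    # of equal probability, reproducing the ordering used in the text.
    heap = [(p, -i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # least probable entry
        p1, _, c1 = heapq.heappop(heap)   # next least probable entry
        c0 = {s: "0" + code for s, code in c0.items()}
        c1 = {s: "1" + code for s, code in c1.items()}
        n += 1
        heapq.heappush(heap, (p0 + p1, -n, {**c0, **c1}))
    return heap[0][2]

table = huffman_codes({"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1})
print(table)

The exact bit patterns depend on how ties between equal probabilities
are broken, but the code lengths of 2, 1, 3, 4 and 4 bits for a1 to a5,
and hence the average code length of 2.2 bits per symbol, are the same
as in Tab. 2.5.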

2.4.2 Run Length encoding

Very often a data stream happens to contain long sequences of the

same value: this may happen when a physical quantity holds the same

value for several sampling periods, it can happen in text files where a

character can be repeated several times, it can happen in digital images

where areas with the same color are encoded with pixels with the same

value, and so on. The compression algorithm based on the Run Length

[9] encoding is well suited for such repetitive data.

As shown in the example in Fig. 2.1, where the zero symbol has been

chosen as the repetitive symbol in the sequence, each zero run in the

original sequence is encoded as a pair of words: the first contains

the code for the zero symbol, the second contains the number of zero

symbols that occurred consecutively in the original sequence.

The performance of the algorithm gets better, in terms of compression

ratio, when the input data stream contains long sub-sequences of the

same symbol and few isolated occurrences, since an isolated symbol

costs two words instead of one, as for the single zero in Fig. 2.1. Finally this compression algorithm can

be implemented in different ways: it can be applied only on one value

of the original data sequence or on different elements of the sequence.

One of the most important applications of the Run Length encoding

system is the compression of facsimile or fax. In facsimile transmission a

page is scanned and converted into a sequence of white and black pixels:

since very long runs of white or black pixels are highly probable,

coding the lengths of runs instead of coding individual pixels


Original sequence:            17 8 54 0 0 0 97 5 16 0 45 23 0 0 0 0 43
Run Length encoded sequence:  17 8 54 0 3 97 5 16 0 1 45 23 0 4 43

Figure 2.1: Run length encoding

leads to high compression ratios. Besides, Run Length encoding is

often used in conjunction with other compression algorithms, after the

input data stream has been transformed into a more compressible form.
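
As an illustration, the zero-run encoding of Fig. 2.1 can be sketched as
follows (a minimal Python sketch; the pair convention of a zero symbol
followed by the run length is the one described above):

def rle_zero_encode(seq, zero=0):
    """Run-length encode only the runs of `zero`: each run is replaced
    by the zero symbol followed by the run length (cf. Fig. 2.1)."""
    out, run = [], 0
    for x in seq:
        if x == zero:
            run += 1                  # extend the current zero run
        else:
            if run:                   # flush a pending zero run
                out += [zero, run]
                run = 0
            out.append(x)             # non-zero symbols pass through
    if run:                           # flush a run ending the sequence
        out += [zero, run]
    return out

original = [17, 8, 54, 0, 0, 0, 97, 5, 16, 0, 45, 23, 0, 0, 0, 0, 43]
print(rle_zero_encode(original))
# [17, 8, 54, 0, 3, 97, 5, 16, 0, 1, 45, 23, 0, 4, 43]

Note that the isolated zero costs two words instead of one, which is why
the method pays off only when long runs dominate.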

2.4.3 Differential encoding

Differential encoding [7] is obtained performing the difference between

one sample and the previous one, except for the first one, whose value

is left unchanged, as shown in Fig. 2.2.

It is to be noticed that each value of the original sequence can be recon-

structed by adding to the corresponding value in the coded sequence all

the previous coded values: for instance, 89 = 79 + 17 + 2 + 5 + 0 + 0 +

(−3) + (−6) + (−5). Hence it is very important to leave the first value in

the coded sequence unchanged, otherwise the reconstruction process can-

not be carried out correctly. The differential algorithm is well suited

not be carried out correctly. The differential algorithm is well suited

for all data sequences with very small changes, in value, between con-

secutive samples: in fact for this kind of data streams the differential

encoding produces an encoded stream with a smaller dynamic range, i.e.

the difference between the maximum and minimum values in the encoded

stream is smaller than the same quantity calculated in the original se-

quence. Thus the encoded sequence can be represented with a smaller

number of bits than the original one.


Original sequence:                    17 19 24 24 24 21 15 10 89 95 96 96 96 95 94 94 95
Sequence after differential encoding: 17 2 5 0 0 −3 −6 −5 79 6 1 0 0 −1 −1 0 1

Figure 2.2: Differential encoding

Besides, the differential encoding can be used in conjunction with

the Run Length encoding system: in fact, if a sequence contains long

sequences of equal values, it is converted into a sequence of zeros by the

differential encoder and then further compressed using the Run Length

encoder.
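
A minimal sketch of the differential encoder and of the reconstruction
by running sums, applied to the sequence of Fig. 2.2:

def diff_encode(seq):
    """Differential encoding: the first sample is kept unchanged, every
    other sample is replaced by its difference from the previous one."""
    return seq[:1] + [b - a for a, b in zip(seq, seq[1:])]

def diff_decode(enc):
    """Reconstruction: each original sample is the running sum of all
    coded values up to and including its position."""
    out, total = [], 0
    for d in enc:
        total += d
        out.append(total)
    return out

seq = [17, 19, 24, 24, 24, 21, 15, 10, 89, 95, 96, 96, 96, 95, 94, 94, 95]
enc = diff_encode(seq)
# enc = [17, 2, 5, 0, 0, -3, -6, -5, 79, 6, 1, 0, 0, -1, -1, 0, 1]
assert diff_decode(enc) == seq       # lossless round trip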

2.4.4 Dictionary techniques

In many applications, the output of a source consists of recurring pat-

terns. A classical example is a text source in which certain patterns

or words recur frequently. Also, there are certain patterns that simply

do not occur or, if they do, occur very rarely. A very reason-

able approach to encoding such sources is to keep a list or dictionary

of frequently occurring patterns. When these patterns appear in the

source, they are encoded with the reference to the dictionary contain-

ing the address to the right table location. If the pattern does not

appear in the dictionary, then it can be encoded using some other,

less efficient, method. In effect we are splitting the input domain in

two classes: frequently occurring patterns and infrequently occurring

patterns. For this technique to be effective, the class of frequently oc-

curring patterns, and hence the size of the dictionary, must be much

smaller than the number of all possible patterns. Depending upon how

much information is available to build a dictionary, a static or a

dynamic approach to the creation of the dictionary can be used. Choosing

a static dictionary technique is most appropriate when considerable

prior knowledge about the source is available.

When no a priori information is available on the structure of the input

source an adaptive technique is adopted: for example the UNIX com-

press command makes use of this technique. It starts with a dictionary

of size 512, thus transmitting codewords 9 bits long. Once the dictio-

nary has filled up, its size is doubled to 1024 entries,

thus transmitting codewords 10 bits long. The dictionary size is

progressively doubled in this way until it contains 2^16 entries; then compress be-

comes a static coding technique. At this point the algorithm monitors

the compression ratio: if it falls below a threshold, the dictionary is

flushed and the dictionary building process is restarted.

The dictionary techniques are also used in the image compression field

in the GIF (Graphics Interchange Format) standard, working in a very

similar way to the compress command.
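
The adaptive dictionary growth described above is the LZW scheme; a toy
Python version (without the 9-to-16-bit codeword escalation and the
flushing heuristic of compress) might look as follows:

def lzw_encode(data):
    """Toy LZW encoder: the dictionary starts with all single bytes
    and grows by one entry (previous match + next symbol) per output
    code."""
    dictionary = {bytes([b]): b for b in range(256)}
    next_code = 256
    out = []
    prefix = b""
    for b in data:
        candidate = prefix + bytes([b])
        if candidate in dictionary:
            prefix = candidate            # keep extending the match
        else:
            out.append(dictionary[prefix])
            dictionary[candidate] = next_code
            next_code += 1
            prefix = bytes([b])
    if prefix:
        out.append(dictionary[prefix])
    return out

codes = lzw_encode(b"abababababab")
print(codes)    # repeated patterns are emitted as single codes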

2.4.5 Selective readout

The selective readout technique [10] is a lossless data compression tech-

nique usually applied in High Energy Physics Experiments. Since really

interesting data are a small fraction of the total amount of data actu-

ally produced, it proves useful to transmit and store only those data.

The selective readout may reduce the data size by identifying regions

in space containing a significant amount of energy. For example in

the SDD case, the Central Trigger Processor (CTP) unit defines a Re-

gion Of Interest (ROI) that, event by event, contains the information

of which ladders are to be read out and which ones can be discarded.

Using the ROI feature a very high compression ratio can be achieved.


2.5 Lossy compression techniques

This section contains an explanation of the most widely used lossy com-

pression techniques. In particular the following items will be covered:

– zero suppression;

– transform coding;

– sub-band coding with some remarks on wavelets.

The first of these algorithms has been chosen for direct application in

the 1D compression algorithm implemented in the prototypes CARLOS

v1 and v2.

2.5.1 Zero suppression

Zero suppression is the very simple technique of eliminating data sam-

ples below a certain threshold, by setting them to 0. Zero suppression

proves to be very useful in data containing large quantities of zeros and

interesting data concentrated in small clusters: for instance, since the

mean occupancy of an SDD in the inner layer is 2.5 %, a compression

ratio of 40 can be obtained by using the zero suppression technique

alone.

A problem arises since the SDD data and, in general, data collections

contain the sum of two different distributions: the real signal corre-

sponding to the interesting physical event and a white noise with a

Gaussian distribution around a mean value. Thus if a lossy compres-

sion algorithm obtains a good compression ratio just by eliminating the

noise, the distortion introduced is absolutely acceptable. The key task

for a fair implementation of the zero suppression technique is the choice

of the right value of the threshold parameter, in order to eliminate noise

while preserving the physical signal.

In the case of data coming out from the SDD detector and related

front-end electronics, data values are shifted from the 0 level to a base-

line level greater than 0. This baseline level corresponds to the mean


value of the noise introduced by the preamplification electronics; then

there is a spread around this value due to the RMS of the Gaussian

distribution of the noise.

The noise level introduced by the electronics may vary with time and

with the amount of radiation absorbed: therefore a compression algorithm

making use of the zero suppression technique has to allow a tunable

value of the threshold level, in order to accommodate fluctuations or

drifts in the baseline values. Following this indication, the threshold

level used in CARLOS v1 and v2 is completely presettable via software

using the JTAG port.
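
A sketch of the operation, with the threshold exposed as a plain
parameter (in CARLOS it is loaded through the JTAG port); the baseline
and RMS figures below are purely illustrative:

import numpy as np

def zero_suppress(samples, threshold):
    """Force all samples below the threshold to zero; samples at or
    above it are kept unchanged."""
    out = np.asarray(samples).copy()
    out[out < threshold] = 0
    return out

# Illustrative numbers only: a tunable cut placed above the baseline.
baseline, noise_rms = 20.0, 5.0
threshold = baseline + 2.0 * noise_rms
print(zero_suppress([18, 22, 35, 19, 60, 25], threshold))  # [0 0 35 0 60 0]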

2.5.2 Transform coding

Transform coding [7] takes as input a data sequence and transforms it

into a sequence in which most part of the information is contained into

a few samples: so far the new sequence can be further compressed using

the other compression algorithms described up to now. The key point

of transform coding is the choice of the transform: this depends on the

features and redundancies of the input data stream to compress. The

algorithm, working on N elements at a time, consists of three steps:

– transform: the input sequence sn is split into N-long blocks;

each block is then mapped, using a reversible transformation, into

the sequence cn;

– quantization: the transformed sequence cn is quantized, i.e. a

number of bits is assigned to each sample depending on the dy-

namics of the sequence, compression ratio desired and acceptable

distortion.

– coding : the quantized sequence cn is encoded using a binary

encoding technique such as Run Length encoding or the Huffman

coding.

These concepts can be expressed in a mathematical way: given an input

sequence sn, it is divided into N-long blocks and each block is mapped,

using the reversible transform A, into the sequence cn:

c = A s   (2.19)

or, in other terms:

c_n = \sum_{i=0}^{N-1} s_i a_{n,i}, with [A]_{i,j} = a_{i,j}   (2.20)

Quantization and encoding steps are performed on the sequence cn, so as to optimize compression.

The decompression algorithm, by means of the inverse transform B =

A−1, reconstructs the original sequence sn from the encoded sequence

cn, in the following way:

s = Bc (2.21)

or:

s_n = \sum_{i=0}^{N-1} c_i b_{n,i}, with [B]_{i,j} = b_{i,j}   (2.22)

These concepts can be easily extended to bi-dimensional data, such as

images or 2-D charge distributions, as in the case of the SDD.

Let us take an N × N portion of a digital image S, containing Si,j as

its (i, j)-th pixel; by performing a reversible bi-dimensional transform

A working on N × N pixels at a time, with a_{i,j,k,l} the generic element

of the transform kernel A and C_{i,j} the (i, j)-th pixel of the N × N block

of the compressed image C, the following holds true:

C_{k,l} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} S_{i,j} \, a_{i,j,k,l}   (2.23)

A transform is defined separable if it is possible to apply the 2D trans-

form of an N × N block by applying first a 1D transform on the N rows

of the block and then a transform on the N columns of the block just

transformed; by choosing a separable transform, with a_{i,j} the (i, j)-th

element of the transform matrix A, (2.23) becomes:

C_{k,l} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} S_{i,j} \, a_{k,i} \, a_{l,j}   (2.24)


or, expressed as a matrix:

C = A S A^T   (2.25)

The inverse transform is the following one:

S = B C B^T   (2.26)

Frequently orthonormal transforms are used, so that B = A^{-1} = A^T,

in such a way that calculating the inverse transform reduces to:

S = A^T C A   (2.27)

Even in the bi-dimensional case, in order to reach a high compression

ratio, a good transform has to be chosen. For instance the JPEG

standard has adopted, until the year 2000, the use of the Discrete

Cosine Transform, known as DCT.

If A is the matrix representing the DCT, the following relationship

holds:

[A]_{i,j} = w(i) \cos\left( \frac{(2j + 1) i \pi}{2N} \right),   j = 0, 1, . . . , N − 1   (2.28)

where:

w(i) = \sqrt{1/N} for i = 0,   w(i) = \sqrt{2/N} for i = 1, . . . , N − 1

Fig. 2.3 gives a graphical interpretation of (2.28).
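
For concreteness, (2.28) and the separable pair (2.25)/(2.27) can be
checked numerically with a short Python/numpy sketch (an illustration,
not part of any readout chip design):

import numpy as np

def dct_matrix(N=8):
    """Orthonormal DCT matrix A of Eq. (2.28):
    A[i, j] = w(i) * cos((2j + 1) * i * pi / (2N))."""
    A = np.zeros((N, N))
    for i in range(N):
        w = np.sqrt(1.0 / N) if i == 0 else np.sqrt(2.0 / N)
        for j in range(N):
            A[i, j] = w * np.cos((2 * j + 1) * i * np.pi / (2 * N))
    return A

A = dct_matrix(8)
S = np.random.randint(0, 256, (8, 8)) - 128   # level-shifted 8-bit block
C = A @ S @ A.T        # forward separable 2D transform, Eq. (2.25)
S_back = A.T @ C @ A   # inverse transform, Eq. (2.27)
assert np.allclose(S, S_back)   # orthonormality gives exact inversion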

After choosing the transform, the next step consists in the quantization

of the transformed image.

Several approaches are possible: for example zonal mapping calls for

a preliminary analysis of the statistics of the transformed coefficients

and a later assignment of a fixed number of bits.

The name zonal mapping comes from the assignment of a fixed number

of bits depending on the zone in which each coefficient is placed in the

square N × N block under study; Tab. 2.6 reports an allocation bit table for an 8 × 8 block.


Figure 2.3: Base coefficients for the bi-dimensional DCT in the case N = 8

8 7 5 3 1 1 0 0

7 5 3 2 1 0 0 0

4 3 2 1 1 0 0 0

3 3 2 1 1 0 0 0

2 1 1 1 0 0 0 0

1 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

Table 2.6: Allocation bit table for an 8 × 8 block


It is interesting to note that the quantization in Tab. 2.6 assigns zero

bits to the coefficients in the lower-right side of the table: this is

equivalent to ignoring those coefficients. This kind of quantization makes

sense since the lower-right side coefficients come from a transformation of

the original image using high frequency cosines, i.e. these coefficients

contain the information corresponding to the high frequencies in the

original signal, see Fig. 2.3.

Since the response of the human eye strongly depends on frequency and,

in particular, is sensitive to variations at low frequencies and far less

sensitive at higher frequencies, the quantization in Tab. 2.6 tends to

discard information that the human eye would not appreciate at all.


After quantization, only non-null coefficients are transmitted. In par-

ticular for every non-null coefficient, two words have to be transmitted:

the first with the quantized value of the coefficient itself; the second

containing the number of null samples occurred after the last non null

coefficient. This allows the decompression algorithm to exactly recon-

struct the sequence as it was quantized and, from that, the original

image.

As an example, let us suppose we have the 8 × 8 8-bit pixel image

reported in Tab. 2.7.

124 125 122 120 122 119 117 118

121 121 120 119 119 120 120 118

126 124 123 122 121 121 120 120

124 124 125 125 126 125 124 124

127 127 128 129 130 128 127 125

143 142 143 142 140 139 139 139

150 148 152 152 152 152 150 151

156 159 158 155 158 158 157 156

Table 2.7: 8× 8 block of a digital image

Each value of the block is shifted by 2^{p−1}, where p is the

number of bits per pixel (in this case p = 8); then the DCT is applied

to the block obtaining the coefficients ci,j reported in Tab. 2.8.

39.88 6.56 -2.24 1.22 -0.37 -1.08 0.79 1.13

-102.43 4.56 2.26 1.12 0.35 -0.63 -1.05 -0.48

37.77 1.31 1.77 0.25 -1.50 -2.21 -0.10 0.23

-5.67 2.24 -1.32 -0.81 1.41 0.22 -0.13 0.17

-3.37 -0.74 -1.75 0.77 -0.62 -2.65 -1.30 0.76

5.98 -0.13 -0.45 -0.77 1.99 -0.26 1.46 0.00

3.97 5.52 2.39 -0.55 -0.051 -0.84 -0.52 -0.13

-3.43 0.51 -1.07 0.87 0.96 0.09 0.33 0.01

Table 2.8: DCT coefficients related to the block in Tab. 2.7.


As already stated high-frequency related coefficients in the lower-right

corner tend to be quite close to 0, while most of the information is

concentrated in the upper-left corner.

The quantization of the coefficients is obtained using the quantization

table in Tab. 2.9; in particular the quantized values lij are obtained with

the following formula:

l_{ij} = \lfloor c_{ij} / Qt_{ij} + 0.5 \rfloor   (2.29)

where Qt_{ij} is the (i, j)-th element of the quantization table and

\lfloor x \rfloor denotes the greatest integer not exceeding x.

16 11 10 16 24 40 51 61

12 12 14 19 26 58 60 55

14 13 16 24 40 57 69 56

14 17 22 29 51 87 80 62

18 22 37 56 68 109 103 77

24 35 55 64 81 104 113 92

49 64 78 87 103 121 120 101

72 92 95 98 112 100 103 99

Table 2.9: Quantization table

Tab. 2.10 contains the resulting bit allocation table obtained using the

values contained in the quantization table Tab. 2.9:

After studying the structure of matrices like Tab. 2.10, the order chosen

for sending coefficients is the one shown in Fig. 2.4.

This choice makes it highly probable that the tail of the final sequence

contains long runs of zero coefficients, so this part of the sequence can

be encoded using the Run-Length technique (a sketch follows).
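
The quantization step (2.29) and one possible zig-zag readout can be
sketched as follows; applied to the coefficients of Tab. 2.8 with the
table of Tab. 2.9, the quantizer reproduces Tab. 2.10:

import numpy as np

def quantize(C, Q):
    """Uniform quantization of Eq. (2.29): l_ij = floor(c_ij / Q_ij + 0.5)."""
    return np.floor(C / Q + 0.5).astype(int)

def zigzag(block):
    """Read an N x N block along antidiagonals, alternating direction,
    so that the mostly-null high-frequency coefficients pile up in one
    long tail that run-length encodes well (cf. Fig. 2.4)."""
    N = block.shape[0]
    order = sorted(((i, j) for i in range(N) for j in range(N)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else -p[0]))
    return [block[i, j] for i, j in order]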

2 1 0 0 0 0 0 0
-9 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0

Table 2.10: Resulting bit allocation table

Figure 2.4: Zig-zag scanning pattern for an 8x8 transform

2.5.3 Subband coding

A signal can be decomposed in different frequency components (see

Fig. 2.5) using analog or digital filters, then each resulting signal can

be encoded and compressed using a specific algorithm. Digital filtering

[9] involves taking a weighted sum of current and past inputs to the

filter and, in some cases, the past outputs to the filter. The general

form of the input-output relationship of the filter is given by:

y_n = \sum_{i=0}^{N} a_i x_{n-i} + \sum_{i=1}^{M} b_i y_{n-i}   (2.30)

where the sequence xn is the input to the filter, the sequence yn is

the output from the filter and the values ai and bi are called the filter

coefficients. If the input sequence is a single 1 followed by all 0s, the

output sequence is called the impulse response of the filter.

Figure 2.5: Decomposition of a signal in frequency components

The impulse response completely specifies the filter: once we know the impulse

response of the filter, we know the relationship between the input and

the output of the filter. Notice that if the bi are all zero, there the

impulse response will die out after N samples. These filters are called

finite impulse response or FIR filters. In FIR filters Eq. 2.30 reduces

to a convolution operation between the input signal and the filter co-

efficients. Filters with nonzero values for some of the bi are called

infinite impulse response filters or IIR filters.
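
A direct, sample-by-sample implementation of (2.30) makes the FIR/IIR
distinction concrete (a sketch; real filters would be implemented in
fixed-point hardware):

def difference_equation(x, a, b):
    """Eq. (2.30): y_n = sum_{i=0..N} a_i x_{n-i} + sum_{i=1..M} b_i y_{n-i}.
    With b empty the filter is FIR and this reduces to a convolution;
    any nonzero b_i feeds past outputs back, giving an IIR filter."""
    y = []
    for n in range(len(x)):
        acc = sum(ai * x[n - i] for i, ai in enumerate(a) if n - i >= 0)
        acc += sum(bi * y[n - i] for i, bi in enumerate(b, start=1) if n - i >= 0)
        y.append(acc)
    return y

# Impulse response of a two-tap averaging FIR filter:
print(difference_equation([1, 0, 0, 0], a=[0.5, 0.5], b=[]))
# [0.5, 0.5, 0.0, 0.0]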

The basic subband coding works as follows: the source is passed

through a bank of filters (a 3-level filter bank is shown in Fig. 2.6),

called the analysis filter bank which covers the range of frequencies

that make up the source; the outputs of the filters are then subsampled

as in Fig. 2.7. The justification for subsampling is the Nyquist rule and

its generalization, which states that for perfect reconstruction we only

need twice as many samples per second as the range of frequencies.

This means that it is possible to reduce the number of samples at the

output of the filter, since the range of frequencies there is less than the

range of frequencies at the input of the filter. The process of reducing the num-

ber of samples is called decimation or downsampling. The amount of

decimation depends on the ratio of the bandwidth of the filter output

to the filter input.

Figure 2.6: An 8-band 3-level filter bank

If the bandwidth at the output of the filter is 1/M

of the bandwidth at the input of the filter, the output is decimated by

a factor of M by keeping every Mth sample. Once the output of the

filters has been decimated, the output is encoded using one of several

encoding schemes explained so far.

Along with the selection of the compression scheme, the allocation of

bits between the subbands is an important design parameter, since

different subbands contain differing amounts of information. The bit

allocation procedure can have a significant impact on the quality of

the final reconstruction, especially when the information component of

different bands is very different.

The decompression phase, in subband coding also named synthesis,

works as follows: first the encoded samples for each subband are de-

coded at the receiver, then the decoded values are upsampled by in-

serting an appropriate number of zeros between the samples, then the

upsampled signals are passed through a bank of reconstruction filters

and added together.
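
A one-level, two-band example with the Haar pair (pairwise averages as
the low-pass branch, pairwise differences as the high-pass branch) shows
the whole analysis, downsampling and synthesis chain and its perfect
reconstruction; this is only a sketch, and real subband coders use
longer filters:

import numpy as np

def haar_analysis(x):
    """One level of a two-band analysis bank using the Haar pair:
    low pass = pairwise averages, high pass = pairwise differences,
    each branch implicitly decimated by 2."""
    x = np.asarray(x, dtype=float)
    low = (x[0::2] + x[1::2]) / 2.0
    high = (x[0::2] - x[1::2]) / 2.0
    return low, high

def haar_synthesis(low, high):
    """Synthesis bank: upsampling and filtering are folded into the
    interleaving; reconstruction is exact."""
    out = np.empty(2 * len(low))
    out[0::2] = low + high
    out[1::2] = low - high
    return out

x = np.array([17, 19, 24, 24, 24, 21, 15, 10], dtype=float)
low, high = haar_analysis(x)
assert np.allclose(haar_synthesis(low, high), x)   # perfect reconstruction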


Figure 2.7: Subband coding technique: analysis filter bank, downsampling
and encoding

Subband coding has applications in speech coding and audio coding

with the MPEG audio, but can be applied also to image compression.

2.5.4 Wavelets

Another method of decomposing signals that has gained a great deal

of popularity in recent years is the use of wavelets [11, 12, 13, 14].

Decomposing a signal in terms of its frequency content using sinusoids

results in a very fine resolution in the frequency domain. However

sinusoids are defined on the time domain from −∞ to ∞, therefore

individual frequency components give no temporal resolution [15].

In a wavelet representation, a signal is represented in terms of functions

that are localized both in time and in frequency. For instance, the

following is known as the Haar wavelet :

ψ_{0,0}(x) = \begin{cases} 1 & 0 \le x < 1/2 \\ −1 & 1/2 \le x < 1 \end{cases}   (2.31)


Figure 2.8: The Haar wavelet ψ0,0 and its dilated and translated versions
ψ1,0, ψ1,1, ψ2,0, ψ2,1, ψ2,2

From this “mother” function the following set of functions can be ob-

tained:

ψ_{j,k}(x) = ψ_{0,0}(2^j x − k) =
\begin{cases} 1 & k\,2^{−j} \le x < (k + 1/2)\,2^{−j} \\ −1 & (k + 1/2)\,2^{−j} \le x < (k + 1)\,2^{−j} \end{cases}   (2.32)

A few of these functions are shown in Fig. 2.8. It is to be noticed that

as j increases, the functions become more and more localized in time.

This localization action is known as dilation and it makes it possible to represent

local changes accurately using very few coefficients. The effect of k in

Equation (2.32) is to move or translate the wavelet. The components

of a wavelet expansion are obtained from a mother wavelet through the

actions of dilation and translation. In Equation (2.32) any real num-

bers a and b could be used instead of the dilation factor 2^j and the

translation factor k, but this is the usual choice for the discrete wavelet

representation.

The wavelet representation may provide a better approximation to the

input with fewer coefficients; however its usefulness depends to a large

extent on the ease of implementation.


Figure 2.9: Example of multiresolution analysis (panels (a) to (d))

In 1989, Stephane Mallat ([16]) developed the multiresolution approach,

which moved the representation using wavelets into the domain of sub-

band coding. These concepts can be better understood with the help of

an example. Let us suppose we have to approximate the function f(t)

drawn in Fig. 2.9a using the translated versions of some time-limited

function φ(t). The indicator function is a simple approximating func-

tion:

φ(t) = \begin{cases} 1 & 0 \le t < 1 \\ 0 & \text{otherwise} \end{cases}   (2.33)

Let us now define the translated versions of φ(t), φ0,k(t):

φ0,k(t) = φ(t− k) (2.34)

Then it is possible to approximate the waveform f(t) by a linear com-

bination of the φ(t) and its translates as shown in Fig. 2.9b:

φ^0_f(t) = \sum_{k=0}^{N-1} c_{0,k}\, φ_{0,k}(t)   (2.35)

where

φ0,0(t) = φ(t) (2.36)


and c_{0,k} are the average values of the function in the interval [k, k + 1).

In other words:

c_{0,k} = \int_{k}^{k+1} f(t)\, φ_{0,k}(t)\, dt   (2.37)

It is possible to scale φ(t) to obtain:

φ_{1,0}(t) = φ_{0,0}(2t) = \begin{cases} 1 & 0 \le t < 1/2 \\ 0 & \text{otherwise} \end{cases}   (2.38)

Its translates are given by:

φ_{1,k}(t) = φ_{0,0}(2t − k)   (2.39)

= \begin{cases} 1 & 0 \le 2t − k < 1 \\ 0 & \text{otherwise} \end{cases}   (2.40-2.41)

= \begin{cases} 1 & k/2 \le t < (k + 1)/2 \\ 0 & \text{otherwise} \end{cases}   (2.42)

Approximating the function f(t) using the translates φ_{1,k}(t), the ap-

proximation φ^1_f(t) is obtained as shown in Fig. 2.9c:

φ^1_f(t) = \sum_{k=0}^{2N-1} c_{1,k}\, φ_{1,k}(t)   (2.43)

where:

c_{1,k} = 2 \int_{k/2}^{(k+1)/2} f(t)\, φ_{1,k}(t)\, dt   (2.44)

In this case we need twice as many coefficients as in the previous

case. The two sets of coefficients are related by:

c_{0,k} = \frac{1}{2}\,(c_{1,2k} + c_{1,2k+1})   (2.45)

If we wanted to get a closer approximation we could obtain a further

scaled version of φ(t) and so on until we obtain an accurate represen-

tation of f(t) (see Fig. 2.9d). Now let us assume that the function f(t)


is accurately represented by φ^1_f(t). φ^1_f(t) can be decomposed into a

lower resolution version of itself, namely φ^0_f(t), and the difference

φ^1_f(t) − φ^0_f(t). Let us examine this function over an arbitrary

interval [k, k+1):

φ^1_f(t) − φ^0_f(t) = \begin{cases} c_{1,2k} − c_{0,k} & k \le t < k + 1/2 \\ c_{1,2k+1} − c_{0,k} & k + 1/2 \le t < k + 1 \end{cases}   (2.46)

Substituting for c_{0,k} from (2.45) we obtain:

φ^1_f(t) − φ^0_f(t) = \begin{cases} \frac{1}{2} c_{1,2k} − \frac{1}{2} c_{1,2k+1} & k \le t < k + 1/2 \\ −\frac{1}{2} c_{1,2k} + \frac{1}{2} c_{1,2k+1} & k + 1/2 \le t < k + 1 \end{cases}   (2.47)

Defining:

b_{0,k} = \frac{1}{2}\,(c_{1,2k} − c_{1,2k+1})   (2.48)

over the interval [k, k+1), φ^1_f(t) − φ^0_f(t) can be written as:

φ^1_f(t) − φ^0_f(t) = b_{0,k}\, ψ_{0,k}(t)   (2.49)

where ψ_{0,k}(t) is the k-th translate of ψ_{0,0}(t) and:

ψ_{0,0}(t) = \begin{cases} 1 & 0 \le t < 1/2 \\ −1 & 1/2 \le t < 1 \end{cases}   (2.50)

which is the Haar wavelet. The 2N-point sequence c1,k can be decom-

posed into two N-point sequences c0,k and b0,k. The first decomposition

is performed using a set of functions called the scaling function: this

decomposition produces the approximation coefficients. The second

component is obtained in terms of a wavelet and its translates: this

decomposition produces the detail coefficients.
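
In code, the decomposition just described is one step of a recursion;
iterating it gives the multiresolution ladder discussed below (a sketch
assuming the input length is divisible by 2**levels):

import numpy as np

def haar_mra(c, levels):
    """Recursive Haar decomposition: at each level the current
    approximation coefficients split into a coarser approximation
    (Eq. 2.45) and a set of detail coefficients (Eq. 2.48)."""
    details = []
    c = np.asarray(c, dtype=float)
    for _ in range(levels):
        approx = (c[0::2] + c[1::2]) / 2.0   # c_{j-1,k}, Eq. (2.45)
        detail = (c[0::2] - c[1::2]) / 2.0   # b_{j-1,k}, Eq. (2.48)
        details.append(detail)
        c = approx
    return c, details    # coarse approximation plus detail coefficients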

This example can be generalized as follows. Let φ_{j,k}(t) be a set of

functions with the following properties:

1. φ_{j,0}(t) = φ_{0,0}(2^j t)   (2.51)

2. If a function can be expressed exactly by a linear combination of

the set φ_{j,k}(t), then it can also be expressed exactly as a function

of the set φ_{l,k}(t) for all l ≥ j.

3. The complete set \{φ_{j,k}(t)\}_{j,k=−∞}^{∞} can exactly represent all func-

tions with the property that:

\int_{−∞}^{∞} |f(t)|^2 \, dt < ∞   (2.52)

4. If a function f(t) can be exactly represented by the set φ_{0,k}(t),

then any integer translate of the function, f(t − k), can also be

represented exactly by the same set.

5. \int φ_{0,l}(t)\, φ_{0,k}(t)\, dt = \begin{cases} 0 & l \ne k \\ 1 & l = k \end{cases}   (2.53)

The set forms a multiresolution analysis [16]. Thus at any resolution

2^{−j} every function f(t) can be decomposed into two components: one

that can be expressed as a function of the set φ_{j,k}(t) and one that

can be expressed as a linear combination of the wavelets ψ_{j,k}(t).

The mother wavelet ψ_{0,0}(t) and the scaling function φ_{0,0}(t) are related

in the following manner: from Property 2, φ_{0,0} can be written in terms

of φ_{1,k}. If the relationship is given by:

φ_{0,0}(t) = \sum_n h_n\, φ_{1,n}(t)   (2.54)

then the wavelet ψ_{0,0}(t) is given by:

ψ_{0,0}(t) = \sum_n (−1)^n h_n\, φ_{1,n}(t)   (2.55)

From this relationship we can assume that the wavelet decomposition

can be implemented in terms of filters with impulse responses given

by (2.54) and (2.55) and that the filters are quadrature mirror filters.

Most of the orthonormal wavelets are nonzero over an infinite inter-

val. Therefore the corresponding filters are IIR filters. Well known


exceptions are the Daubechies wavelets that correspond to FIR filters.

Once the coefficients of the FIR filters have been obtained, the proce-

dure for compression using wavelets is identical to the one described for

subband coding. From now on the terms multiresolution analysis and

wavelet-based analysis will be regarded as synonymous. Some of the most

widely used wavelet families are shown in Fig. 2.10, Fig. 2.11 and Fig. 2.12.

2.6 Implementation of compression algo-

rithms

Compression algorithms can be implemented in hardware or in soft-

ware, depending on the required speed. When speed is the most impor-

tant constraint on the choice of the implementation of the compression

algorithm, hardware implementation becomes necessary.

Commercial devices exist implementing data compression in hardware:

for example the ALDC1-40S-M from IBM featuring an adaptive lossless

data compression works at a rate of 40 MBytes/s, while the AHA32321

chip from Aha can compress and decompress data at 10 MBytes/s with

a clock frequency of 40 MHz. These rates are far smaller than the

one required by the SDD readout: in fact the com-

pression chip we need has to face an input data rate of 320 MByte/s.

No commercial chip exists with such features, so we had to design an

Application Specific Integrated Circuit (ASIC) targeted to our require-

ments.

Figure 2.10: Some functions belonging to different wavelet families (the
Haar and the Daubechies db1, db2, db3 and db10 scaling functions, wavelet
functions and decomposition/reconstruction filters): note that db1 is
equivalent to the Haar

Figure 2.11: Some functions belonging to different wavelet families (the
Symlets sym2, sym3, sym4, sym8 and the Coiflets coif1, coif2, coif3,
coif5)

Figure 2.12: Some functions belonging to different wavelet families (the
biorthogonal wavelets bior1.1, bior1.3, bior1.5, bior6.8 and the reverse
biorthogonal wavelets rbio1.1, rbio1.3, rbio1.5, rbio6.8): note that
bior1.1 and rbio1.1 are equivalent to the haar

Chapter 3

1D compression algorithm

and implementations

3.1 Compression algorithms for SDD

The choice of the algorithm for SDD data compression is strictly related

to the input data stream features:

– low detector occupancy (max 3 %)

– small sample values are much more probable than large ones

The first feature suggests the use of a zero suppression algorithm: all

samples below a certain value (depending on the noise distribution)

are discarded. The second feature suggests adopting an entropy coder,

such as the Huffman one. Besides, it is important for the algorithm

to contain software tunable parameters in order to re-optimize the al-

gorithm performance in case of changes in the statistics of the input

distribution. For instance the threshold level has to be changeable via

software in order to take into account changes in the signal to noise

ratio over the years, and the Huffman tables have to be reconfigurable

too. The other important features for the compression algorithms are:

– they have to be fast


– they have to be simple to implement in hardware

– they have to allow lossless data transmission

For the development of the compression algorithms, studies have been

performed on the statistical distribution of the sample data coming

from the single-particle events of three beam tests, so that noise could

be properly taken into account. The compression results have been

evaluated in order to verify the algorithm efficiency and the best pa-

rameter values.

3.2 1D compression algorithm

Following these requirements the INFN Section of Torino has chosen

a sequential compression algorithm [17] which scans data coming from

each anode row as uni-dimensional data streams. As shown in Fig. 3.1

as an example, data samples coming from anode 76 are processed, then

from anode 77 and so on. The ultimate goal of the algorithm is to

save data belonging to a cluster, while rejecting all the other samples

regarded as noise. To have a data reduction system that is applicable

to all the situations, the algorithm is provided with different tuning

parameters (Fig. 3.2 provides a graphical explanation of them):

– threshold : the threshold parameter is applied to the incoming

samples, forcing them to zero if they are smaller than

this value. This parameter has the goal of eliminating noise and

pedestals affecting data.

– tolerance: the tolerance parameter is applied to differences calcu-

lated between consecutive samples, forcing them to zero if they

are less than this value (using this mechanism samples that differ

only slightly are considered equal). In this way non-significant fluctuations

of the input values are eliminated using the tolerance mechanism.

– disable: the disable parameter is applied to the input data, re-

moving all previous mechanisms for samples greater than disable

in order to have full information on the clusters and to maintain

good double peak resolution. This means that the important in-

formation is not affected by the lossy compression algorithm.

Figure 3.1: Cluster in two dimensions and its slices along the anode
direction

The 1D algorithm actually consists of 5 processing steps applied in

sequence (see Fig. 3.3 and the sketch after the list):

– first the input data values below the threshold parameter value

are put to 0;

– then, the difference between a sample and the previous one (along

the time direction) is calculated;

– if the difference value is smaller than the tolerance parameter and

if the input sample is smaller than the disable parameter, then the

difference value is put to 0, otherwise its value is left unchanged;

– these values are then encoded using the Huffman table;

– the obtained values are then encoded using the Run Length en-

coding method.
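
A sketch of the first three (lossy) steps on a single anode row follows;
the Huffman and run length stages of Fig. 3.3 would be applied
downstream, and all three parameters are software tunable as in CARLOS:

import numpy as np

def compress_1d_front(samples, threshold, tolerance, disable):
    """Sketch of the lossy front of the 1D algorithm on one anode row;
    steps 4-5 (Huffman and run length encoding) would follow."""
    s = np.asarray(samples, dtype=int)
    s = np.where(s < threshold, 0, s)        # step 1: zero suppression
    d = np.diff(s, prepend=0)                # step 2: time differences
                                             # (first sample kept as is)
    small = (np.abs(d) < tolerance) & (s < disable)
    d = np.where(small, 0, d)                # step 3: tolerance/disable cut
    return d

# With threshold = tolerance = 0 the whole chain is lossless (Sec. 3.3.1).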


Figure 3.2: Threshold, tolerance and disable parameters (anodic signal
as a function of time)

The high probability of finding long zero sequences in the SDD charge

distribution makes the use of Run Length encoding very effective, espe-

cially when combined with threshold, tolerance and disable mechanisms.

3.3 1D algorithm performances

As explained in Chapter 1 in order to comply with the target figures of

DAQ speed and magnetic tape usage, the size of the SDD event has to

be reduced from 32.5 MBytes to about 1.5 MBytes, which corresponds

to a target compression coefficient of 22. Several standard compression

algorithms have been evaluated on SDD test beam events data in order

to have an estimation of the compression performances achievable: the

best compression coefficient has been obtained with the gzip utility

of the Unix operating system, so gzip was chosen for

comparison with our 1D algorithm. The data were submitted to the

gzip program in binary format for a fair comparison.


Figure 3.3: 1D compression algorithms (block diagram: input stream,
simple threshold zero suppression, differential encoding with tolerance,
Huffman encoding, run length encoding, compressed data; threshold,
tolerance and the Huffman tables are software tunable parameters)

3.3.1 Compression coefficient

For the comparison task, data coming from the August 1998 test beam

were chosen. The gzip compression algorithm achieves a compression

ratio around 2: this value is too far from our target value of 22.

The 1D compression algorithm has been applied using a threshold value

of 20 = 1 ∗ noise mean + 1.35 ∗ noise RMS and tolerance = 0: the

compression value obtained is around 12.5. This is still an unaccept-

able value for our purposes. The goal compression value of 22 can

only be reached by increasing the threshold parameter, which implies a

larger information loss. For instance by applying the algorithm on the

same test beam data it is possible to obtain a compression coefficient

of about 33, with threshold = 40 = 1∗noise mean+2.68∗noise RMS

and tolerance = 0. Fig. 3.4 shows the variation of the compression

coefficient using the 1D algorithm as a function of the threshold level

between 20 and 40 and for two values of tolerance.


Figure 3.4: 1D compression ratio as a function of threshold and tolerance

An important feature of this compression algorithm is that it can be
turned into a lossless algorithm simply by setting the values of thresh-
old and tolerance to 0. Sending data without losing any information
will be very useful for the first event acquisitions, since raw data will be
analyzed to determine statistics, noise and so on. These raw data will
also be used to determine the best Huffman tables, i.e. those yielding
the best compression coefficient. When used in lossless mode,
meaning that only differential encoding, Huffman and run length en-
coding are applied, the compression coefficient obtained is 2.3, which is
even better than what we obtain with the gzip algorithm.

3.3.2 Reconstruction error

It then had to be checked whether the information loss introduced by a
threshold level of 40 is acceptable or not. In particular, it was decided to
study how much data compression and decompression affect the cluster
geometry as far as centroid position and charge are concerned.

Figure 3.5: Spreads introduced by data compression on the measurement
of the coordinates of the SDD clusters and of the cluster charge (bottom
right)

A cluster-finding routine was developed, based on the following two-step
procedure:

– data streams are analyzed one anode row after the other: when

a sample value is higher than a certain threshold level for two

consecutive time bins, it is considered to be a hit until it goes

below the same threshold for two consecutive time bins;

– then, if any two 1-D hits from adjacent anodes overlap in time, they
are considered as part of a two-dimensional cluster.

After the samples belonging to clusters have been found, they are fitted
with a two-dimensional Gaussian function with the following features:

– the mean value corresponds to the cluster centroid;

– the sigma value corresponds to the centroid resolution;

– the volume under the Gaussian function corresponds to the charge

released on the detector by the ionizing particle.


The 1D compression and decompression algorithms were then applied
to test beam data, and cluster finding and analysis were performed on
both the original and the processed data: the results are shown in
Fig. 3.5. The picture on the upper left shows the distribution of the
differences in the centroid coordinates before and after compression
along the anode and drift time directions. The picture on the upper
right shows the same distribution along the drift time direction, while
the picture on the bottom left shows the distribution along the anode
direction. These plots show that the compression algorithm with a
threshold of 40 does not introduce biases on the centroid coordinate
measurements, but worsens their accuracy by about 9 µm (+4%) along
the anode direction and by about 16 µm (+8%) along the drift time
axis. The bottom right picture shows the percentage difference of the
charge before and after compression: the 1D algorithm thus also
introduces an underestimation of the cluster charge of about 4%.

3.4 CARLOS v1

During 1999 I collaborated with the INFN group in Torino on the
design and test of a first hardware implementation of the 1D algorithm:
CARLOS v1. This device is physically implemented as a PCB (Printed
Circuit Board) containing 2 FPGA (Field Programmable Gate Array)
circuits and some connectors for use in a test beam data acquisition
system, as shown in Fig. 3.6. The device processes data coming from
one macrochannel only, that is data coming from one half-detector, and
directly interfaces to the SIU board, the first stage of the DAQ system.

3.4.1 Board description

The two main processing blocks mounted on the board are the two
Xilinx FPGA devices. An FPGA is a completely programmable de-
vice widely used for fast prototyping before the final implementation of
the design on an ASIC circuit, which requires far more resources in
terms of time, money and design effort.

Figure 3.6: CARLOS prototype v1 picture

An FPGA contains a matrix of CLBs
(Configurable Logic Blocks) that can be individually programmed and
connected together in order to implement the desired input/output
logic function. Each CLB contains an SRAM (Static RAM) that is used
to implement a logic function by putting the input values on its ad-
dress bus: the SRAMs are used as look-up tables.

Another piece of silicon area on the FPGA die contains the con-
figuration RAM: depending on the contents of this block the device
will accomplish different logic functions. The configuration RAM is
written at power-on from an external EPROM: CARLOS v1 hosts two
EPROM devices for the configuration of the two FPGAs. The pro-
cess of configuration takes around 20 ms, after which the devices are
completely operational. A 10 MHz clock generator is hosted between
the EPROM chips: we could not achieve a higher working frequency

the EPROM chips: we could not achieve a higher working frequency

with our choice of FPGA device. In fact the final operating frequency

63

Page 76: HARDWARE IMPLEMENTATION OF DATA COMPRESSION …falchier/tesidottorato/tesi_final.pdf · 2003-01-08 · universita degli studi di bologna facolt a di scienze matematiche fisiche e

1D compression algorithm and implementations

Features Values

Logic cells 2432

Max logic gates (no RAM) 25k

Max RAM bits (no logic) 32768

Typical gate range (logic and RAM) 15k - 45k

CLB matrix 32x32

Total CLBs 1024

Number of flip-flops 2560

Number of user I/O 256

Table 3.1: XC4025 Xilinx FPGA main features

is a function of how many internal resources are being used: the more

resources are used, the slower becomes the final working frequency.

With the final 10 MHz frequency we reached a good trade-off between

logic complexity and speed; furthermore this frequency was sufficient

for application in a test-beam environment. Tab. 3.1 reports the main

features of the chosen FPGA device, the XC4025E-4 HQ240C.

The board also contains 3 connectors, from left to right:

– the first is used for data injection into the first FPGA device using

a Hewlett Packard (HP) pattern generator;

– the second one is used for analyzing data coming out from the

first device by making use of a logic analyzer probe;

– the third connector is used for the communication between CAR-
LOS v1 and the SIU board. Fig. 3.7 shows a picture of the final
SIU board. We used a simplified SIU version called SIMU (SIU
simulator), distributed at CERN to help front-end designers
realize DAQ-compatible devices. The SIMU board can be directly
plugged onto this connector.


Figure 3.7: Picture of the SIU board

3.4.2 CARLOS v1 design flow

I carried out the design of the second FPGA device following the
digital design flow shown in Fig. 3.8. In particular, the design flow is
composed of the following steps:

– block specifications have been coded with the VHDL language

using a hierarchical structure starting from the bottom layer up

to the top-level;

– each VHDL model has been simulated in order to debug the code

using the Synopsys simulator software;

– each VHDL model has been synthesized, i.e. translated into
a netlist, using the Synopsys synthesis tool; the netlist contains
the usual standard cells such as AND, OR or flip-flops, while the
FPGA device does not contain these elements, only RAM
blocks: the netlist is only a logic representation of the circuit
itself, it has no physical meaning;

– the netlist is simulated using the Synopsys simulator, taking into

account cell timing delays and constraints.

– the netlist is automatically converted into a physical layout using
the place and route software Alliance from Xilinx;

– the layout information is put into a binary file, ready to be down-
loaded into the EPROM chip using the Alliance software, together
with an EPROM programmer.

Figure 3.8: Digital design flow for CARLOS v1

This is a very straightforward and automated process; moreover, the
time needed between a slight modification in the VHDL code and its
actual implementation in the FPGA device is very short. This is the
main reason why FPGAs are so widely used for prototyping. Another
very important reason is the following: running millions of test
vectors as a software simulation of a VHDL model is a very long process
even for fast machines, while the same set of test vectors can be run in a
few seconds on the hardware prototype. FPGA implementation easily
allows algorithm verification on a huge amount of data.


3.4.3 Functions performed by CARLOS v1

The FPGA on the left in Fig. 3.6 contains the 1D compression algo-

rithm, as explained in the previous sections, composed of 5 processing

blocks sequentially applied to the input data. The blocks form a 5-level

pipeline chain, each level requiring one clock cycle. The variable-length
compressed codes are produced as 32-bit words.

The FPGA on the right contains the following blocks:

– firstcheck : this block processes 32-bit input words coming from

the compressor FPGA: if the MSB is high the incoming data is

rejected, otherwise it is accepted and split into two different data

words, one 26-bit wide containing the variable length code and one

5-bit one containing the information of how many bits have to be

stored.

– barrel : this block packs 2 to 26 bits variable length codes in fixed-

size 32 bits words. The information of how many bits from 2 to 26

have to be stored is contained in the 5-bit length bus coming from

the firstcheck block. Variable length Huffman codes packed in 32-

bit words can be uniquely unpacked by using the Huffman table

and starting from the MSB to LSB. When a word is complete an

output-push signal is asserted.

– fifo: it contains a 64x32-bit RAM memory for storing data com-

ing out of the barrel shifter. When the FIFO contains at least

16 data words it asserts a query signal in order to ask the feesiu

block to begin data popping.

– feesiu: this is the most complex block of the prototype containing

the interface between CARLOS and the SIU board. The main be-

havior is quite simple: CARLOS waits for a “Ready to Receive”

(RDYRX) command from the SIU on a bidirectional data bus;

after receiving it CARLOS takes possession of the bidirectional

bus and begins sending data towards the SIU as packets of 17
32-bit words. Each packet is built as a header word containing exter-


nally hardwired information and 16 data words coming out of the
FIFO. When the FIFO is empty or does not contain 16 data
words, no valid data is sent to the SIU. If instead a FIFO begins
to acquire large quantities of data while the connection to the SIU
is not yet open (a RDYRX command has not been received yet),
a data-stop signal is asserted to stop the data stream coming
into CARLOS from AMBRA.

3.4.4 Tests performed on CARLOS v1

The test of the CARLOS prototype has been carried out using the pat-

tern generator and logic analyzer HP16700A at the INFN Section in

Torino. Data were injected on the first connector, analyzed on the

second connector, while the third one has been connected to a SIU

extender board, which directly connects to the SIMU board. The SIU

extender is very useful for debugging purposes since it provides 5 logic

analyzer compatible connectors for analyzing signals being exchanged

in the interface CARLOS-SIU. Here follows a list of the test performed

on CARLOS:

1. functional test and compression algorithm verification;

2. opening of a transaction by manually pushing buttons on the

SIMU board;

3. event data transmission from CARLOS to the SIMU. The SIMU

does not store data, so the only way to check if data are correct

or not is by using the logic analyzer.

Prototype test was especially useful in order to design a perfectly com-

patible interface towards the SIU. The main difficulty in testing the

interface towards the SIU without a SIU board is due to the presence

of bidirectional pads: it is quite a difficult job to work with such pads

using a pattern generator.

Many corrections had to be applied to the original version in order to


have a 100% compatible interface. The final VHDL version was then

frozen and then used for the ASIC implementation of CARLOS v2.

The VHDL model, in fact, does not depend on the technology chosen

for the implementation and is completely re-usable.

3.5 CARLOS v2

The first CARLOS prototype has been very useful for testing the com-
pression algorithm on a huge amount of data and for correctly designing
complex blocks such as the interface towards the SIU, but it has many
limitations compared to the final version we need to design. We
therefore decided to move to a second prototype of CARLOS with the
following features:

– 40 MHz clock frequency;

– 8 macro-channels parallel processing;

– small size for an easier use in test-beam environment;

– a JTAG port for downloading the Huffman look-up tables, the

threshold and tolerance values.

The CARLOS chip design has been logically divided into two main

parts, the first one designed in Torino and the second one in Bologna:

– a data compressor on 8 incoming streams, using the 1D compres-

sion algorithm. The compressor accepts 8-bit input data and gives

as output 32-bit words containing the variable length codes.

– a data packing and formatting block, a multiplexer selecting which

one of the 8 incoming streams has to be sent in output and an

interface block towards the SIU.

As can be seen in Fig. 3.9, the main sub-blocks are 6: firstcheck, barrel,
fifo, event-counter, outmux, feesiu.


Figure 3.9: CARLOS v2 schematic blocks


3.5.1 The firstcheck block

The I/O signals are:

– inputdata: input 32-bit bus;

– ck : input signal;

– reset : input signal;

– load : output signal;

– addressvalid : output 5-bit bus;

– datavalid : output 26-bit bus.

The firstcheck block takes as input the compressed codes coming from

the compression block and selects the useful bits while rejecting the

dummy ones. In fact the 32-bit input word has the following structure:

– bit 31: under-run bit: when set to 1 it means that incoming data

are dummy and have to be discarded; this may happen, for exam-

ple, when the run length encoder is packing long zero sequences,

thus temporarily interrupting the data flow towards the SIU.

– bit 30 to 26: this 5-bit word contains the actual number of bits

that have to be selected by the following logic block, the barrel

shifter.

– bit 25 to 0: this 26-bit word contains the compressed code.

The really useful bits are usually far fewer than 26, which yields
a reduction in the data stream volume.

The firstcheck behavior is quite simple: when the reset signal is ac-

tive (active high) all outputs are set to 0; when reset is inactive the

firstcheck block samples the under-run bit value: when 1 all outputs are

set to 0, when 0 load is set to 1, addressvalid is assigned inputdata(30

downto 26) and datavalid is assigned inputdata(25 downto 0).
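A sketch in C of this decoding rule (an illustrative model; the real
block is a clocked VHDL process, and the function name is mine):

    #include <stdint.h>
    #include <stdbool.h>

    /* Returns false when the under-run bit is set (dummy word, load = 0);
     * otherwise fills the two output buses and returns true (load = 1). */
    bool firstcheck(uint32_t inputdata, uint8_t *addressvalid,
                    uint32_t *datavalid)
    {
        if (inputdata >> 31)                      /* bit 31: under-run   */
            return false;
        *addressvalid = (inputdata >> 26) & 0x1F; /* bits 30..26: length */
        *datavalid = inputdata & 0x03FFFFFF;      /* bits 25..0: code    */
        return true;
    }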


3.5.2 The barrel shifter block

The I/O signals are:

– input : input 26-bit bus;

– sel : input 5-bit bus;

– load : input signal;

– ck : input signal;

– reset : input signal;

– end-trace: input signal;

– output-push: output signal;

– output : output 32-bit bus.

The barrel shifter has to pack all the valid data coming out from the

firstcheck block into a fixed-length 32-bit register word to be put in out-

put: in this way all dummy data are rejected and there is no longer any
distinction between data-length and the data itself. All data are packed in

the same word and can be easily reconstructed by using the Huffman

tree decoding scheme. If an input code cannot be completely stored
into a 32-bit word, it is broken into 2 pieces: the first fills the MSBs of
the current output word completely, the second forms the LSBs of

the following valid output word.

When the reset is active all internal registers and outputs are set to

0, when the reset is inactive the barrel shifter begins to wait for valid

data coming from the firstcheck block, that is data with the load signal

set to 1. When this happens the barrel shifter selects the valid bits from
input and packs them together in a 64-bit circular register. When
32 bits have been written to the register, the block asserts the output-

push high to communicate to the following block (the FIFO) that the

output is valid and has to be stored.

Two situations are very important for the barrel shifter to work prop-

erly: when the load signal changes from 1 to 0 the barrel stops packing


data and when load turns to 1 again the barrel begins packing data as

if no pause had happened.

The end-trace signal is asserted for one clock period in coincidence with

the last valid data: this data has to be packed together with the others,

then the 32-bit word has to be pushed in output (by putting output-

push to 1) even if it is not complete. After the end-trace and after

the last valid word has been sent to output the barrel shifter puts n

zero words as valid outputs: that number depends on how many words

have been sent to output from the beginning of the current event. In

fact the total number of valid words per event has to be an integer

multiple of 16. Thus if (16k + 7) words have been sent in output when
the end-trace gets active, n = 9 zero words follow with output-push set

to 1. This condition is strictly related to the data transmission policy

and multiplexing of the 8 incoming data streams onto a single 32-bit

output, as will be explained in the next paragraph.
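A software sketch of the packing mechanism (assumptions: MSB-first
packing and a simple accumulator modelling the 64-bit circular register;
the padding of each event to a multiple of 16 words is not shown):

    #include <stdint.h>

    typedef struct {
        uint64_t acc;    /* packing buffer (models the circular register) */
        int      nbits;  /* number of bits currently held in acc          */
    } barrel_t;

    /* Append a 2..26-bit code; emit() plays the role of output-push. */
    static void barrel_push(barrel_t *b, uint32_t code, int len,
                            void (*emit)(uint32_t word))
    {
        b->acc = (b->acc << len) | (code & ((1u << len) - 1));
        b->nbits += len;
        while (b->nbits >= 32) {            /* a 32-bit word is complete */
            emit((uint32_t)(b->acc >> (b->nbits - 32)));
            b->nbits -= 32;
        }
    }

    /* On end-trace the incomplete word is flushed, zero-padded. */
    static void barrel_end_trace(barrel_t *b, void (*emit)(uint32_t))
    {
        if (b->nbits > 0)
            emit((uint32_t)(b->acc << (32 - b->nbits)));
        b->nbits = 0;
    }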

3.5.3 The fifo block

The I/O signals are:

– datain: input 32-bit bus;

– ck : input signal;

– push: input signal;

– pop: input signal;

– reset : input signal;

– empty : output signal;

– full : output signal;

– query : output signal;

– dataout : output 32-bit bus.

The fifo block contains a double-port RAM block with 64 32-bit words

plus some control logic. Its purpose is to buffer the input data stream


and derandomize the queues that are waiting to be served by the out-

mux block. The buffer memory has to be large enough to allow data
storing while the other queues are being served, since blocking
conditions have to be avoided. On the other hand it cannot be too large, since CAR-

LOS hosts 8 fifo blocks and the chip area is a strong design constraint.

The fifo allows 3 main storage operations:

– write only;

– read only;

– read/write at the same time but at different cell locations.

The FIFO allows data coming from the barrel shifter to be written and
to be read back when the queue has to be served by the outmux block. The

most important feature is that read and write operations can be exe-

cuted in parallel. In order to accomplish this feature the control logic

provides two pointers named address-write and address-read. They run

from 0 to 63 and then back to 0 in a circular way: obviously address-read

has always to follow address-write, otherwise we would be extracting

invalid data from the memory. Data is written in the fifo and the

address-write pointer is incremented by one when the input push is set

to 1: the input push of the fifo is the same signal as the output-push

one from the barrel. In this way when the barrel shifter has an output

valid, it is written in a free location of the fifo at the next clock cycle.

The RAM read phase is activated by the pop input signal: for every

clock cycle in which pop is 1, the data value corresponding to address-

read is taken in output dataout and then the pointer address-read is

incremented by 1. When both push and pop are set to 1 the fifo is

read and written at the same time and the distance between the two

pointers remains constant. Three important signals are (a software
sketch follows the list):

– query signal: the query signal is set to 1 when the memory contains

at least 16 valid words, that is when the distance between the two
pointers is greater than or equal to 16. The query signal is used by

the outmux block where a priority encoding based arbiter decides


which of the 8 queues has to be served in output. When a fifo

block is served by the outmux, the number of total valid words

decreases and the query signal comes back to 0. It can happen
that the query signal remains at 1 if more than 32 valid words were
stored in the fifo: in this case the fifo may be served again. It
all depends on how many queues are sending the scheduler requests
to be emptied.

– empty signal: the empty signal is set to 1 when the fifo does not

contain any valid data, that is when address-write and address-

read have the same value and are pointing to the same memory

location. This signal will be used by the feesiu block in order to

decide when all the 8 queues have been completely emptied and a

new data set can enter CARLOS.

– full signal: the full signal is very important since it is back-

propagated to the compressor block in order to signal that

the FIFO is getting full and the input stream has to be stopped.

The compressor block will back-propagate this full signal to the

AMBRA chip which will stop sending data to CARLOS. Obvi-

ously the full signal has to be asserted before the FIFO is really

full, otherwise some input data would be lost. For this reason the

fifo full signal works between 2 thresholds, 32 and 48: the full

signal goes high when the fifo contains more than 48 valid words,

then it comes back to 0 only when the fifo has been served by the

outmux block, that is when the fifo contains fewer than 32 valid

words. With this trick the risk for the fifo to get completely full

is reduced, at least if the queues arbiter is fair enough with every

input stream.
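A sketch in C of the occupancy logic described above (a hypothetical
model; the real fifo is a dual-port RAM plus control logic in VHDL):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t ram[64];
        unsigned wr, rd;   /* address-write / address-read, modulo 64   */
        unsigned count;    /* number of valid words stored               */
        bool     full;     /* hysteresis flag, back-propagated upstream  */
    } fifo_t;

    static void update_full(fifo_t *f)
    {
        if (f->count > 48) f->full = true;        /* getting full        */
        else if (f->count < 32) f->full = false;  /* served, may resume  */
    }

    static bool query(const fifo_t *f) { return f->count >= 16; }
    static bool empty(const fifo_t *f) { return f->count == 0;  }

    static void push(fifo_t *f, uint32_t w)   /* driven by output-push  */
    {
        f->ram[f->wr] = w; f->wr = (f->wr + 1) & 63; f->count++;
        update_full(f);
    }

    static uint32_t pop(fifo_t *f)            /* driven by pop          */
    {
        uint32_t w = f->ram[f->rd]; f->rd = (f->rd + 1) & 63; f->count--;
        update_full(f);
        return w;
    }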

3.5.4 The event-counter block

The I/O signals are:

– end-trace: input signal;


– ck : input signal;

– reset : input signal;

– event-id : output 3-bit bus.

The event-counter block is a very simple 3-bit binary counter used

to assign a number to every physical event, so as to be able to

easily discriminate consecutive events. When the reset is active internal

registers and outputs are put to 0, then, when the reset is inactive,

the event-counter block increments its output signal event-id by one

every time it samples the end-trace signal at logic level 1. The end-

trace feeding the event-counter block is a signal coming from the feesiu

block called all-fifos-empty. This signal is asserted for two clock periods

when all the 8 end-trace signals have been set to 1 and when all the

8 queues have been completely emptied. For this purpose CARLOS

contains a global end-trace signal which is activated when all the 8

local end-traces have been high for at least one clock period; it is not

strictly necessary that a temporal overlap exists between the 8 signals.

Note, however, that this means that the global end-trace will never be put to

1 if some of the local end-traces are not used and remain stuck at 0.

After an end-trace global is activated, the feesiu block begins waiting

for the 8 FIFOs being emptied: as soon as this happens the all-fifos-

empty signal is activated and the event-id signal is incremented by one.

The signal all-fifos-empty stays at logical level 1 for two consecutive

clock periods; nevertheless the event-id counter is incremented only
once. The value of event-id is used in the outmux block and it is sent to

the SIU as a part of the header word. We thought that 3 bits could be

sufficient to discriminate the events and to put them back in the right
order during the data decompression and reconstruction stages.

3.5.5 The outmux block

The I/O signals are:

– indat7 : input 32-bit bus;


– indat6 : input 32-bit bus;

– indat5 : input 32-bit bus;

– indat4 : input 32-bit bus;

– indat3 : input 32-bit bus;

– indat2 : input 32-bit bus;

– indat1 : input 32-bit bus;

– indat0 : input 32-bit bus;

– reset : input signal;

– ck : input signal;

– query : input 8-bit bus;

– event-id : input 3-bit bus;

– enable-read : input signal;

– half-ladder-id : input 7-bit bus;

– good-data: output signal;

– read : output 8-bit bus;

– output : output 32-bit bus.

The outmux block has two distinct functions in the overall logic:

– multiplexing the 8 compressed and packed streams onto a single

32-bit output (femux sub-block);

– deciding which queue has to be served using a priority encoding

based arbiter (ppe sub-block).

The femux and ppe blocks implement the following 17-word data packet

transmission protocol (see Fig. 3.10):

– a 32-bit header;

– 16 32-bit data words, all coming from one macrochannel and from

one event.


Figure 3.10: 17-word data transmission protocol

The header contains the following information, from MSB to LSB (a
packing sketch in C follows the list):

– half ladder id (7 bits): this number is hardwired externally to each

CARLOS chip, depending on the ladder it will be connected to;

– packet sequence number (10 bits): this is a 10-bit wide counter

incremented once a packet is transmitted, i.e. every 17 data words;

– cyclic event number (3 bits): this is the event number coming from

the event-counter block;

– available bits (9 bits): these will be used in a future expansion of

CARLOS;

– half detector id (3 bits): every half ladder contains 8 half detectors.

They are numbered from 0 to 7 and this number is provided by

the macro-channel being served.
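A sketch in C of the header word layout (field order and widths as
listed above, MSB to LSB; the function name is illustrative):

    #include <stdint.h>

    /* 7 + 10 + 3 + 9 + 3 = 32 bits */
    static uint32_t make_header(uint32_t half_ladder_id,   /* 7 bits  */
                                uint32_t packet_seq,       /* 10 bits */
                                uint32_t event_id,         /* 3 bits  */
                                uint32_t half_detector_id) /* 3 bits  */
    {
        return ((half_ladder_id & 0x7Fu)  << 25) |
               ((packet_seq     & 0x3FFu) << 15) |
               ((event_id       & 0x7u)   << 12) |
               /* bits 11..3: the nine available bits, left at 0 */
               (half_detector_id & 0x7u);
    }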

Let’s take a look at the 2 sub-blocks of the outmux :


– femux is a multiplexer with nine 32-bit inputs and a 9-bit selection

bus. The 9 data inputs are the header and the 8 input channels

coming from the FIFOs. The selection bus value is given by the

queues scheduler: this bus contains all zeros except a single 1.

– ppe stands for programmable priority encoder. It is a completely

combinatorial block with two inputs and one output: request (8

bits) contains the query signals coming from the 8 macro-channels;

priority (8 bits) is a bus containing only one 1 and all the other

bits at 0. The output served (8 bits), like priority, contains only one
bit at logic level 1, indicating which of the 8 macro-channels has to be
served by the femux.

The programmable priority encoder works in a very simple way:

it scans the request bus starting from the bit stuck at 1 in the

priority bus until it finds a 1. Its bit position from 0 to 7 corre-

sponds to the channel chosen by the arbiter. At the next choice

that the arbiter has to take, the priority bus value is updated in

the following way: the served bus value is shifted on the right as if

it were a circular register and its value is assigned to the priority

bus. In this way we avoid the risk of a queue being served many

times consecutively in spite of other queues making requests. An

example will easily clarify this situation: request = 10100010, pri-

ority = 00010000, served = 00000010. At the next clock cycle,

the value "00000001" will be assigned to the priority bus. There

are several possible implementations for a scheduling algorithm

based on a programmable priority encoder: they differ in area

and timing requirements. We chose the implementation used in

the Stanford University’s Tiny Tera prototype as described in [18].

I’ll try now to explain how the outmux block works: the outmux block

is stopped and it is initialized when the reset signal is active. When

the reset is inactive, the outmux block begins waiting for the enable-

read signal to get active. This is a signal coming from the feesiu block:

when low it states that the link between the SIU and CARLOS has


not been initialized yet or it means that temporarily the SIU cannot

accept data. When the enable-read is high, the SIU is able to receive

data from CARLOS, so the outmux block begins evaluating the value

of the query bus. When its value is low it means that no macro-channel

has still required to be served, otherwise the ppe block decides which

queue to send in output. The first word served as output is the header

word containing the information on the macro-channel being served

and other information as stated above in the paragraph. In order to

get the 16 data words to send as output, the outmux block has to

provide the right pop signal to send to one of the 8 FIFOs. The 8

pop signals to the FIFOs are grouped in the 8-bit read bus; of course

only one bit at a time will be asserted. Signal read(7) will be sent to

fifonew7, read(6) to fifonew6 and so on, so as to extract 16 valid words

from the FIFO. Since we want to send data to the SIU at a 20 MHz

clock (half the system clock frequency) the pop signal cannot be stuck

at 1 for 16 clock periods but it is alternatively 0 and 1 in order to get

a data word out from the FIFO one clock period every two. When

the outmux block is putting in output the 17 words of the packet, the

output signal good-data is set to 1 in order to guarantee to the feesiu block

that it is receiving significant data. While sending the last data word

of a packet, the outmux block updates the priority bus value as stated

above and examines the query bus value, then it computes the right

served value. If served is not 0, that is if any request has occurred, the

outmux block begins sending another packet in output, without any
interruption (no clock periods are wasted), otherwise the block stops
and waits for a new request to be asserted. If the enable-read turns
from 1 to 0 while transmitting data, the outmux block sends only one
more valid word in output, then stops and waits for the enable-read
signal to be restored to its active value: it then continues sending data
to the feesiu block as if no pause had occurred. The outmux block
itself increments the 10-bit packet sequence number after every packet
has been completely transmitted.

The reason why a 20 MHz clock has been chosen is related to the


total optical fibre bandwidth to be used by CARLOS: 800 Mbits/s. If

CARLOS puts in output 32-bit data at 40 MHz the total bandwidth

required is 1.28 Gbits/s, while at 20 MHz it is only 640 Mbits/s. For this

reason a half-frequency data rate has been chosen as the final one.

3.5.6 The feesiu (toplevel) block

The I/O signals are:

– huffman7 : input 32-bit bus;

– huffman6 : input 32-bit bus;

– huffman5 : input 32-bit bus;

– huffman4 : input 32-bit bus;

– huffman3 : input 32-bit bus;

– huffman2 : input 32-bit bus;

– huffman1 : input 32-bit bus;

– huffman0 : input 32-bit bus;

– ck : input signal;

– reset : input signal;

– end-trace7 : input signal;

– end-trace6 : input signal;

– end-trace5 : input signal;

– end-trace4 : input signal;

– end-trace3 : input signal;

– end-trace2 : input signal;

– end-trace1 : input signal;

– end-trace0 : input signal;

– fidir : input signal;


– fiben-n: input signal;

– filf-n: input signal;

– half-ladder-id : input 7-bit bus;

– wait-request7 : output signal;

– wait-request6 : output signal;

– wait-request5 : output signal;

– wait-request4 : output signal;

– wait-request3 : output signal;

– wait-request2 : output signal;

– wait-request1 : output signal;

– wait-request0 : output signal;

– foclk : output signal;

– fbten-n: bidirectional signal;

– fbctrl-n: bidirectional signal;

– fobsy-n: output signal;

– fbd : bidirectional 32-bit bus.

The VHDL feesiu block contains all the other block instances (see

Fig. 3.11) and the logic working as the interface with the SIU board.
Thus the feesiu block contains 8 instances of firstcheck, 8 instances of

barrel, 8 instances of fifo, 1 instance of event-counter and 1 instance of

outmux. However we can imagine the feesiu block as the block taking

data from the outmux block and directly interfacing the SIU board, as

if it were at the same hierarchical level as the other blocks. In Fig. 3.9

the feesiu block is represented exactly in this fashion.

3.5.7 CARLOS-SIU interface

Let’s now take a look the interface signals between CARLOS and the

SIU and how the communication protocol has been implemented:


Figure 3.11: Design hierarchy of CARLOS v2

– fidir : it’s an input to CARLOS. It asserts the direction of the

data flow between CARLOS and the SIU: when low, data flow is

directed from the SIU to CARLOS, otherwise data flow is directed

from CARLOS to the SIU.

– fiben-n: it’s an input to CARLOS, active low. It enables the com-

munication on the bidirectional buses between CARLOS and the

SIU. When low, communication is enabled, otherwise communi-

cation is disabled.

– filf-n: it’s an input to CARLOS, active low, ”lf” stands for link

full. When the SIU is no longer able to accept data coming from

CARLOS, it puts this signal active. When this happens CARLOS

sends one more valid data word, then stops transmitting, waiting
for the filf-n signal to be released. This is the signal used by

the SIU to implement the back-pressure on the data flow running

from the front-end to the data acquisition system.

– foclk : it is a free running clock generated on CARLOS and driving


the CARLOS-SIU interface. It is a 20 MHz clock generated by

dividing the system clock frequency by 2. Interface signals coming

from the SIU are triggered on the falling edge of foclk.

– fbten-n: it is a bidirectional signal, active low, it can be driven by

CARLOS or by the SIU; "ten" stands for transfer enable. When

CARLOS is assigned to drive the bidirectional buses (when fidir

is high and fiben-n is 0) fbten-n value is asserted from CARLOS: it

turns to its active state when CARLOS is transmitting valid data

to the SIU, otherwise it is inactive. When the SIU is assigned

to drive the bidirectional buses (when fidir is 0 and fiben-n is

0) fbten-n value is asserted from the SIU: it turns to its active

state when the SIU is transmitting valid commands to CARLOS,

otherwise it is inactive.

– fbctrl-n: it is a bidirectional signal, active low, it can be driven by

CARLOS or by the SIU; "ctrl" stands for control. When CARLOS

is assigned to drive the bidirectional buses (when fidir is 1 and

fiben-n is 0) fbctrl-n value is asserted from CARLOS: it turns

to its active state when CARLOS is transmitting a Front End

Status Word to the SIU, otherwise, when in the inactive state,

CARLOS is sending normal data to the SIU. When the SIU is

assigned to drive bidirectional buses (when fidir is 0 and fiben-n

is 0) fbctrl-n value is asserted from the SIU: it turns to its active

state when sending command words to CARLOS, to its inactive

state when sending data words. The second option has not been

implemented on CARLOS since we decided that CARLOS needs

only commands and not data from the SIU. Other detectors use

this option in order to download data to the detector itself: this

is the case, for example, of the Silicon Pixel Detector.

– fobsy-n: it is an input signal to the SIU, active low; "bsy" stands

for busy. CARLOS should put this signal active when not able

to accept data coming from the SIU. Since CARLOS never has to

receive data from the SIU, this signal has been stuck at 1, meaning


that CARLOS will never be in a busy state. In fact it always has

to accept command words coming from the SIU.

– fbd : it is a bidirectional 32-bit bus on which data or command

words are exchanged between CARLOS and the SIU.

This is the way the communication protocol works: the SIU acts as the

master and CARLOS acts as the slave, i.e. the SIU sends commands to

CARLOS and CARLOS sends data and front end status words to the

SIU. At first the link CARLOS - SIU has to be initialized and the SIU

acts as the master of the bidirectional buses. So CARLOS waits for the

bidirectional buses to be driven from the SIU (fidir is 0 and fiben-n is

0) and waits for a valid (fbten-n = 0) command (fbctrl-n = 0) named:

Ready to Receive (RDYRX). This command is always used in order

for a new event transaction to begin. The RDYRX command contains

a transaction identifier (bits 11 to 8) and the string "00010100" as the
least significant bits.

As the command is accepted and recognized, CARLOS waits for the

fidir signal to change value in order to take possession of the bidirec-

tional buses, then, if the filf-n is not active, it is able to send valid

data on the fbd bus if the good-data signal is active. In this state,

CARLOS sends valid data of an event to the SIU only when some

queues are making requests of being served in output, otherwise the

feesiu stops sending data by putting the fbten-n signal to 1. When

an end-trace signal has arrived on each macrochannel and every queue

has been completely emptied (no data of that particular event re-
mains stored in CARLOS), CARLOS puts in output the Front End Sta-

tus Word (FESTW), a word that confirms that no errors occurred and

that the whole event has been successfully transferred to the SIU. The

FESTW contains the Transaction Id code received upon the opening of

the transaction (bits 11 to 8) and the 8-bit FESTW code "01100100".

After this happens CARLOS begins to wait for some action to be taken
by the SIU: the SIU can decide to take back its control of the
bidirectional buses and close the data link towards the data ac-
quisition system, or it can leave the control of the bidirectional buses to
CARLOS for another data event to be sent. CARLOS then waits for
16 foclk periods: if nothing happens, CARLOS is able to begin
sending data again without the need to receive any other command
from the SIU; if the SIU takes back possession of the bidirectional
buses, CARLOS closes the link towards the SIU and keeps waiting for
another RDYRX command issued by the SIU itself.

The feesiu block implements this communication protocol with the SIU
using a simple state machine: state 0 is the state in which CARLOS
waits for an initialization command from the SIU, state 1 is the state
in which CARLOS sends data to the SIU, state 2 the one in which
CARLOS sends the front end status word to the SIU, and state 3 the
one in which CARLOS waits 16 foclk periods for some action from the
SIU to happen. A sketch of these states follows.

An important feature of CARLOS realized in the feesiu block is the

following one: CARLOS cannot accept a new event before the previ-

ous one has been completely sent in output, otherwise we run into the

risk of mixing data belonging to different events. The only way CAR-

LOS has to implement back-pressure on the AMBRA chips is using the

wait-request signals. Thus the wait-request signal has to prevent
CARLOS from fetching new input data values while the FIFOs are
being emptied.

For this reason a new signal, dont-send-data, has been introduced for

every macro-channel which turns to 1 when the end-trace is activated

and turns back to 0 when all the FIFOs are completely empty. So

the wait-request of every macro-channel is obtained by ORing
the full and dont-send-data signals. The feesiu acknowledges that all

the FIFOs have been emptied using the empty signal of every FIFO

block. When all the 8 signals turn to 1 the feesiu block raises the all-

fifos-empty signal which stays at logical level 1 for at least two clock

periods in order to be sensed by the foclk clock. The all-fifos-empty sig-

nal is also used to trigger the event-counter block: in fact the number

of total events is exactly the same as the total number of occurrences

of the all-fifos-empty signal. Another signal, end-trace-global, is set
to 1 only if all the local end-trace signals have been put to 1 for at
least one clock period in the current event. From the moment in which
end-trace-global is asserted until all-fifos-empty is activated, no new
input data set can enter CARLOS.

3.6 CARLOS v2 design flow

Figure 3.12: Digital design flow for CARLOS v2

Fig. 3.12 illustrates the digital design flow for CARLOS v2. The front

CARLOS v1. The only difference is the library used, being, in this

case, the Alcatel Mietec 0.35 µm digital library provided via Euro-

practice. This is a very rich library since it contains more than 200

different standard cells and RAM blocks of several dimensions. A
RAM generator software allows the designer to get a macrocell with
the exact number of words and bits per word as requested: in our case
a 64x32-bit macrocell, instantiated 8 times, one per macrochannel.

Figure 3.13: Layout of the ASIC CARLOS v2

The back end steps were carried out at IMEC using the Avant! soft-
ware Aquarius. We could not get a license for this software
due to its high cost (more than 100k$ per license), while no other
available software, such as Cadence, was able to work with the design
kit provided. The final physical layout is depicted in Fig. 3.13. The
chip has a total area of 30 mm² containing 300k standard cells, 180


I/O pads and 24 RAM blocks.

After the design of the layout, IMEC sent us the post-layout netlist

and an SDF file (Standard Delay Format) containing the information

on each net and cell delay for post-layout simulation with the same

test-benches already used for pre-layout simulation. This is usually an

iterative process since, if some simulation problems arise, the layout

has to be re-designed. Luckily due to the relatively small working fre-

quency (40 MHz) (the technology adopted can easily work up to 200

MHz) the post-layout simulation gave no problems and the design was

then sent to the foundry.

3.7 Tests performed on CARLOS v2

After receiving 20 samples of naked chips (without any package) from
the Alcatel Mietec foundry, they were directly bonded onto the
test PCBs at the INFN of Torino, one sample per PCB. The test PCB,
shown in Fig. 3.14, especially designed for testing CARLOS v2 and for
its use in test beam data taking, contains the following:

– 5 2x10 pins DIL connectors pin compatible with the pattern gen-

erator and logic analyzer HP16600/16700A pods;

– 2 Mictor 38 connectors;

– a DIP switch providing a facility to setup the hardwired parame-

ters, such as the half ladder ID;

– filter capacitors for a total capacitance greater than 100 nF;

– buffers for preserving the integrity of the CARLOS input pads.

After testing the JTAG control unit on CARLOS, the connection to-

wards the SIMU was successfully tested: after the SIMU opens a trans-

action, CARLOS takes possession of the bidirectional buses and starts

sending data. After these tests, the SIMU was replaced by the
SIU board and the complete data acquisition system, i.e. DIU (Destination
Interface Unit) and PCI RORC (Read Out Receiver Card) directly con-
nected to a PC. In this way, testing CARLOS behavior with huge amounts
of data becomes much easier than with the Logic State Analyzer alone, and
the complete data acquisition system can be used to acquire data in test
beams.


Chapter 4

2D compression algorithm

and implementation

This chapter contains a brief description of the 2D algorithm [19] con-

ceived at the INFN Section of Torino and a first implementation at-

tempt in ASIC with the third prototype of CARLOS.

4.1 2D compression algorithm

The 2D algorithm operates a data reduction based on a two-threshold

discrimination and a two-dimensional analysis along both the drift time

axis and the SDD anode axis. The proposed scheme allows for a bet-

ter understanding of the neighbourhoods of the SDD signal clusters,

thus improving their reconstructability and also provides a statistical

monitoring of the background features for each SDD anode.

4.1.1 Introduction

As shown in Chapter 3, due to the presence of noise a simple single-

threshold one-dimensional zero suppression does not allow a good clus-


ter reconstruction in all circumstances. Indeed, in order to obtain a
good compression factor using the 1D algorithm, a threshold of about
three times the RMS of the noise has to be used. Such a threshold often
determines a rather sharp cut of the tails of the anode signals contain-
ing high samples and, more importantly, it can completely suppress the

anodic signals with small values which are on the sides of the cluster.

Both these sharp cuts, particularly the latter, can significantly affect

the spatial resolution. Though samples below a 3 RMS threshold have

small information contents, it is conceivable that, in the more accurate

off-line analysis, they can help to improve the pattern recognition and

the fitting of the cluster features. In order to read out small-amplitude

samples without increasing too much the collection of the noise, a two-

threshold algorithm can be used, so that small samples that satisfy a

low threshold are collected only when, along the drift direction, they

are near to samples satisfying a high threshold. Since the charge cloud

diffuses in two orthogonal directions for symmetry reasons, and due to
the previous considerations, the two-threshold method should be applied
along the anode axis too. We want such a two-threshold two-
dimensional data compression and zero suppression algorithm to satisfy
the following criteria:

– the values of the samples, in the neighbourhood of a cluster, be

available both for an accurate measurement of the characteristics

of the clusters and for a good monitoring and understanding of

the characteristics of the background;

– the statistical nature of the suppressed samples be available to

monitor the noise level of the anodes and to obtain their baseline

values, which have to be subtracted from the cluster samples in

order to obtain a correct measurement of the related charge.

Here follows a description of the studied algorithm: the data reduc-

tion algorithm is applied to the resulting matrix of 256 rows by 256

columns, like the one shown in the upper part of Fig. 4.1. Each matrix

element expresses an 8-bit quantized amplitude. A row represents a

time sequence of the samples from a single SDD anode and a column
represents a spatial snapshot of the simultaneous anode outputs at an
instant of time.

Figure 4.1: Example of the digitized data produced by a half SDD

For each charge cloud we expect several high values in

one or more columns and rows. This extension in both time and space

thus requires that correlations in both dimensions be preserved for fu-

ture analysis. We refer to correlations within a column as space-like

and correlations within a row as time-like. Therefore, in the proposed

two-threshold two-dimensional algorithm, the high threshold TH must

be satisfied by a pixel value in order that it be part of a cluster, and the

low threshold TL leads to the registering of a pixel whose value satisfies
it, if adjacent to another pixel satisfying TH. In this way the lower
value pixels on the border of a cluster are encoded, thus ensuring that
the tails of the charge distribution are retrieved.

Figure 4.2: Neighbourhood of the pixel C: its north (N), south (S), east
(E) and west (W) neighbours

Within this framework, a cluster is redefined operationally as a set of

adjacent pixels whose values tend to stand out above the background.

In the described algorithm there is a trade-off in the definition of such

a cluster, which lies in the definition of adjacency. We have considered

as adjacent (or neighbour) to the (i, j) element the pixels for which
only one of the two indexes changes by 1: thus the neighbour pixels are
(i − 1, j), (i + 1, j), (i, j − 1) and (i, j + 1). A correlation therefore involves
a quintuple composed of a central (C) pixel and its north (N), south
(S), east (E) and west (W) neighbours only (see Fig. 4.2). In order to

monitor the statistical nature of the suppressed samples, the number of

zero quantized values (due either to negative analog values of the noise

or to baseline equalization), and the numbers of samples satisfying TH

and TL are recorded. The background average and standard devia-

tion are obtained by applying a minimization procedure to the three

counted data. A feature of this reduction algorithm is the conser-

vation of information about the background both near and far from the

clusters. When the thresholds are properly chosen, statistically, pairs

and a few triplets of background pixels not associated with a particle-

produced cluster will satisfy the described discrimination criteria and

provide consistency information on the background statistics, assumed
to be Gaussian white noise. At the same time single high background
peaks are suppressed as zeros (if they do not have at least one neigh-
bour that satisfies at least the low threshold) so as not to overload the
data acquisition and to allow an efficient zero suppression. The only
parameters needed as input to the 2D compression algorithm are the
two thresholds TH and TL and the baseline equalization values.

Figure 4.3: Cluster in two dimensions and its slices along the anode direction

4.1.2 How the 2D algorithm works

The 2D algorithm makes use of two threshold values:

– a high threshold TH for cluster selection;

– a low threshold TL, so as to collect information around the selected cluster.

The algorithm retains data belonging to a cluster and around a cluster in the following way (as graphically shown in the example of Fig. 4.3):

– the pixel matrix is scanned searching for values higher than the TH value (70 in Fig. 4.3);

– the pixels positioned around the previously selected ones are accepted if higher than the low threshold value TL (40 in Fig. 4.3), otherwise they are rejected;

– a cluster is thus defined and the cluster values are saved exactly as they are; other pixels, not belonging to clusters, are discarded;

– if a pixel value higher than the TH value is found but it has no pixels higher than TL around it, its value is rejected. This is the case of the value 78 in the bottom-left corner of Fig. 4.3, which is discarded even if its value is greater than the high threshold value;

– pixel values belonging to a cluster are encoded using a simple look-up table method, assigning long codes to non-frequent values and short codes to frequent symbols.

Hence in Fig. 4.3, after applying the 2D compression algorithm, only the shadowed values are stored, while the other values are erased. The 2D algorithm is conceptually very simple to understand, but it is considerably more complex than the 1D one as far as the hardware implementation is concerned: performing a bi-dimensional analysis of the pixel array implies storing all the information in a digital buffer on CARLOS, thus requiring a larger silicon surface and a higher cost. A software sketch of the selection step is given below.
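Purely as an illustration of the double-threshold selection (a minimal C++ sketch with names of my own choosing, not the CARLOS VHDL), the scan over the 256x256 pixel matrix can be modelled as follows:

```cpp
#include <array>
#include <cstdint>

constexpr int N = 256;
using Matrix = std::array<std::array<uint8_t, N>, N>;
using Mask   = std::array<std::array<bool, N>, N>;

// Two-threshold 2D selection: a pixel above th_high is kept only if at
// least one of its N/S/E/W neighbours is above th_low; those neighbours
// are then kept as well. Isolated high peaks are suppressed.
Mask select2D(const Matrix& m, uint8_t th_high, uint8_t th_low) {
    Mask keep{};                              // all false initially
    const int di[4] = {-1, 1, 0, 0};          // N, S offsets
    const int dj[4] = {0, 0, -1, 1};          // W, E offsets
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            if (m[i][j] <= th_high) continue; // not a cluster candidate
            bool hasNeighbour = false;
            for (int k = 0; k < 4; ++k) {
                int ni = i + di[k], nj = j + dj[k];
                if (ni < 0 || ni >= N || nj < 0 || nj >= N) continue;
                if (m[ni][nj] > th_low) hasNeighbour = true;
            }
            if (!hasNeighbour) continue;      // isolated peak: rejected
            keep[i][j] = true;                // keep the cluster pixel ...
            for (int k = 0; k < 4; ++k) {     // ... and its TL neighbours
                int ni = i + di[k], nj = j + dj[k];
                if (ni < 0 || ni >= N || nj < 0 || nj >= N) continue;
                if (m[ni][nj] > th_low) keep[ni][nj] = true;
            }
        }
    return keep;
}
```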

4.1.3 Compression coefficient

Fig. 4.4 shows the 2D compression coefficient as a function of the high threshold value, calculated using data coming from the test beam of September 1998. The 2D compression algorithm reaches a compression ratio of 22 choosing a TH value of 1.5 times the noise RMS and a TL of 1.2 times the noise RMS. It is worth remembering that the 1D compression algorithm had to use a threshold level of 3 times the noise RMS in order to reach the target compression ratio. Thus the 2D algorithm outperforms the 1D one, since it reaches the target compression ratio while losing a smaller amount of physical information. This is the main reason why the 2D algorithm has been chosen as the one to be implemented in the final version of CARLOS.

Figure 4.4: 2D compression coefficient as a function of the high threshold

4.1.4 Reconstruction error

As far as the reconstruction error is concerned, the 2D algorithm also proves to perform better than the 1D one. The differences between the cluster centroid positions before and after compression are fitted by a Gaussian distribution centered around 0 with a σ value of 10 µm along the drift time direction and 10 µm along the anode direction, choosing 1.5 times the noise RMS for TH and 1.2 times the noise RMS for TL. Thus the 2D algorithm manages to achieve a better cluster center resolution than the 1D one by keeping track of more pixel values around the cluster center. Moreover the 2D algorithm introduces a smaller bias on the reconstructed charge than the 1D one, with a value of around 3 %, meaning that the reconstructed cluster charge is 3 % lower than before the compression and decompression steps.
Besides that, the 2D algorithm is very useful for the study of the noise distribution: monitoring the pairs of noise samples passing the double threshold filter allows one to recover information on the average and on the standard deviation of the Gaussian noise distribution. This is quite important for checking how the signal to background ratio changes in time.


If used in lossless mode, the 2D compression algorithm yields a compression ratio of 1.3 versus the 2.3 value obtained using the lossless version of the 1D algorithm. This requires a more complex second level compressor in the counting room in order to reach the target compression ratio of 22, in case the 2D compression algorithm cannot be applied to the data. In fact there are some cases in which the use of the 2D compression algorithm might prove no longer desirable: for example when the baseline value is not constant through the 256 samples of an anode row. This is the case of the present version of the PASCAL chip, which introduces a slope in each anode row baseline and, what is worse, the slope value varies from row to row. It is obvious that a fixed double-threshold compressor, like the one explained in this Chapter, cannot deal with this problem. Thus the foreseen solution is to eliminate the baseline slope in the final version of PASCAL. If this proves not to be possible, or if a sloped baseline behavior emerges after some working time, the use of the 2D algorithm can no longer be accepted. In this case data compression on CARLOS has to be switched off and a second level compression algorithm implemented directly in the counting room will do the job.

4.2 CARLOS v3 vs. the previous prototypes

There are several differences between CARLOS v3 and the previous

versions. This is a brief list containing the most important ones:

1. CARLOS v1 and v2 were meant to work in a radiation-free environment since, when they were designed, the problem of radiation had not been faced yet. Hence commercial technologies such as Xilinx FPGAs or the Alcatel Mietec design kit had been chosen for prototype implementation. The necessity for CARLOS to work in a radiation environment emerged some time after sending CARLOS v2 to the foundry. The radiation level CARLOS has to withstand is in the range from 5 to 15 krad. This led us to search for a radiation-safe technology.

One of the possible solutions is given by SOI (Silicon On Insulator) technology, which provides complete radiation resistance. This is the case, for instance, of the 0.8 µm DMILL technology that is being widely used even in satellite applications at ESA (European Space Agency). The problem related to this technology is mainly one: the cost is too high for our budget. Therefore we decided to choose a commercial technology, IBM 0.25 µm, with a library of standard cells designed to be radiation tolerant up to some Mrad. The library has been designed by the EP-MIC group at CERN.

2. Mechanical constraints emerged that do not allow the use of the SIU in the end-ladder zone, since it is far too big for the space available. Another problem concerning the SIU is that this device cannot safely work in a radiation environment, since it contains commercial devices such as ALTERA PLDs. Finally, the laser driver hosted on the SIU board has a mean life of a few years, while we are looking for something lasting until the end of the experiment data taking.
These considerations led us to change the whole readout architecture from CARLOS to the DAQ. Instead of directly interfacing the SIU, CARLOS v3 interfaces the radiation-tolerant serializer GOL chip (Gigabit Optical Link) [20]. Serial data is then sent to the counting room using a 200 m long optic fibre, deserialized using a commercial deserializer device and then sent to the SIU board through an FPGA device named CARLOS-rx that is still to be designed. This final readout architecture is shown in detail in Fig. 4.5.

3. CARLOS v3 contains only 2 data processing channels, versus the 8 hosted in the two previous prototypes. This choice was due to the need of reducing the ASIC complexity and of greatly reducing the possibility of losing data in case of chip failure. In fact, if a CARLOS v2 chip breaks down for some reason, data coming from a half-ladder, i.e. from 4 detectors, is completely lost until the chip is replaced with a working one. On the other hand, if a CARLOS v3 chip breaks down, only data coming from a single SDD detector are lost. Hence a 2-channel version of CARLOS provides a greater failure resistance and is far less complex.

4. CARLOS v3 contains a preliminary interface with the TTCrx chip

that distributes trigger signals and the clock to the end-ladder

board.

5. CARLOS v3 also contains a BIST structure (Built In Self Test)

for a quick test of the chip itself issued via the JTAG port.

Figure 4.5: The final readout chain


4.3 The final readout architecture

The chosen architecture for the final readout system introduces new tasks to carry out and new problems to solve.

For instance, splitting CARLOS into 4 chips makes every chip much simpler to design, test and control (CARLOS v2 is a very complex and difficult-to-debug chip), but moving the SIU board to the counting room implies the design of the CARLOS-rx device, taking data from 4 deserializer chips and feeding them to the SIU.

Besides that, putting a 200 m distance between CARLOS and the SIU implies that no back-pressure can be used: if the SIU asserts the filf-n signal, meaning that it cannot accept further data starting from the following foclk cycle, CARLOS receives this information only after 2 µs, i.e. after 40 foclk cycles. Hence the CARLOS-rx chip has to contain a well-sized FIFO buffer to store data when the SIU is not able to accept them.
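These figures are consistent with a rough estimate (my arithmetic, assuming a signal propagation speed of about 2 × 10^8 m/s in the fibre; the 20 MHz foclk frequency is the one implied by the quoted numbers):

$$
t_{\mathrm{round\,trip}} \simeq \frac{2 \times 200\ \mathrm{m}}{2 \times 10^{8}\ \mathrm{m/s}} = 2\ \mu\mathrm{s},
\qquad
2\ \mu\mathrm{s} \times 20\ \mathrm{MHz} = 40\ \mathrm{foclk\ cycles}.
$$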

The role of the JTAG link is shown in Fig. 4.6. In the new architecture a transaction can be opened and closed via the JTAG link, instead of using the 32-bit bus fbd. The JTAG link is obtained by serializing the 5-bit JTAG port coming from the SIU for transmission to the front-end zone through an optic fibre; then the HAL (Hardware Abstraction Layer) chip performs the serial to parallel conversion for distributing the JTAG signals to the PASCAL, AMBRA and CARLOS chips. A rad-hard version of the HAL chip has yet to be implemented.

Currently we plan to use a commercial pair of chips from Agilent Technologies for serializing and deserializing data: in the final architecture the serializer chip will be replaced by the rad-hard Gigabit Optical Link (GOL) chip designed by the Marchioro group at CERN. This chip is a multi-protocol high-speed transmitter ASIC which is able to withstand high doses of radiation. The IC supports two standard protocols, G-Link and GBit-Ethernet, and sustains data transmission at both 800 Mbit/s and 1.6 Gbit/s. The ASIC was implemented using the CERN 0.25 µm CMOS library employing radiation tolerant layout techniques.

Figure 4.6: Final readout chain zoom

A problem concerning the use of the GOL chip has yet to be solved: the TTCrx chip distributes to all the front-end chips a clock with a maximum jitter of around 300 ps. This is not a problem for the AMBRA and CARLOS ICs working at 40 MHz, but it proves to be a big problem for the GOL chip, since it contains an internal PLL that multiplies the incoming 40 MHz clock by 20 or 40, so as to get an internal 800 MHz or 1.6 GHz frequency. The PLL shows some synchronization problems with the incoming clock if the input jitter is greater than 100 ps. This problem has still to be faced and solved.

4.4 CARLOS v3

CARLOS v3 is our first prototype tailored to fit in the new readout

architecture. The main new features of this chip are:


– two processing channels;

– the radiation tolerant technology chosen.

Nevertheless CARLOS v3 does not contain the complete 2D compression algorithm, as might be expected. We made this choice in order to acquire experience with a small chip, with the new technology and with the new layout techniques, since we had to carry out the layout design task ourselves. Taking into account that the CERN 0.25 µm library contains a small number of standard cells and that they are not as well characterized as commercial ones, we decided to try the new design flow and the new technology with a simple chip: the result is CARLOS v3, which was sent to the foundry in November 2001 and will be tested starting from February 2002.

As a compression block, CARLOS v3 only hosts the simple encoding scheme conceived as the final part of the 2D algorithm. Nevertheless, if CARLOS v3 proves to work perfectly, it will be used to acquire data in the test beams and will allow us to build and test the foreseen readout architecture.

4.5 CARLOS v3 building blocks

Fig. 4.7 shows the main building blocks of CARLOS v3. The complete design of CARLOS v3 has been carried out in Bologna: I have worked on the VHDL models, while other people worked on the C++ models of the same blocks. Each block has been designed both in VHDL and in C++, so as to allow an easy verification and debugging process.
The two main processing channels are the ones built from the encoderbo, barrel15 and fifonew32x15 blocks together with the outmux block: these blocks take the data coming from the AMBRA chips, encode them using a lossless compression algorithm, pack them into 15-bit words and store them in a FIFO memory before sending them in output to the GOL chip, one channel after the other.


Figure 4.7: CARLOS v3 building blocks


The channel containing the ttc-rx-interface and fifo-trigger15x12 blocks receives the trigger numbers (bunch counter and event counter) from the TTCrx chip and sends them in output at the beginning of each data packet. The event-counter block is a local event number generator providing further information to be added to the event number coming from the TTCrx chip: this gives us greater confidence in being able to reconstruct the data and to find errors if present. Then a trigger-interface block handles the trigger signals L0, L1 and L2 coming from the Central Trigger Processor (CTP) through the TTCrx chip. A Command Mode Control Unit (CMCU) receives commands issued through the JTAG port and puts CARLOS in one of several logic states: running, idle, bist and so on. Finally, the BIST blocks on chip are based on a pseudo-random pattern generator and a signature maker circuit. The next paragraphs contain a detailed description of these blocks.

4.5.1 The channel block

The channel block is the main processing unit contained in CARLOS for data encoding, packing and storing. It is composed of three blocks: encoderbo, barrel15 and fifonew32x15. Two identical channel blocks are hosted on CARLOS v3.

4.5.2 The encoder block

The I/O signals are:

– value: input 8-bit bus;

– value-strobe: input signal;

– ck : input signal;

– reset : input signal;

– data: output 10-bit bus;

– field : output 4-bit bus;

– valid : output signal.

Input range   Output code        Total
0-1           1 bit + 000        4 bits
2-3           1 LSB bit + 001    4 bits
4-7           2 LSB bits + 010   5 bits
8-15          3 LSB bits + 011   6 bits
16-31         4 LSB bits + 100   7 bits
32-63         5 LSB bits + 101   8 bits
64-127        6 LSB bits + 110   9 bits
128-255       7 LSB bits + 111   10 bits

Table 4.1: Lossless compression algorithm encoding scheme


The encoderbo block encodes 8-bit input data into variable length codes, from 4 to 10 bits long, in a completely lossless way. Table 4.1 contains a detailed description of the encoding mechanism. This encoding scheme provides a compression of the input data based on the knowledge of the statistics of the stream: in fact small-value data are much more probable than high-value ones, hence most input data will be reduced from 8 to 4 or 5 bits, providing some degree of compression. It is nevertheless possible that locally, in time, this compressor provides an expansion of the data: if a long sequence of values greater than 127 occurs, the encoderbo block produces as output a stream of 10-bit codes, which have to be temporarily stored in a FIFO buffer. This is how the block actually works: when the input signal value-strobe is high, the 8-bit input value is encoded into the 10-bit output data and the valid output signal is asserted. The field output signal is assigned the number of bits actually containing information in the 10-bit data register. The block is synchronous with the rising edge of the clock, while the reset signal is active high and asynchronous. A small software model of this encoding scheme is sketched below.
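As an illustration only (not the encoderbo VHDL source), the scheme of Table 4.1 can be modelled by the following C++ fragment; placing the 3-bit range tag in the code LSBs is an assumption consistent with the barrel-word decoding described later, and all names are mine:

```cpp
#include <cstdint>

// Encoded word: 'bits' is the total code length (4..10); 'code' holds the
// 3-bit range tag in the LSBs with the retained value LSBs above it.
struct Code { uint16_t code; uint8_t bits; };

Code encode(uint8_t value) {
    // Range tag: 0 for 0-1, 1 for 2-3, 2 for 4-7, ..., 7 for 128-255.
    uint8_t tag = 0;
    while (tag < 7 && value >= (2u << tag)) ++tag;  // bounds 2,4,8,...,128
    uint8_t nlsb = (tag == 0) ? 1 : tag;            // value LSBs to keep
    uint16_t lsbs = value & ((1u << nlsb) - 1u);
    return { static_cast<uint16_t>((lsbs << 3) | tag),
             static_cast<uint8_t>(nlsb + 3) };
}
```

For instance, encode(5) returns the 5-bit code 01 010 (the two value LSBs followed by the tag 010), matching the 4-7 row of Table 4.1.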


Figure 4.8: Graphical description of how the barrel shifter works

4.5.3 The barrel15 block

The I/O signals are:

– input : input 8-bit bus;

– sel : input 4-bit bus;

– load : input signal;

– ck : input signal;

– reset : input signal;

– end-trace: input signal;

– output-push: output signal;

– output : output 15-bit bus.

The barrel15 block packs the 4 to 10 bit variable length codes coming from the encoderbo block into fixed length 15-bit words. Data are packed as shown in Figure 4.8. The barrel block makes use of two internal 15-bit registers, so as to be able to break an input code into two pieces without losing any information: when the first word is put in output by driving the output signal output-push low, the second register is used to store the incoming data. The latency of the barrel block is 2 clock periods: it takes 2 clock periods before a word is packed by the barrel15 block. When the input signal end-trace is asserted, meaning that the current data is the last one belonging to the current event, the value in the internal register is put in output even if the register is not completely full: the undefined bits are set to 0.
Data coming from the barrel can easily be reconstructed by starting from the 3 LSBs of the first barrel word, which contain the information on how many bits have to be selected on the left side of the code. By going on in this way from the LSB to the MSB of every valid word, it is possible to retrieve all the encoded information. A bit-level sketch of the packing step is given below.
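Again purely as an illustration (the real block emits one word at a time by pulsing output-push, while this sketch returns them all at once), the packing can be modelled in C++ as follows:

```cpp
#include <cstdint>
#include <vector>

struct Code { uint16_t code; uint8_t bits; };  // as in the previous sketch

// Accumulate variable-length codes LSB-first and emit fixed 15-bit words.
// A code may straddle two output words, which is what the two internal
// 15-bit registers of barrel15 allow; on end-trace the last, partially
// filled word is flushed with its undefined bits set to 0.
std::vector<uint16_t> pack15(const std::vector<Code>& codes) {
    std::vector<uint16_t> out;
    uint32_t acc = 0;   // bit accumulator
    int filled = 0;     // number of valid bits currently in acc
    for (const Code& c : codes) {
        acc |= static_cast<uint32_t>(c.code) << filled;
        filled += c.bits;
        if (filled >= 15) {                // a full 15-bit word is ready
            out.push_back(static_cast<uint16_t>(acc & 0x7FFF));
            acc >>= 15;
            filled -= 15;
        }
    }
    if (filled > 0)                        // end-trace flush, zero padding
        out.push_back(static_cast<uint16_t>(acc & 0x7FFF));
    return out;
}
```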

4.5.4 The fifonew32x15 block

The I/O signals are:

– push-req-n: input signal;

– pop-req-n: input signal;

– diag-n: input signal;

– data-in: input 15-bit bus;

– ck : input signal;

– reset : input signal;

– empty : output signal;

– almost-empty : output signal;

– half-full : output signal;

– almost-full : output signal;

– full : output signal;

– error : output signal;

– dataout : output 15-bit bus.

The fifonew32x15 block has the purpose of storing the information coming out of the barrel shifter. The multiplexing scheme that has been chosen cannot avoid the use of buffers before the multiplexer: since the output link is fairly allocated 50 % of the time to each channel (one clock period for channel 0, the next clock period for channel 1 and so on) and since the encoding algorithm can locally, in time, behave as an expander, data have to be stored locally before multiplexing. The only decision to be taken concerns the FIFO dimensions: we have chosen a FIFO containing 32 words coming from the barrel shifter (32x15 bits), in order to take into account the worst possible input data stream. The problem we have faced in designing the FIFO block is the following: a FIFO is usually composed of a dual port RAM block plus some logic implementing the First In First Out philosophy. This is for example what has been done in CARLOS v2. Nevertheless the CERN 0.25 µm library only provides one size of RAM memory, that is 64x32 bits. This block is at least 4 times bigger than the block we need (2048 bits versus 480). Besides that, it is quite difficult, if not impossible, to share the same RAM block between two different FIFO designs: the idea of sharing the FIFOs of the two channels is quite difficult to implement, since the number of read/write ports would have to be doubled. Therefore we decided to design a flip-flop based RAM for the FIFO, taken from the “Designer Foundation” library provided together with our design software Synopsys. This is a library containing IP (Intellectual Property) blocks ready to be inserted into a design, such as logic and arithmetic blocks, RAMs and application-specific blocks, for instance for error checking and correction or for a JTAG controller. The idea is that it is completely useless for every ASIC designer to lose time designing a block that is needed by hundreds of other designers all over the world. With this idea in mind, many IP libraries have been collected, such as the one provided by Synopsys that we have been making use of.

This is the behavior of the fifonew32x15 block: a push is executed when the push-req-n input is asserted (low) and either the full flag is inactive (low) or the full flag is active and the pop-req-n input is asserted (low). Hence a push can occur even if the FIFO is full, as long as a pop is executed in the same clock period. Asserting push-req-n in either of the above cases causes the data at the data-in port to be written to the next available location in the FIFO. A pop operation occurs when pop-req-n is asserted (low), as long as the FIFO is not empty. Asserting pop-req-n causes the internal read pointer to be incremented on the next rising edge of ck; thus the RAM read data must be captured on the ck edge following the assertion of pop-req-n. Push and pop can occur at the same time if there is data in the FIFO, even when the FIFO is full: in this case first the pop data is captured by the stage of logic following the FIFO, and then the new data is pushed into the same location from which the data was popped. Hence there is no conflict in a simultaneous push and pop when the FIFO is full. A simultaneous push and pop cannot occur when the FIFO is empty, since there is no pop data to prefetch. A compact software model of these rules is sketched below.
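The push/pop protocol can be condensed into a short behavioral model; this C++ sketch (hypothetical names, request signals already decoded as active-true booleans) is a functional illustration only, not the Synopsys IP actually instantiated:

```cpp
#include <array>
#include <cstdint>

// Behavioral model of the 32x15 FIFO push/pop rules described in the text.
class Fifo32x15 {
    std::array<uint16_t, 32> mem{};
    unsigned rd = 0, wr = 0, count = 0;
public:
    bool empty() const { return count == 0; }
    bool full()  const { return count == 32; }
    bool almostFull() const { return 32 - count <= 8; } // <= 8 free slots
    // One clock edge; 'dataOut' is written only when a pop actually occurs.
    void clock(bool push, bool pop, uint16_t dataIn, uint16_t& dataOut) {
        bool doPop  = pop && !empty();
        bool doPush = push && (!full() || doPop); // push allowed on full+pop
        if (doPop)  { dataOut = mem[rd]; rd = (rd + 1) % 32; --count; }
        if (doPush) { mem[wr] = dataIn;  wr = (wr + 1) % 32; ++count; }
    }
};
```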

The FIFO block provides some important flags, such as empty, almost-full and full. The empty flag indicates that there are no words in the FIFO available to be popped. The almost-full flag is asserted when there are no more than 8 empty locations left in the FIFO. This number is used as a threshold and is very useful for preventing the FIFO from overflowing: when this flag is asserted, the data-stop signal output from CARLOS is sent to the AMBRA chip, asking it to stop the data stream transmission. AMBRA requires 3 clock cycles before it actually stops sending data to CARLOS, so the threshold level of 8 chosen for the FIFO design has to account for this 3 clock period delay due to AMBRA and for the latency due to the encoder and barrel blocks. This flag is thus very useful for managing data transmission between AMBRA and CARLOS without losing any data. The last flag, full, indicates that the FIFO is full and there is no space available for pushing data: if the AMBRA - CARLOS communication works well, this flag should never be asserted. Fig. 4.9 shows the FIFO timing waveforms during the push phase, while Fig. 4.10 shows the FIFO timing waveforms during the pop phase.


Figure 4.9: FIFO timing waveforms during the push phase

Figure 4.10: FIFO timing waveforms during the pop phase

4.5.5 The channel-trigger block

The channel-trigger block has the purpose of getting the trigger numbers from the TTCrx chip and storing them before they are multiplexed and sent to the GOL chip. It is composed of two different blocks: the ttc-rx-interface and the fifo-trigger block.

4.5.6 The ttc-rx-interface block

The I/O signals are:

– TTCready : input signal;

– BCnt : 12-bit input bus;

– BCntLStr : input signal;

– EvCntLStr : input signal;

– EvCntHStr : input signal;

– ck : input signal;

– reset : input signal;

– BCnt-reg : output 12-bit bus;

– EvCntL-reg : output 12-bit bus;

– EvCntH-reg : output 12-bit bus.

The ttc-rx-interface block receives trigger information from the TTCrx chip when the input signal TTCready coming from the TTCrx chip is high, meaning that the TTCrx is ready. When BCntStr is high, the 12-bit input word is latched into the register BCnt-reg; the same happens with EvCntLStr and EvCntHStr for the LSBs and MSBs of the 24-bit event counter word. When an active L2accept signal follows, the values of these three registers are written into 3 memory locations of the fifo-trigger block. Since the event can be discarded at any time until the final confirmation arrives through the L2accept signal, it is necessary to wait for such a signal before storing these values in the FIFO.

4.5.7 The fifo-trigger block

This block is logically equivalent to the FIFO block except for its dimensions: its size is 15 words of 12 bits. During the transmission of a complete event from AMBRA to CARLOS, which lasts 1.6 ms, up to four events can be stored in the AMBRA chip; hence CARLOS has to process 4 triplets of incoming signals L0, L1accept and L2accept. Thus a 15 word deep FIFO (3 trigger words per event) is necessary for storing the bunch counter and event counter information concerning 5 consecutive accepted events. When CARLOS is ready to send a data packet in output, the first 3 trigger words are read and taken to the outmux block; in this way a correct synchronization between the data being sent and the trigger information is preserved. The output flags of the fifo-trigger block (empty, almost-full and full) are not used by other blocks as a control, since, thanks to the structure of the AMBRA chip, we do not expect a buffer overflow.

4.5.8 The event-counter block

The I/O signals are:

– end-trace: input signal;

– ck : input signal;

– reset : input signal;

– event-id : output 3-bit bus.

A local event counting is performed on CARLOS thanks to the event-counter block. It is a very simple 3-bit counter triggered by the event-ident signal coming from the outmux block: this signal indicates that an event has been completely transmitted and a new one can be accepted. This number is used both in the header and in the footer words for a safer transmission protocol.

4.5.9 The outmux block

The I/O signals are:

– indat1 : input 15-bit bus;


– indat0 : input 15-bit bus;

– trigger-data: input 12-bit bus;

– reset : input signal;

– ck : input signal;

– gol-ready : input signal;

– fifo-empty : input 2-bit bus;

– half-ladder-id : input 7-bit bus;

– all-fifos-empty : input signal;

– event-id : input 3-bit bus;

– no-input-data: input signal;

– event-identifier : output signal;

– read-data: output 2-bit bus;

– read-trigger : output signal;

– output-strobe: output signal;

– output : output 16-bit bus.

The outmux block is a multiplexing unit that sends in output the data coming from the two main processing channels in an interlaced way, meaning that during the even clock periods data coming from channel 1 are put in output, while during the odd clock periods data coming from channel 0 are served.
This is the way the outmux block behaves: as soon as data begin to fill the two FIFO blocks, the outmux block begins to put in output a packet like the one shown in Fig. 4.11. The first 3 16-bit words contain the trigger information coming from the trigger channel: the first word contains the bunch counter, while the second and third words contain the event counter MSBs and LSBs respectively. Since the trigger information words are 12 bits long, the bits 1011 are prepended as MSBs in order to make them easy to recognize in a later phase of data reconstruction.

Figure 4.11: CARLOS v3 data transmission protocol

Two header words follow, containing the local event-id number and the externally hardwired half-ladder-id information; the MSBs of the header words are 110. The headers are followed by an even number of data words containing data from the two main channels: if a channel has no valid data to send, the MSB is set to 1 and all the other bits are set to 0, meaning that a dummy word is sent in output; otherwise the MSB is set to 0, meaning that the data word is valid. The data packet is then concluded with the transmission of two footer words containing the same information as the header regarding the event-id number, plus the number of words sent in output; their MSBs are set to 1, so as to uniquely identify the footer word type. A sketch of this word tagging is given below.
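As an informal illustration of the tagging just described (the exact footer tag width is not spelled out in the text, so the 3-bit 111 tag assumed below, like all the helper names, is mine), the 16-bit words could be built as follows:

```cpp
#include <cstdint>

// Illustrative builders for the packet words of Fig. 4.11.
uint16_t triggerWord(uint16_t counter12) {        // 1011 + 12 bits
    return 0xB000 | (counter12 & 0x0FFF);
}
uint16_t headerWord(uint16_t payload13) {         // 110 + 13 bits
    return 0xC000 | (payload13 & 0x1FFF);
}
uint16_t footerWord(uint16_t payload13) {         // 111 + 13 bits (assumed)
    return 0xE000 | (payload13 & 0x1FFF);
}
uint16_t dataWord(bool valid, uint16_t data15) {  // MSB 0 = valid data
    return valid ? (data15 & 0x7FFF) : 0x8000;    // dummy word = 1000...0
}
```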

The outmux block puts in output the 16-bit data words and the signal output-strobe: when this signal is high, CARLOS is transmitting data belonging to a packet, while when it is low CARLOS is not sending useful information to the GOL chip. When the gol-ready signal coming from the GOL chip goes low, meaning that it has lost synchronization with the input clock, CARLOS stops sending data and begins transmission again only when gol-ready goes high. The outmux block also puts in output the 2-bit signal read-data, sent to the 2 main FIFOs as a pop signal, and the signal read-trigger, sent to the fifo-trigger block. The outmux block also asserts the signal event-ident, which is used as a trigger for the event-counter block. The input signal all-fifos-empty puts an end to the data packet transmission when the end of an event has been reached: after the input signals data-end1 and data-end0 have gone high, CARLOS waits until both FIFOs are empty in order to assert the all-fifos-empty signal. This triggers the end of an event transmission.

4.5.10 The trigger-interface block

The I/O signals are:

– reference-count-trigger : input 8-bit bus;

– L0 : input signal;

– L1accept : input signal;

– L2accept : input signal;

– L2reject : input signal;

– dis-trigger : input signal;

– ck : input signal;

– reset : input signal;

– busy : output signal;

– trigger : output signal;

– abort : output signal.

This block accepts as inputs the trigger signals L0, L1accept, L2accept and L2reject. A brief description follows of how these signals can be used for accepting or rejecting an event for storage: the L0 signal is asserted 1.2 µs after the interaction; the L1accept signal is asserted 5.5 µs after the interaction, and if it is not asserted in time the event is rejected; L2accept is asserted 100 µs after the interaction if the event is accepted, otherwise an L2reject signal is asserted before 100 µs. This means that either an L2accept signal or an L2reject signal is always asserted.
The trigger-interface block receives these inputs and processes them to build 3 other signals: trigger, busy and abort. The trigger signal is L0 delayed by a number of clock cycles programmable via JTAG and is distributed to the PASCAL and AMBRA chips; this is the signal triggering an event data acquisition on the PASCAL chip.
The busy signal is asserted just after L0, then stays in the active state until 5.5 µs after the interaction. If the signal L1accept is not asserted, busy goes low again, otherwise it stays active until the signal dis-trigger coming from AMBRA is activated. The meaning is the following: while PASCAL is transferring data to AMBRA, the readout system is not ready to accept any other trigger signal, that is to acquire any other data. The time necessary for the transmission of an event from PASCAL to AMBRA is about 360 µs. Finally, the abort signal that CARLOS sends to AMBRA is asserted when the L1accept signal is not asserted at the prefixed time or when the L2reject signal is asserted. The abort signal causes data transmission from PASCAL to AMBRA to end, and the data already stored are discarded.

4.5.11 The cmcu block

The I/O signals are:

– tdi : input signal;

– tms : input signal;

– trst : input signal;

– tck : input signal;


– bist-ok-tcked : input signal;

– bist-failure-tcked : input signal;

– ck : input signal;

– reset : input signal;

– reference-count-trigger : output 8-bit bus;

– tdo: output signal;

– state-tcked : output signal;

– reset-pipe: output signal.

Figure 4.12: CMCU logic state diagram

The Command Mode Control Unit (cmcu) is the CARLOS internal control unit, remotely controlled via the JTAG port. Serial data coming from the JTAG pin tdi are packed into 8-bit words and interpreted as a very simple program containing commands and operands. Fig. 4.12 shows the CARLOS working states reachable using the JTAG port.
At power-on CARLOS is put in an IDLE state in which no calculation is performed. Then it can be put in a RESET-PIPELINE state in which an internal reset signal is asserted and all registers are initialized. The following state is the BIST (Built In Self Test) state, in which CARLOS runs an internal test at working speed to check whether everything is working properly; depending on the test results CARLOS then enters the BIST-FAILURE state or the BIST-SUCCESS state. In case of success the 8-bit word sent serially as output on tdo is A0, otherwise the word is 55. In the WRITE-REG state CARLOS prepares to write an internal register with the value read via JTAG in the next state, WRITE-REG-FETCH: this register contains the number of clock cycles of delay to be applied to the incoming L0 signal before passing it to the AMBRA chip. If needed, during the READ-REG state the CARLOS user can read this value back through the tdo output JTAG pin, to check that no errors occurred during the writing phase. Then CARLOS can finally enter the RUNNING state, in which it is able to accept and process input data streams and to manage the interfaces towards the GOL and TTCrx chips. When CARLOS is not in RUNNING mode the busy signal is set high, meaning that no L0 trigger signal is accepted from the CTP and no data is transmitted to the GOL chip.

4.5.12 The pattern-generator block

The I/O signals are:

– bist-start : input signal;

– ck : input signal;

– reset : input signal;

– data: output 8-bit bus;

– data-valid : output signal;

– data-end : output signal.

The pattern generator block is part of the BIST utility implemented on CARLOS v3. The BIST [21, 22] is an in-circuit testing scheme for digital circuits in which both test generation and test verification are done by circuitry built into the chip itself. BIST schemes offer three attractive advantages:

1. they offer a solution to the problem of testing large integrated circuits with a limited number of I/O pins;

2. they are useful for high speed testing since they can run at design speed;

3. they do not require expensive external automatic test equipment (ATE).

BIST schemes, in the most general sense, can have any of the following characteristics:

– concurrent or non-concurrent operation: concurrent testing is designed to detect faults during normal circuit operation, while non-concurrent testing requires that normal operation be suspended during testing. In CARLOS v3 non-concurrent operation has been chosen, since we decided to use the BIST only to check the correct behavior of the chip off-line;

– exhaustive or non-exhaustive test design: an exhaustive test of a circuit requires that every intended state of the circuit be shown to exist and that all transitions be demonstrated. For large sequential circuits like CARLOS this is not practical, so we decided to implement a non-exhaustive testing design;

– deterministic or pseudo-random generation of test vectors: deterministic testing occurs when specific, purposely produced vectors have to be applied, while pseudo-random testing occurs when random-like test vectors are produced. We chose the pseudo-random generation since its implementation requires much less area than the deterministic one. Pseudo-random generation on CARLOS v3 is performed by the pattern generator block.

The pattern generator block provides a set of 200 pseudo-random test vectors for the BIST. These vectors are provided at the same time to both processing channels. The pseudo-random sequence is obtained using a linear feedback shift register, which is a very simple structure and requires a very small on-chip area; a software sketch of such a generator is given below.
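For illustration, here is a 16-bit Fibonacci LFSR in C++; the width, taps and seed are a common maximal-length choice (polynomial x^16 + x^14 + x^13 + x^11 + 1, period 65535), not necessarily those implemented on CARLOS v3:

```cpp
#include <cstdint>

// One step of a 16-bit Fibonacci LFSR with taps 16, 14, 13, 11.
// The state must be seeded with a non-zero value.
uint16_t lfsrStep(uint16_t s) {
    uint16_t bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1u;
    return static_cast<uint16_t>((s >> 1) | (bit << 15));
}

int main() {
    uint16_t s = 0xACE1u;                // arbitrary non-zero seed
    for (int i = 0; i < 200; ++i) {      // 200 test vectors, as in the text
        s = lfsrStep(s);
        uint8_t vector = static_cast<uint8_t>(s & 0xFF);  // 8-bit data bus
        (void)vector;  // would be fed to both processing channels at once
    }
    return 0;
}
```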

4.5.13 The signature-maker block

The I/O signals are:

– bist-vector : input 16-bit bus;

– ck : input signal;

– reset : input signal;

– bist-strobe: output signal;

– signature: output 16-bit bus.

The signature maker block performs the signature analysis. In signature analysis, the test responses of a system are compacted into a signature using a linear feedback shift register (LFSR). Then the signature of the device under test is compared with the expected (reference) signature: if they match, the device is declared fault free, otherwise it is declared faulty. Since several thousands of test responses are compacted into a few bits of signature by an LFSR, there is an information loss; as a result some faulty devices may have the same correct signature. The probability of a faulty device having the same signature as a working device is called the probability of aliasing, and it is shown to be approximately 2^-m, where m denotes the number of bits in the signature.
The signature register implemented on CARLOS is 16 bits wide, so the probability of aliasing is 2^-16, i.e. about 1.5 x 10^-5. The signature maker block takes the 16-bit bist-vector words coming from the outmux block, performs the signature analysis and then, once the FIFOs have been completely emptied, asserts the bist-strobe signal when the signature value is ready.


Figure 4.13: Digital design flow for CARLOS v3

4.6 Digital design flow for CARLOS v3

Fig. 4.13 shows in some detail the digital design flow we have used for the design of CARLOS v3 with the CERN 0.25 µm library. Since it is quite a recent library, we had to face some problems: for instance the small number of standard cells, the lack of 3-state buffers, the lack of worst-case cell models, the fact that only Verilog models for the cells, and not VHDL models, were provided, and so on.
The reason for these shortcomings lies in the fact that up to now very few chips have been realized and tested using this library, so not much characterization work could be done.

Hence we had to learn how to use the Cadence Verilog XL software for post-synthesis simulations, since Synopsys allows the simulation of VHDL models only. Our main difficulty was due to the necessity of using VHDL-written testbenches for logic simulation and Verilog-written ones for netlist simulation: this can be very error-prone, since it is quite difficult to exactly match the two models together.

Besides that, we had to learn how to use Cadence Silicon Ensemble for the place and route job. This is a really difficult job when the standard cells are not completely characterized. We received great help from the Marchioro group, especially concerning the back-end design flow. They suggested that we follow a completely flat approach to the problem, since the chip is very small: the hierarchical approach, i.e. designing the layout of each block and then routing the blocks together, is only worthwhile when dealing with chip complexities one order of magnitude greater than ours.

4.7 CARLOS layout features

Fig. 4.14 shows a picture of the final layout of CARLOS v3, as it has been sent to the foundry. As one can easily observe it is pad-limited, i.e. the total silicon surface is determined by the number of I/O pads (100) and not by the number of standard cells it contains. Adding some extra logic would not imply any additional cost if contained in the area that is now empty; hence we hope that adding the 2D compression logic will not substantially increase the chip area and, consequently, the production cost. The total area is 16 mm², corresponding to the minimal size the silicon wafer was divided into.

CARLOS v3 is a fairly simple chip compared to CARLOS v2 with its 300 kgates of logical complexity: in fact it contains only 10 kgates. Nevertheless it has been designed in order to test our approach to the new library and to verify that we were able to run through all the design flow steps. Our final check will be the test of the chip itself, in order to verify that everything was correctly designed, so as to have very clear ideas for the design of the final version of CARLOS.

Figure 4.14: CARLOS v3 layout picture

A specific PCB is in the design phase right now: it will contain only the chip itself and the connectors for probing with the Tektronix pattern generator and logic analyzer pods. Unlike CARLOS v2, the chip will be bonded into a PGA package and inserted on the PCB using a ZIF socket. This will allow us to test the 100 samples of the chip using only a few PCB samples.


Chapter 5

Wavelet based compression algorithm

As an alternative to the 1D and 2D compression algorithms conceived at the INFN Section of Torino, our group in Bologna decided to study other compression algorithms that may be used as a second level compressor on SDD data. After studying the main standard compression algorithms, we decided to focus on a wavelet-based compression algorithm and on its performance when used to compress SDD data.
The wavelet based compression algorithm design can be divided into 4 steps, requiring the use of different software tools:

1. choice of the main features of the algorithm;

2. optimization of the algorithm with respect to SDD data using the Matlab Wavelet Toolbox [23];

3. choice of the architecture for the implementation of the algorithm using Simulink [24];

4. comparison between the performance of the wavelet algorithm and that of the algorithms implemented on the CARLOS prototypes, in terms of compression ratio and reconstruction error.


5.1 Wavelet based compression algorithm

The idea of compressing SDD data using a multiresolution based compression algorithm comes from the growing success of this technique, both for uni-dimensional and bi-dimensional signal compression.
Multiresolution analysis gives an equivalent representation of an input signal in terms of approximation and detail coefficients; these coefficients can then be encoded using standard techniques, such as run length encoding.
An SDD event, i.e. the data coming from a half-SDD, can be analyzed as a uni-dimensional data stream of 64k samples or as a bi-dimensional structure of 256 by 256 elements. Hence the first choice to be taken is whether to implement a 1D or a 2D multiresolution analysis.

In 1D analysis the signal can be written as:

$$
S = \Big(\underbrace{s_1, s_2, \ldots, s_{256}}_{\text{1st anode}},\ \underbrace{s_{257}, s_{258}, \ldots, s_{512}}_{\text{2nd anode}},\ \ldots,\ \underbrace{s_{65281}, s_{65282}, \ldots, s_{65536}}_{\text{256th anode}}\Big)
\tag{5.1}
$$

In 2D analysis the signal can be written as:

$$
S = \begin{pmatrix}
s_{1,1} & s_{1,2} & \cdots & s_{1,256}\\
s_{2,1} & s_{2,2} & \cdots & s_{2,256}\\
\vdots & \vdots & \ddots & \vdots\\
s_{256,1} & s_{256,2} & \cdots & s_{256,256}
\end{pmatrix}
\begin{matrix}
\text{1st anode}\\
\text{2nd anode}\\
\vdots\\
\text{256th anode}
\end{matrix}
\tag{5.2}
$$

In the case of 1D analysis, once the two decomposition filters H and G have been chosen, the multiresolution analysis can be applied with a number of levels, i.e. the number of cascaded filters, between 1 and 16. In this way an orthogonal wavelet decomposition C with 64k coefficients is produced: the ratio of the number of approximation coefficients a_i to the number of detail coefficients d_i depends on the number of decomposition levels used:

$$
\begin{aligned}
S &= \big(s_1, \ldots\ldots\ldots, s_{65536}\big) && \text{0 decomposition levels}\\[0.5ex]
C &= \big(\underbrace{a_1, \ldots, a_{32768}}_{\text{approx. coeffs.}},\ \underbrace{d_{32769}, \ldots, d_{65536}}_{\text{detail coeffs.}}\big) && \text{1 decomposition level}\\[0.5ex]
C &= \big(\underbrace{a_1, \ldots, a_{16384}}_{\text{approx. coeffs.}},\ \underbrace{d_{16385}, \ldots, d_{65536}}_{\text{detail coeffs.}}\big) && \text{2 decomposition levels}\\[0.5ex]
C &= \big(\underbrace{a_1, \ldots, a_{8192}}_{\text{approx. coeffs.}},\ \underbrace{d_{8193}, \ldots, d_{65536}}_{\text{detail coeffs.}}\big) && \text{3 decomposition levels}\\
&\quad\vdots\\
C &= \big(\underbrace{a_1, a_2, a_3, a_4}_{\text{approx. coeffs.}},\ \underbrace{d_5, \ldots, d_{65536}}_{\text{detail coeffs.}}\big) && \text{14 decomposition levels}\\[0.5ex]
C &= \big(\underbrace{a_1, a_2}_{\text{approx. coeffs.}},\ \underbrace{d_3, \ldots, d_{65536}}_{\text{detail coeffs.}}\big) && \text{15 decomposition levels}\\[0.5ex]
C &= \big(\underbrace{a_1}_{\text{approx. coeff.}},\ \underbrace{d_2, \ldots, d_{65536}}_{\text{detail coeffs.}}\big) && \text{16 decomposition levels}
\end{aligned}
$$

In the case of 2D analysis, once the two decomposition filters H and G have been chosen, the bi-dimensional decomposition scheme is applied with a number of levels between 1 and 8. First, the multiresolution analysis is applied to each row of the 2D signal; then each column resulting from the previous analysis is decomposed using the same number of levels.

In this way the 2D signal (5.2) is transformed into the 2D orthogonal wavelet decomposition, containing 64k coefficients; in this case too the ratio of the number of approximation coefficients to the number of detail coefficients depends on the number of decomposition levels applied:

$$
\begin{aligned}
S &= \begin{pmatrix}
s_{1,1} & \cdots & s_{1,256}\\
\vdots & & \vdots\\
s_{256,1} & \cdots & s_{256,256}
\end{pmatrix} && \text{0 decomposition levels}\\[1ex]
C &= \begin{pmatrix}
a_{1,1} & \cdots & a_{1,128} & d_{1,129} & \cdots & d_{1,256}\\
\vdots & & \vdots & \vdots & & \vdots\\
a_{128,1} & \cdots & a_{128,128} & d_{128,129} & \cdots & d_{128,256}\\
d_{129,1} & \cdots & d_{129,128} & d_{129,129} & \cdots & d_{129,256}\\
\vdots & & \vdots & \vdots & & \vdots\\
d_{256,1} & \cdots & d_{256,128} & d_{256,129} & \cdots & d_{256,256}
\end{pmatrix} && \text{1 decomposition level}\\
&\quad\vdots\\
C &= \begin{pmatrix}
a_{1,1} & a_{1,2} & d_{1,3} & \cdots & d_{1,256}\\
a_{2,1} & a_{2,2} & d_{2,3} & \cdots & d_{2,256}\\
d_{3,1} & d_{3,2} & d_{3,3} & \cdots & d_{3,256}\\
\vdots & \vdots & \vdots & & \vdots\\
d_{256,1} & d_{256,2} & d_{256,3} & \cdots & d_{256,256}
\end{pmatrix} && \text{7 decomposition levels}\\[1ex]
C &= \begin{pmatrix}
a_{1,1} & d_{1,2} & \cdots & d_{1,256}\\
d_{2,1} & d_{2,2} & \cdots & d_{2,256}\\
\vdots & \vdots & & \vdots\\
d_{256,1} & d_{256,2} & \cdots & d_{256,256}
\end{pmatrix} && \text{8 decomposition levels}
\end{aligned}
$$

Applying multiresolution analysis to SDD data proves to be useful since the approximation coefficients feature high values, as they represent the signal approximation, while the detail coefficients feature values near 0. Hence, in order to get compression, the detail coefficients can be eliminated without losing significant information on the input signal.
An easy and effective technique for compressing data after multiresolution analysis is to apply a threshold level to every coefficient a_i and d_i: what we expect is that the approximation coefficients a_i remain unchanged, while the detail coefficients d_i are all put to 0. This is useful since the long zero sequences coming from the detail coefficients can be further compressed using the run length encoding technique. The multiresolution based compression algorithm described so far is a lossy technique, but it can be used in a lossless way by not applying the threshold to the wavelet coefficients. A sketch of the thresholding and run length encoding steps is given below.
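As an illustration of these two steps (names are mine; the study itself was carried out with the Matlab Wavelet Toolbox, as described below), a C++ sketch that zeroes the sub-threshold coefficients and then run-length encodes the zero runs:

```cpp
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Zero every coefficient whose magnitude is below 'th', then encode the
// stream as (zero-run length, nonzero value) pairs: a simple run length
// encoding exploiting the long zero runs left by the detail coefficients.
std::vector<std::pair<uint32_t, float>>
thresholdAndRle(std::vector<float> c, float th) {
    std::vector<std::pair<uint32_t, float>> out;
    uint32_t run = 0;
    for (float& x : c) {
        if (std::fabs(x) < th) x = 0.0f;   // thresholding
        if (x == 0.0f) { ++run; continue; }
        out.push_back({run, x});           // 'run' zeros, then the value x
        run = 0;
    }
    if (run > 0) out.push_back({run, 0.0f});  // trailing run of zeros
    return out;
}
```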

5.1.1 Configuration parameters of the multiresolution algorithm

Some algorithm parameters can be tuned in order to get the best performance in terms of compression ratio and reconstruction error. These parameters are:

– the pair of decomposition filters H and G, used to implement the multiresolution analysis;

– the number of dimensions used for the analysis: 1D or 2D;

– the number of decomposition levels;

– the threshold value applied to the a_i and d_i coefficients.

5.2 Multiresolution algorithm optimization

The multiresolution algorithm optimization has been carried out using the Wavelet Toolbox from Matlab.
First, the pair of decomposition filters that, for a fixed value of the threshold, gives the highest number of null coefficients a_i and d_i and the lowest reconstruction error has been chosen; then the other 3 parameters have been evaluated one after the other for optimization.


5.2.1 The Wavelet Toolbox from Matlab

The Wavelet Toolbox is a collection of Matlab functions that, through Matlab command lines and a user-friendly graphical interface, allows one to develop wavelet techniques and apply them to real problems. In particular the Wavelet Toolbox allowed us to:

– perform the multiresolution analysis of a signal and the corresponding synthesis, using a wide variety of decomposition and reconstruction filters;

– treat signals as uni-dimensional or bi-dimensional;

– analyze signals on a variable number of levels;

– apply different threshold levels to the obtained coefficients a_i and d_i.

The wide choice of filters corresponds to the large number of wavelet families implemented by the Wavelet Toolbox, shown in Tab. 5.1 and in Fig. 2.10, Fig. 2.11 and Fig. 2.12.

Family                          Name identifier
Haar wavelet                    'haar'
Daubechies wavelets             'db'
Symlets                         'sym'
Coiflets                        'coif'
Biorthogonal wavelets           'bior'
Reverse Biorthogonal wavelets   'rbio'

Table 5.1: Wavelet families used for multiresolution analysis

In particular the Haar family is composed of the wavelet function ψ(x) and its corresponding scale function φ(x), already discussed in Chapter 2. On the other hand, each Daubechies, Symlets and Coiflets family is composed of more than one pair of functions ψ(x) and φ(x): the Daubechies family pairs are named db1, . . . , db10, the Symlets family pairs are named sym2, . . . , sym8, while the Coiflets family pairs are named coif1, . . . , coif5.


The Biorthogonal (bior1.1, . . . , bior6.8) and Reverse Biorthogonal (rbio1.1, . . . , rbio6.8) families are composed of quartets of functions ψ1(x), φ1(x), ψ2(x) and φ2(x), where the first pair is used for decomposition and the second for reconstruction. Using a particular function of the Wavelet Toolbox, which requires the name of the chosen pair of functions ψ(x) and φ(x), or the name of the quartet ψ1(x), φ1(x), ψ2(x) and φ2(x) when using the Biorthogonal and Reverse Biorthogonal families, it is possible to determine the impulse responses representing, respectively, the low pass filter H and the high pass filter G used for decomposition, and the low pass filter H̄ and the high pass filter Ḡ used in the reconstruction stage.

Multiresolution analysis and synthesis are computed as described in Chapter 3: in particular, the analysis step is performed with a convolution operation between the input signal and the filters H and G, followed by decimation, while the synthesis is performed with up-sampling, followed by a convolution operation between the signal and the filters H̄ and Ḡ.
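As an illustration of this scheme, the following minimal sketch (Python with NumPy, one decomposition level only; it uses the orthonormal Haar filters, and the sign convention adopted for G is one of several found in the literature) performs the analysis as convolution with H and G followed by decimation, and the synthesis as up-sampling followed by convolution with H̄ and Ḡ:

    import numpy as np

    s2 = 1 / np.sqrt(2.0)
    H,  G  = np.array([s2, s2]), np.array([s2, -s2])    # decomposition filters H, G
    Hr, Gr = np.array([s2, s2]), np.array([-s2, s2])    # reconstruction filters H̄, Ḡ

    def analyze(s):
        """One level of analysis: convolution with H and G, then decimation by 2."""
        a = np.convolve(s, H)[1::2]   # approximation coefficients a_i
        d = np.convolve(s, G)[1::2]   # detail coefficients d_i
        return a, d

    def synthesize(a, d):
        """One level of synthesis: up-sampling by 2, then convolution with H̄ and Ḡ."""
        ua = np.zeros(2 * len(a)); ua[::2] = a
        ud = np.zeros(2 * len(d)); ud[::2] = d
        return (np.convolve(ua, Hr) + np.convolve(ud, Gr))[:2 * len(a)]

    s = np.array([10., 12., 30., 29., 8., 8., 7., 9.])
    a, d = analyze(s)
    assert np.allclose(synthesize(a, d), s)  # perfect reconstruction with no threshold

Without a threshold the cascade reconstructs the input exactly (up to machine precision), which is what makes the lossless mode of the algorithm possible.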

5.2.2 Choice of the filters

In order to choose the best filters H, G, H̄ and Ḡ for SDD data compression, ten 64-kbyte SDD events have been analyzed with the Wavelet Toolbox, using the wavelet families shown in Tab. 5.1.

Each signal S, interpreted both as uni-dimensional as in Fig. 5.1 and as bi-dimensional as in Fig. 5.2, has been processed in the following way:

– after choosing a pair of functions ψ(x) and φ(x) or the quartet ψ1(x), φ1(x), ψ2(x), φ2(x), the corresponding filter coefficients H, G, H̄ and Ḡ have been determined;

– the signal S has been analyzed using the filters H and G obtaining

the decomposition coefficients C;

– a threshold th has been applied to the coefficients C, obtaining

the modified coefficients Cth;


[Figure: stacked plots of the 65536-sample signal s and of its coefficients a5, d5, d4, d3, d2, d1; decomposition at level 5: s = a5 + d5 + d4 + d3 + d2 + d1.]

Figure 5.1: Uni-dimensional analysis on 5 levels of the signal S

– the coefficients Cth have been synthesized into the signal R, using the filters H̄ and Ḡ.

Both in the uni-dimensional and in the bi-dimensional case, the compression performance has been quantified through the percentage P of null coefficients in Cth, while the reconstruction-error performance has been quantified through the root mean square error E between the original signal S and the signal R obtained after the analysis and synthesis of Cth.

In particular, since the total number of elements in Cth is 65536, in the uni-dimensional case the parameter P can be expressed in the following way:

\[ P = \frac{100 \cdot (\text{number of null coefficients in } C_{th})}{65536} \tag{5.3} \]

[Figure: Wavelet Toolbox 2D view of a 256×256 event: original image, decomposition at level 5, approximation coefficients at level 5 (dwt) and synthesized image (idwt).]

Figure 5.2: Bi-dimensional analysis on 5 levels of the signal S

The total number of elements in S and in R is also 65536, so, if s_i is the i-th element of the uni-dimensional signal S and r_i is the i-th element of R, the parameter E can be expressed in the following way:

\[ E = \sqrt{\frac{1}{65536} \sum_{i=1}^{65536} (s_i - r_i)^2} \tag{5.4} \]

In the bi-dimensional case P is calculated in the same way while, naming s_{i,j} the (i,j)-th element of S and r_{i,j} the (i,j)-th element of R, the parameter E can be expressed in the following way:

\[ E = \sqrt{\frac{1}{65536} \sum_{i=1}^{256} \sum_{j=1}^{256} (s_{i,j} - r_{i,j})^2} \tag{5.5} \]
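A direct transcription of these two figures of merit into code may help fix the definitions (a Python/NumPy sketch; the array names are ours):

    import numpy as np

    def null_percentage(c_th):
        """P of Eq. 5.3: percentage of null coefficients among the elements of Cth."""
        return 100.0 * np.count_nonzero(c_th == 0) / c_th.size

    def rms_error(s, r):
        """E of Eq. 5.4 and 5.5: root mean square error between the original signal S
        and the reconstructed signal R; the same expression covers the 1D case
        (shape (65536,)) and the 2D case (shape (256, 256))."""
        return np.sqrt(np.mean((np.asarray(s) - np.asarray(r)) ** 2))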

Even if the parameters P and E are not directly comparable to the results obtained with the compression algorithms implemented on the CARLOS prototypes, they give an important indication of the performance of each filter set used during the analysis.

In particular, P gives a rough estimate of how much the coefficients Cth can be compressed using run-length encoding, while E can


be interpreted as the error introduced in the value associated with each sample coming from the SDD. The analysis results related to 10 SDD events are shown from Tab. 5.2 to Tab. 5.7. In particular, Tab. 5.2 shows the P and E values related to a 5-level analysis using the Haar filter, both in 1D and 2D, with a threshold value th varying in the range 0-25. The other tables show the P and E values obtained with a 5-level analysis and a threshold th of 25, using filters belonging to the Daubechies (Tab. 5.3), Symlets (Tab. 5.4), Coiflets (Tab. 5.5), Biorthogonal (Tab. 5.6) and Reverse Biorthogonal (Tab. 5.7) families, in the 1D and 2D cases. The uncertainties ∆P and ∆E are reported in terms of their orders of magnitude only, since we are only looking for an estimate of these values.

An interesting feature emerging from Tab. 5.2 is the progressive increase of the values P and E with the increase of the threshold value th applied to the coefficients C.

The trend of P is easy to understand considering that applying the threshold th to the decomposition coefficients C means setting to 0 all coefficients smaller than th in absolute value: thus the greater the th value, the greater the parameter P.

As for E, the greater the th value, the greater the differences between Cth and the original C, and hence the distortion introduced.

It is to be noticed that for th equal to 0 the parameter P is 9.12, while the parameter E is 1.26e-14, i.e. both the percentage of null coefficients in Cth and the reconstruction error are very small. This is quite easy to understand for P since, without a threshold, the null coefficients are only a very small fraction of the total number. As for E, not modifying the coefficients C with the threshold ensures a nearly perfect reconstruction of the signal: the value 1.26e-14 comes from the finite precision of the machine performing the analysis and synthesis processes.


Haar (5-level analysis)
th       P (1D)   E (1D)     P (2D)   E (2D)
0          9.12   1.26e-14     3.68   2.50e-14
1         24.68   0.27        22.21   0.28
2         40.01   0.63        42.63   0.75
3         58.60   1.64        56.34   1.19
4         67.08   1.71        67.76   1.67
5         75.56   2.09        75.50   2.09
6         79.87   2.38        80.77   2.44
7         83.56   2.68        84.96   2.77
8         86.71   2.99        88.21   3.08
9         88.82   3.23        90.75   3.36
10        90.70   3.48        92.88   3.63
11        92.21   3.72        94.49   3.87
12        93.20   3.89        95.80   4.08
13        94.16   4.07        96.78   4.26
14        94.81   4.21        97.56   4.42
15        95.33   4.34        98.20   4.57
16        95.72   4.44        98.73   4.71
17        96.03   4.54        99.05   4.80
18        96.20   4.60        99.25   4.86
19        96.41   4.67        99.44   4.93
20        96.54   4.72        99.55   4.97
21        96.62   4.76        99.64   5.01
22        96.69   4.79        99.69   5.03
23        96.73   4.81        99.74   5.05
24        96.76   4.83        99.77   5.07
25        96.79   4.85        99.80   5.09

Table 5.2: Mean values of P and E on 10 SDD events (∆P ≈ ∆E ≈ 0.01): the analysis has been performed on 5 levels, using the Haar set of filters derived from the Haar wavelet.


Daubechies
Filters   P (1D)   E (1D)   P (2D)   E (2D)
db1        96.79    4.85     99.80    5.09
db2        96.75    4.82     99.63    5.08
db3        96.73    4.81     99.54    5.07
db4        96.73    4.81     99.48    5.07
db5        96.72    4.81     99.33    5.07
db6        96.71    4.81     99.27    5.07
db7        96.72    4.82     99.20    5.07
db8        96.70    4.81     99.08    5.08
db9        96.69    4.81     98.98    5.09
db10       96.68    4.80     98.98    5.09

Table 5.3: Mean values of P and E on 10 SDD events (∆P ≈ ∆E ≈ 0.01): the analysis has been performed on 5 levels, using the Daubechies sets of filters and a threshold th equal to 25; the values obtained with db1 are equivalent to the ones obtained with Haar, since the corresponding filters are equivalent.

Symlets
Filters   P (1D)   E (1D)   P (2D)   E (2D)
sym2       96.75    4.82     99.63    5.08
sym3       96.73    4.81     99.54    5.07
sym4       96.74    4.82     99.43    5.07
sym5       96.72    4.81     99.38    5.06
sym6       96.73    4.81     99.33    5.07
sym7       96.70    4.80     99.17    5.06
sym8       96.71    4.80     99.11    5.08

Table 5.4: Mean values of P and E on 10 SDD events (∆P ≈ ∆E ≈ 0.01): the analysis has been performed on 5 levels, using the Symlets sets of filters and a threshold th equal to 25.


Coiflets
Filters   P (1D)   E (1D)   P (2D)   E (2D)
coif1      96.74    4.82     99.51    5.07
coif2      96.72    4.80     98.32    4.75
coif3      96.72    4.81     99.60    5.06
coif4      96.69    4.80     98.62    5.06
coif5      96.68    4.80     98.29    5.05

Table 5.5: Mean values of P and E on 10 SDD events (∆P ≈ ∆E ≈ 0.01): the analysis has been performed on 5 levels, using the Coiflets sets of filters and a threshold th equal to 25.

Biorthogonal
Filters    P (1D)   E (1D)   P (2D)   E (2D)
bior1.1     96.79    4.85     99.80    5.09
bior1.3     96.68    4.81     99.48    5.07
bior1.5     96.64    4.82     99.25    5.05
bior2.2     96.28    4.71     98.70    4.94
bior2.4     96.28    4.65     98.56    4.92
bior2.6     96.23    4.62     98.27    4.91
bior2.8     96.21    4.63     97.81    4.91
bior3.1     93.41    5.68     94.15    5.58
bior3.3     94.37    4.84     95.43    5.01
bior3.5     94.70    4.65     96.60    5.10
bior3.7     94.81    4.59     95.13    4.85
bior3.9     94.88    4.56     94.13    4.85
bior4.4     96.75    4.82     99.39    5.07
bior5.5     96.78    4.88     99.46    5.10
bior6.8     96.68    4.79     98.95    5.04

Table 5.6: Mean values of P and E using the Biorthogonal filters


Reverse Biorthogonal
Filters    P (1D)   E (1D)   P (2D)   E (2D)
rbio1.1     96.79    4.85     99.80    5.09
rbio1.3     96.77    4.85     99.57    5.08
rbio1.5     96.75    4.86     99.39    5.06
rbio2.2     96.78    4.92     96.89    4.58
rbio2.4     96.79    4.88     99.47    5.12
rbio2.6     96.77    4.87     99.32    5.11
rbio2.8     96.78    4.88     99.18    5.12
rbio3.1     96.38    8.67     98.76   11.29
rbio3.3     96.72    5.14     99.29    5.39
rbio3.5     96.76    4.95     99.28    5.18
rbio3.7     96.76    4.92     99.09    5.18
rbio3.9     96.74    4.91     98.97    5.20
rbio4.4     96.68    4.80     99.29    5.06
rbio5.5     93.32    4.63     98.56    4.92
rbio6.8     96.71    4.81     99.10    5.08

Table 5.7: Mean values of P and E on 10 SDD events (∆P ≈ ∆E ≈ 0.01): the analysis has been performed on 5 levels, using the Reverse Biorthogonal sets of filters and a threshold th equal to 25; the values obtained with rbio1.1 are equivalent to the ones obtained with Haar, since the corresponding filters are equivalent.

The common feature of Tab. 5.3, Tab. 5.4, Tab. 5.5, Tab. 5.6 and Tab. 5.7 is the increase of the values P and E with the increase of the th value.

Nevertheless some wavelet families are better suited than others to the compression task; by comparing the values obtained for th = 25, it is evident that the Haar set of filters shows the best performance. In particular, with P = 96.79 and E = 4.85 in the uni-dimensional case and P = 99.80 and E = 5.09 in the bi-dimensional case, the Haar set of filters achieves the highest percentage of null coefficients with an acceptable error.


Family                 Set of filters name        Filter length
Haar                   haar                       2
Daubechies             dbN                        2N
Symlets                symN                       2N
Coiflets               coifN                      6N
Biorthogonal           bior1.1                    2
                       biorN1.N2 (N1≠1, N2≠1)     max(2N1, 2N2) + 2
Reverse Biorthogonal   rbio1.1                    2
                       rbioN1.N2 (N1≠1, N2≠1)     max(2N1, 2N2) + 2

Table 5.8: Length of filters belonging to different families

The choice of the Haar filters can be supported by other arguments too, concerning the length of the filters H, G, H̄ and Ḡ, i.e. the number of coefficients which characterize their impulse responses. As shown in Tab. 5.8, the filters belonging to the Haar family have the smallest number of coefficients, obviously together with the equivalent sets db1, bior1.1 and rbio1.1. Since the analysis and synthesis processes consist of successive convolutions between the signal to analyze or synthesize and the respective filters, this small number of coefficients allows for a higher execution speed of the analysis and synthesis processes.

5.2.3 Choice of the dimensionality, number of levels and threshold value

Once the Haar set of filters had been chosen, we studied the effect on the P and E parameters of the dimensionality (1D or 2D), of the number of decomposition levels (1, 2, . . . , 16 in 1D and 1, 2, . . . , 8 in 2D) and of the value of the threshold th.

Tab. 5.9 and Tab. 5.10 show the analysis of the usual 10 SDD events in


1D and 2D; each table contains the values of P and E for 1, 3 and 5 levels of decomposition and, for each level, threshold values between 0 and 25.

The first result is that the bi-dimensional analysis produces a higher percentage P of null coefficients than the uni-dimensional one; nevertheless its E values are also higher. For instance, comparing the P and E values for a threshold value th of 25, the 1D analysis on 1 level gives P = 50.01 and E = 1.85, while the 2D analysis gives P = 74.96 and E = 3.96; the same 1D analysis on 3 levels gives P = 87.45 and E = 4.18, versus P = 98.35 and E = 5.00 in the 2D case.

Another result obtained from the tables is that, once it has been decided whether to use 1D or 2D analysis, an increase in the number of decomposition levels determines an increase in the values of the parameters P and E. For instance, by comparing the values in Tab. 5.9 obtained with th equal to 25, it can be noticed that the 1D analysis on 1 level gives P = 50.01 and E = 1.85, on 3 levels P = 87.45 and E = 4.18, and on 5 levels P = 96.79 and E = 4.85. The same concept holds true for the 2D analysis and synthesis. Thus we found that the optimized version of a multiresolution-analysis-based algorithm for SDD data is a 2D analysis on the maximum number of decomposition levels using the Haar set of filters.
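The 2D analysis referred to here is the separable one: each level filters first the rows and then the columns of the 256×256 event, producing one approximation sub-band and three detail sub-bands. A minimal sketch of a single 2D Haar level (Python with NumPy; a simplified model using the same sign convention as the earlier 1D sketch) is:

    import numpy as np

    def haar_step(x):
        """One Haar filtering + decimation step along the last axis:
        pairwise sums and differences, scaled by 1/sqrt(2)."""
        a = (x[..., ::2] + x[..., 1::2]) / np.sqrt(2.0)
        d = (x[..., 1::2] - x[..., ::2]) / np.sqrt(2.0)
        return a, d

    def analyze2d(img):
        """One separable 2D Haar level: rows first, then columns, producing the
        approximation sub-band LL and the detail sub-bands LH, HL, HH."""
        la, ld = haar_step(img)                 # filter and decimate the rows
        ll, lh = haar_step(la.swapaxes(0, 1))   # then the columns of each half
        hl, hh = haar_step(ld.swapaxes(0, 1))
        return ll.T, lh.T, hl.T, hh.T           # recurse on LL for deeper levels

    img = np.random.rand(256, 256)
    ll, lh, hl, hh = analyze2d(img)             # four 128x128 sub-bands

Recursing on the LL sub-band yields the deeper levels, down to the maximum allowed by the image size (8 levels for a 256×256 event, consistent with the range studied above).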

As for the threshold th, the parameters P and E increase when th is increased. In order to decide the th value, we have to be able to quantify the reconstruction error introduced by the wavelet analysis and to compare it with that of the compression algorithms implemented on CARLOS.


5.3 Choice of the architecture

The precision of the architecture chosen for the implementation of the multiresolution analysis can strongly affect the percentage P of null coefficients and the reconstruction error E. As an example, it is sufficient to apply both the analysis and synthesis processes to an input signal without any threshold: the reconstruction error E, though very small, is different from 0, due to the finite precision of the Pentium II processor used to perform the calculations.

In order to quantify the influence of the architecture on the algorithm performance we used Simulink, a Matlab tool for the design and simulation of complex systems, together with the Fixed-Point Blockset [25], which allows one to simulate the performance of a given algorithm when implemented on different architectures, both fixed and floating point.

5.3.1 Simulink and the Fixed-Point Blockset

The Fixed-Point Blockset [25] is one of the Simulink libraries; it contains blocks performing operations between signals, such as sum, multiplication, convolution and so on, while simulating various types of architectures, both fixed and floating point. This tool is very useful since it allows the designer to study the performance of a given algorithm on different architectures before the actual implementation takes place. For instance, it can be used to decide whether a Fourier transform can be implemented with acceptable performance on a fixed-point DSP (Digital Signal Processor) or whether it has to be implemented on a floating-point DSP. The difference is relevant especially for cost reasons, since a floating-point DSP is much more expensive than a fixed-point one. We used the Fixed-Point Blockset with the same purpose of finding the most suitable architecture before the actual implementation.

Among the various floating and fixed-point architectures handled by


the Fixed-Point Blockset, we studied the following ones:

– double precision floating point IEEE 754 standard architecture;

– single precision floating point IEEE 754 standard architecture;

– fractional fixed point.

The IEEE 754 standard architecture is one of the most widespread and is used in most floating-point processors.

In double precision, the standard requires a 64-bit word in which 1 bit holds the sign s, 11 bits the exponent e and the remaining 52 bits the mantissa m:

    b63 | b62 ... b52 | b51 ... b0
     s  |      e      |     m

The relationship between the binary and the decimal representation is the following one:

\[ \text{decimal value} = (-1)^s \cdot 2^{e-1023} \cdot (1.m), \qquad 0 < e < 2047 \tag{5.6} \]

In single precision, the standard requires a 32-bit word in which 1 bit holds the sign s, 8 bits the exponent e and the remaining 23 bits the mantissa m:

    b31 | b30 ... b23 | b22 ... b0
     s  |      e      |     m

In this case the relationship between the binary and the decimal representation is the following one:

\[ \text{decimal value} = (-1)^s \cdot 2^{e-127} \cdot (1.m), \qquad 0 < e < 255 \tag{5.7} \]
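As a cross-check of Eq. 5.7, the following sketch (Python; the struct round-trip is only used to obtain the raw 32 bits of a float) decodes a single-precision word field by field:

    import struct

    def decode_ieee754_single(word):
        """Field-by-field application of Eq. 5.7 to a raw 32-bit pattern."""
        s = (word >> 31) & 0x1        # 1 sign bit
        e = (word >> 23) & 0xFF       # 8 exponent bits
        m = word & 0x7FFFFF           # 23 mantissa bits
        assert 0 < e < 255            # normalized numbers only, as in Eq. 5.7
        return (-1) ** s * 2.0 ** (e - 127) * (1 + m / 2 ** 23)

    bits, = struct.unpack(">I", struct.pack(">f", -6.25))
    print(decode_ieee754_single(bits))   # -6.25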

As for the fractional fixed-point architecture, once the position of the radix point within the 32 bits of the word has been fixed, the bits on the right (b0-b_{s-1}) contain the fractional part of the number, one bit on the left (b_s) contains the sign of the number, and the other guard bits (b_{s+1}-b31) on the left of the radix point contain the integer part of the number.

    b31 ... b_{s+1} | b_s | b_{s-1} ... b1 b0
       guard bits     sign    fractional part
                           ^ radix point

It is to be noticed that the double-precision IEEE 754 floating-point architecture features a precision of 2^-52 ≈ 10^-16, the single-precision IEEE 754 one a precision of 2^-23 ≈ 10^-7, while the fractional fixed-point architecture has a precision of 2^-s, i.e. its precision depends on the number of bits used for the fractional part of the number. Thus the study of the influence of the fractional fixed-point architecture on the multiresolution analysis has been carried out by varying the position of the radix point within the 32-bit word.
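The effect of a fractional fixed(s) architecture can be emulated, to first order, by rounding every value to the 2^-s grid; the following rough sketch (Python/NumPy, a simplified model that ignores overflow in the guard bits) shows how the representable precision degrades as the radix point moves to the right:

    import numpy as np

    def quantize_fixed(x, s):
        """Round x to the fixed(s) grid, whose precision is 2**-s (radix point
        placed s bits from the right of the word; overflow is ignored here)."""
        return np.round(np.asarray(x) * 2.0 ** s) / 2.0 ** s

    x = 1 / np.sqrt(2.0)                       # a typical filter coefficient
    for s in (18, 9, 5, 3):
        err = abs(quantize_fixed(x, s) - x)
        print(f"fixed({s}): representation error {err:.1e}")
    # the error grows roughly as 2**-(s+1), mirroring the trend of Tab. 5.11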

5.3.2 Choice of the architecture

Implementing the bi-dimensional multiresolution analysis and synthesis in Simulink is quite a long job, both in terms of design and of simulation time. We therefore decided to implement a uni-dimensional algorithm on 16 decomposition levels, which is a much quicker and simpler job. Besides, it gives a rather good estimate of the performance of the three architectures on an algorithm very similar to the one we have chosen.

The implementation in Simulink of the multiresolution analysis and synthesis processes is shown in the outer blocks of Fig. 5.3: the block on the left performs the 1D analysis of the signal S using the Haar set of filters, while the block on the right applies a threshold to the decomposition coefficients and performs the synthesis of the signal R.

[Figure: top-level Simulink model. From left to right: a 16-level analysis block ('16 livelli di analisi') taking the input signal ('Segnale' S) and producing D1-D16 and A16, a Delay block, and a threshold-and-synthesis block ('Applicazione soglia e 16 livelli di sintesi') producing the reconstructed signal ('Segnale Ricostruito' R).]

Figure 5.3: Developed Simulink blocks: from left to right the analysis block, the delay block and the threshold and synthesis block

[Figure: the analysis block as a 16-level cascade of Hi_Dec Filter and Low_Dec Filter operators followed by Downsample (by 2) operators, producing the detail outputs D1-D16 and the approximation output A16.]

Figure 5.4: Zoom on the developed analysis block

[Figure: the threshold and synthesis block, composed of the threshold sub-block ('Applicazione soglia'), the To Workspace block storing the thresholded coefficients D1th-D16th and A16th, and the 16-level reconstruction sub-block ('16 livelli di ricostruzione') producing the reconstructed signal ('Segnale Ricostruito').]

Figure 5.5: Zoom on the developed threshold and synthesis block

[Figure: the synthesis block as a 16-level cascade of Upsample (by 2) operators followed by Hi_Rec Filter, Low_Rec Filter and FixPt Sum operators, recombining D1-D16 and A16 into the reconstructed signal ('Segnale Ricostruito').]

Figure 5.6: Zoom on the developed synthesis block


The analysis block has been implemented as a 16-level cascade, see

Fig. 5.4, containing high-pass filter operators (Hi Dec Filter), low pass

filter operators (Low Dec Filter) and Downsample operators. Hi Dec

Filter operators perform convolution between the incoming signal and

the Haar high pass decomposition filter, Low Dec Filter operators per-

form convolution between the incoming signal and the Haar low pass

decomposition filter, while the Downsample operators perform the dec-

imation of the incoming signal.

Fig. 5.5 shows the threshold and synthesis block, which is subdivided into three major sub-blocks: the sub-block on the left applies a threshold to the input stream, the sub-block on the right performs the synthesis of the signal, while the central block, called To Workspace, stores the decomposition coefficients after the application of the threshold, so that they can be used for calculating the percentage P of null coefficients.

The synthesis block has been implemented, in analogy to the analysis

block, as a 16-level cascade, see Fig. 5.6, containing Hi Rec Filter oper-

ators performing the convolution between the incoming signal and the

Haar high-pass reconstruction filter, Low Rec Filter operators perform-

ing the convolution between the incoming signal and the Haar low-pass

reconstruction filter, FixPt Sum operators performing the sum between

filtered signals and Upsample operators performing the upsampling on

the incoming signals.

Finally, the Delay block shown in Fig. 5.3 has the task of starting the synthesis process only once the analysis has been completed. It is to be noticed that the analysis, delay and synthesis blocks have been developed starting from simple blocks belonging to the Fixed-Point Blockset, such as filtering, downsampling and upsampling blocks.

After performing the analysis and synthesis of the 10 SDD events, with a threshold value equal to 25, on the 3 architectures described above, we obtained the values shown in Tab. 5.11; as a notation, the double-precision floating-point IEEE 754 architecture is


indicated as ieee754doub, the single-precision floating-point IEEE 754 architecture as ieee754sing and the fractional fixed-point architecture as fixed(s), where s is the number of bits representing the fractional part of the number.

The Simulink simulations show how the values P and E depend on the precision of the selected architecture. In particular, taking as a reference the P and E values least influenced by the finite precision of the calculations, i.e. those related to the ieee754doub architecture, one notices in the cases ieee754sing, fixed(18), fixed(15), fixed(12) and fixed(9) a slight increase in the error E while P remains constant, whereas in the cases fixed(7), fixed(5) and fixed(3) the discrepancy with respect to the ieee754doub values increases strongly.

Thus the results we obtained pointed us towards the choice of one of the following architectures: ieee754doub, ieee754sing, fixed(18), fixed(15), fixed(12) and fixed(9). Our choice fell on ieee754sing, as explained in Par. 5.5.

5.4 Multiresolution algorithm performances

For a direct comparison between the performance obtained by the compression algorithms implemented on the CARLOS prototypes and by the multiresolution-based algorithm, we developed a FORTRAN subroutine running analysis and synthesis on a single-precision floating-point SPARC5 processor. The FORTRAN subroutine can be logically divided into two parts: the first aims at estimating the compression performance of the algorithm, the second at estimating the reconstruction error on the cluster charge.

The first part of the subroutine performs analysis, threshold th application and synthesis on SDD events containing several charge clusters. After applying analysis and threshold, the reciprocal of the compression ratio is calculated for each SDD event,

\[ c^{-1} = \frac{\text{number of output bits}}{\text{number of input bits}} , \]

with the assumption that each non-null decomposition coefficient is encoded


using two 32-bit words, one representing the value of the coefficient itself, the other representing the number of null coefficients between the current and the previous non-null coefficient. Thus the number of bits entering the algorithm is the number of samples multiplied by 8 bits (64k × 8 = 512k), while the number of bits exiting the algorithm is the number of non-null decomposition coefficients multiplied by the 32 + 32 = 64 bits used to encode each coefficient.
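Under this encoding assumption the compression figure reduces to a simple count. A sketch of the calculation (Python/NumPy, mirroring the first part of the FORTRAN subroutine; c_th is assumed to hold the thresholded decomposition coefficients of one 64-ksample event) is:

    import numpy as np

    def inverse_compression_ratio(c_th, bits_per_sample=8, bits_per_coeff=64):
        """c^-1 = (number of output bits) / (number of input bits), assuming each
        non-null coefficient costs two 32-bit words (value + zero-run counter)."""
        n_in = c_th.size * bits_per_sample          # 64k samples x 8 bits = 512 kbit
        n_out = np.count_nonzero(c_th) * bits_per_coeff
        return n_out / n_in

    # e.g. with 99.8% null coefficients: c^-1 = 0.002 * 64 / 8 = 16e-3, below the
    # 46e-3 target corresponding to c = 22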

The second part of the FORTRAN subroutine performs analysis, threshold application and synthesis on single-cluster SDD events. After analysis, threshold th application and synthesis, the difference between the coordinates of the cluster charge before compression and after synthesis is computed for each SDD event, as well as the percentage difference between the charge of the cluster before compression and after reconstruction.

Fig. 5.7, Fig. 5.8, Fig. 5.9, Fig. 5.10, Fig. 5.11 and Fig. 5.12 show the value of the compression parameter c−1 for different threshold values th; in each figure the upper histogram represents the c−1 values of the 500 SDD events analyzed, while the lower histogram represents the c−1 values of the SDD events whose c−1 is less than 46 × 10−3 (c = 22).

As the histograms show, the mean c−1 values are lower than our target value c−1 = 46 × 10−3 for every selected threshold. Thus the multiresolution algorithm can reach an acceptable compression ratio already by putting a threshold of 20 on the analyzed coefficients.

As for the reconstruction error calculation, up to now we could use only 20 single-cluster events. Consequently the histograms reporting the coordinate and charge differences before and after compression suffer from very poor statistics.

For this reason the results we obtained on the reconstruction error are rather qualitative up to now: in particular, performing the analysis on 20 SDD events and using a threshold th equal to 21, the differences in the centroid coordinates before and after compression are of the order of a µm, whereas the difference in the cluster


charge shows an underestimation of a few percent. These qualitative results are of the same order of magnitude as those obtained with the compression algorithms implemented in the CARLOS prototypes.

Figure 5.7: c−1 values for th=20
Figure 5.8: c−1 values for th=21
Figure 5.9: c−1 values for th=22
Figure 5.10: c−1 values for th=23
Figure 5.11: c−1 values for th=24
Figure 5.12: c−1 values for th=25

5.5 Hardware implementation

The hardware we have chosen for the implementation of the wavelet-based compression algorithm is a DSP chip from Analog Devices (AD): the ADSP-21160. The DSP belongs to the Single Instruction Multiple Data SHARC family produced by AD. It performs calculations both in fixed point and in single-precision floating point at the same speed. Our choice fell on this DSP also because of this interesting feature, since it allows us to try two different architectures with a single chip. The chip has the following features:

– 600 MFLOPS (32-bit floating point) peak operation;


– 600 MOPS (32-bit fixed point) peak operation;

– 100 MHz core operation;

– 4 Mbits on-chip dual-ported SRAM;

– division of SRAM between program and data memory is user se-

lectable;

– 14 channels of zero overhead DMA;

– JTAG standard test access port.

Particularly interesting in this chip is the amount of on-chip memory: 4 Mbits are sufficient to store the algorithm program and at least 2 SDD events (each one requires 512 Kbits). Thus, while one SDD event is being processed, another one can be fetched into the internal SRAM using the DMA channels, increasing the total throughput.

The DSP was bought together with an evaluation board and the VisualDSP integrated development environment, which allows one to write C code and download it to the DSP chip. The implementation of the wavelet-based compression algorithm on the DSP is still in the design phase, so no data concerning the algorithm speed are available yet for a quantitative comparison with the CARLOS chip prototypes.


Haar (1D)
        1 level          3 levels         5 levels
th      P       E        P       E        P       E
0      7.78  3.02e-15   9.05  7.11e-15   9.12  1.26e-14
1     17.51  0.22      23.67  0.26      24.68  0.27
2     31.23  0.65      38.11  0.62      40.01  0.63
3     40.09  1.01      55.81  1.21      58.60  1.64
4     44.28  1.25      63.48  1.56      67.08  1.71
5     47.84  1.52      71.20  2.00      75.56  2.09
6     48.78  1.61      74.80  2.26      79.87  2.38
7     49.31  1.68      77.81  2.52      83.56  2.68
8     49.71  1.74      80.38  2.79      86.71  2.99
9     49.78  1.76      82.02  2.99      88.82  3.23
10    49.87  1.78      83.41  3.19      90.70  3.48
11    49.91  1.79      84.50  3.38      92.21  3.72
12    49.94  1.80      85.17  3.50      93.20  3.89
13    49.97  1.81      85.81  3.64      94.16  4.07
14    49.98  1.82      86.25  3.75      94.81  4.21
15    49.98  1.83      86.60  3.84      95.33  4.34
16    49.99  1.83      86.85  3.92      95.72  4.44
17    50.00  1.84      87.02  3.98      96.03  4.54
18    50.00  1.84      87.12  4.02      96.20  4.60
19    50.00  1.84      87.24  4.07      96.41  4.67
20    50.00  1.84      87.32  4.10      96.54  4.72
21    50.01  1.84      87.36  4.12      96.62  4.76
22    50.01  1.84      87.40  4.14      96.69  4.79
23    50.01  1.84      87.42  4.16      96.73  4.81
24    50.01  1.85      87.43  4.17      96.76  4.83
25    50.01  1.85      87.45  4.18      96.79  4.85

Table 5.9: Mean values of P and E on 10 SDD events (∆P ≈ ∆E ≈ 0.01): the 1D analysis has been performed on 1, 3 and 5 levels, using the Haar set of filters.


Haar (2D)
        1 level          3 levels         5 levels
th      P       E        P       E        P       E
0      3.54  5.32e-15   3.67  1.5e-14    3.68  2.50e-14
1     18.90  0.26      22.06  0.28      22.21  0.28
2     36.05  0.69      42.33  0.74      42.63  0.75
3     46.42  1.07      55.90  1.19      56.34  1.19
4     55.25  1.47      67.15  1.66      67.76  1.67
5     60.69  1.80      74.78  2.07      75.50  2.09
6     64.01  2.06      79.95  2.42      80.77  2.44
7     66.46  2.30      84.03  2.75      84.96  2.77
8     68.30  2.51      87.18  3.05      88.21  3.08
9     69.73  2.70      89.64  3.33      90.75  3.36
10    70.95  2.90      91.72  3.59      92.88  3.63
11    71.87  3.06      93.25  3.82      94.49  3.87
12    72.63  3.22      94.51  4.03      95.80  4.08
13    73.20  3.35      95.46  4.21      96.78  4.26
14    73.65  3.47      96.21  4.36      97.56  4.42
15    74.06  3.59      96.84  4.51      98.20  4.57
16    74.38  3.69      97.34  4.64      98.73  4.71
17    74.53  3.75      97.63  4.72      99.05  4.80
18    74.65  3.80      97.82  4.79      99.25  4.86
19    74.76  3.85      98.01  4.85      99.44  4.93
20    74.82  3.87      98.11  4.89      99.55  4.97
21    74.87  3.90      98.20  4.93      99.64  5.01
22    74.91  3.92      98.25  4.95      99.69  5.03
23    74.93  3.94      98.29  4.97      99.74  5.05
24    74.94  3.95      98.32  4.99      99.77  5.07
25    74.96  3.96      98.35  5.00      99.80  5.09

Table 5.10: Mean values of P and E on 10 SDD events (∆P ≈ ∆E ≈ 0.01): the 2D analysis has been performed on 1, 3 and 5 levels, using the Haar set of filters.


Architecture   Precision   P       E
ieee754doub    2^-52       99.88   5.07
ieee754sing    2^-23       99.88   5.11
fixed(18)      2^-18       99.88   5.11
fixed(15)      2^-15       99.88   5.11
fixed(12)      2^-12       99.88   5.11
fixed(9)       2^-9        99.88   5.11
fixed(7)       2^-7        99.87   6.04
fixed(5)       2^-5        99.81   12.75
fixed(3)       2^-3        99.52   89.09

Table 5.11: Mean values of P and E on 10 SDD events (∆P ≈ ∆E ≈ 0.01), obtained with Simulink simulations


Conclusions

The main goal of this thesis work was the search for compression algorithms, and their hardware implementation, to be applied to the data coming out of the Silicon Drift Detectors in the ALICE experiment.

ALICE and, in general, the LHC experiments put very stringent constraints on the compression algorithms as regards compression ratio, reconstruction error, speed, flexibility and so on. For example, the data produced by the SDDs have to be reduced by a factor of 22 in order to satisfy the constraints on disk space for permanent storage. Therefore many standard compression algorithms have been studied in order to find which one could obtain the best trade-off between compression ratio and reconstruction error, i.e. the distortion introduced. It is rather obvious, in fact, that a compression ratio as high as 22 can only be achieved at the expense of some loss of information on the physical charge distribution over the SDD surface.

Three hardware prototypes implementing data compression are presented in the thesis: the front-end chips CARLOS v1, v2 and v3. Their evolution from version 1 to version 3 reflects the architectural changes in the readout chain that occurred during the three years of this work. Three major reasons justify these changes:

– the necessity to work in a radiation environment, forcing us to

choose a radiation-tolerant technology;

– the lack of space for the SIU board, forcing us to change the

readout architecture;


– the change from a uni-dimensional (1D) compression algorithm to

a bi-dimensional one (2D), in order to have the same compression

ratio as in 1D, while using lower thresholds, thus losing a smaller

amount of physical data.

We plan that CARLOS v4 will be the final version of the chip: it will

contain the 2D algorithm and will be designed to be compliant with

the new readout architecture. It should be sent to the foundry before

the end of 2002.

One of the main features of these chips is that lossy compression can be switched off when needed and turned into lossless compression. Lossless data compression becomes necessary when the compression algorithms implemented on the CARLOS chips are no longer applicable. For example, the 2D compression algorithm does not perform well in the presence of a slope in the anodic signal baseline. In this case the on-line compression on the front-end has to be switched off and a second-level compressor in the counting room has to do the job. For this kind of application different compression algorithms have to be studied.

As an alternative to the 1D and 2D algorithms, our group in Bologna decided to study a wavelet-based compression algorithm, in order to decide whether it could be useful for a possible second-level data compression. Our simulations proved that the algorithm shows good performance as regards both the compression ratio and the reconstruction error. We are still working to obtain more quantitative results and, at the same time, an implementation on a DSP is planned for the near future in order to evaluate the compression speed and how many DSPs would be necessary for the task. The use of DSPs in the counting room may be very convenient since, unlike ASICs, they are completely reprogrammable via software if needed. In this way as many different compression algorithms as desired can be tried on the input data in order to find the best one.


Bibliography

[1] ALICE Collaboration, “Technical Proposal for A Large Ion

Collider Experiment at the CERN LHC”, December 1995,

CERN/LHCC/95-71.

[2] The LHC study group, “The Large Hadron Collider Conceptual

Design”, October 1995, CERN/AC/95-05(LHC).

[3] P. Giubellino, E. Crescio, “The ALICE experiment at LHC:

physics prospects and detector design”, January 2001, ALICE-

PUB-2000-35.

[4] CERN/LHCC 99-12 ALICE TDR 4, 18 June 1999.

[5] E. Crescio, D. Nouais, P. Cerello, “A detailed study of charge diffu-

sion and its effect on spatial resolution in Silicon Drift Detectors”,

September 2001, ALICE-INT-2001-09.

[6] F. Faccio, K. Kloukinas, G. Magazzu, A. Marchioro, “SEU

effects in registers and in a Dual-Ported Static RAM designed in

a 0.25 µm CMOS technology for applications in the LHC”, Fifth

Workshop on Electronics for LHC Experiments, September 20-24,

1999, pages 571-575.

[7] K. Sayood, “Introduction to Data Compression”, Morgan Kauf-

mann, S. Francisco, 1996.

[8] E. S. Ventsel, “Teoria delle probabilità”, Mir edition.

[9] S. W. Smith, “The Scientist and Engineer’s Guide to Digital Signal

Processing”, California Technical Publishing, S. Diego, 1999.


[10] J. Badier, Ph. Busson, A. Karar, D.W. Kim, G.B. Kim, S.C. Lee, “Reduction of ECAL data volume using lossless data compression techniques”, Nuclear Instruments and Methods in Physics Research A 463 (2001), pages 361-374.

[11] R. Polikar, “The Engineer's ultimate guide to wavelet analysis”, http://engineering.rowan.edu/~polikar/WAVELETS/WTtutorial.html, 2001.

[12] P. G. Lemarié, Y. Meyer, “Ondelettes et bases hilbertiennes”, Revista Matemática Iberoamericana, Vol. 2, pages 1-18, 1986.

[13] E. J. Stollnitz, T. D. DeRose and D. H. Salesin, “Wavelets for computer graphics: a primer”, IEEE Computer Graphics and Applications, Vol. 3, NO. 15, pages 76-84, May 1995 (part 1) and Vol. 4, NO. 15, pages 75-85, July 1995 (part 2).

[14] P. Morton, “Image Compression Us-

ing the Haar Wavelet Transform”,

http://online.redwoods.cc.ca.us/instruct/darnold/maw/haar.htm,

1998.

[15] B. Burke Hubbard, “The World According to Wavelets: the story

of a mathematical technique in the making”, A K Peters, Ltd.,

Wellesley, 1998.

[16] S. G. Mallat, “A Theory for Multiresolution Signal Decomposition: The Wavelet Representation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, NO. 7, pages 674-693, July 1989.

[17] D. Cavagnino, P. De Remigis, P. Giubellino, G. Mazza and A. E. Werbrouck, “Data Compression for the ALICE Silicon Drift Detector”, 1998, ALICE-INT-1998-41.

[18] Pankaj Gupta and Nick McKeown, “Designing and Implementing a Fast Crossbar Scheduler”, Jan/Feb 1999, IEEE Micro.


[19] D. Cavagnino, P. Giubellino, P. De Remigis, A. Werbrouck, G.

Alberici, G. Mazza, A. Rivetti, F. Tosello, “Zero suppression and

Data Compression for SDD Output in the ALICE Experiment”,

Internal note/SDD, ALICE-INT-1999-28 V 1.0.

[20] P. Moreira, J. Christiansen, A. Marchioro, E. van der Bij, K.

Kloukinas, M. Campbell, G. Cervelli, “A 1.25 Gbit/s Serializer

for LHC Data and Trigger Optical Links”, Fifth Workshop on

Electronics for LHC Experiments, September 20-24, 1999, pages

194-198.

[21] F. Wang, “BIST using pseudorandom test vectors and signa-

ture analysis”, IEEE 1988 Custom Integrated Circuits Conference,

CH2584-1/88/0000-0095.

[22] T.W. Williams, W. Daehn, “Aliasing errors in multiple input sig-

nature analysis registers”, 1989 IEEE, CH2696-3/89/0000/0338.

[23] M. Misiti, Y. Misiti, G. Oppenheim and J. M. Poggi, “Wavelet

Toolbox User’s Guide”, The MathWorks, Inc., Natick, 2000.

[24] “Simulink User’s Guide: Dynamic System Simulation for Matlab”,

The MathWorks, Inc., Natick, 2000.

[25] “Fixed-Point Blockset User’s Guide: for Use with Simulink”, The

MathWorks, Inc., Natick, 2000.
