Markov, Shannon, and Turbo Codes: The Benefits of Hindsight
Professor Stephen B. Wicker
School of Electrical Engineering
Cornell University
Ithaca, NY 14853
Introduction
Theme 1: Digital Communications, Shannon and Error Control Coding
Theme 2: Markov and the Statistical Analysis of Systems with Memory
Synthesis: Turbo Error Control: Parallel Concatenated Encoding and Iterative Decoding
Digital Telecommunication
The classical design problem: transmitter power vs. bit error rate (BER)
Complications:
– Physical Distance
– Co-Channel and Adjacent Channel Interference
– Nonlinear Channels
Shannon and Information Theory
Noisy Channel Coding Theorem (1948):
– Every channel has a capacity C.
– If we transmit at a data rate that is less than capacity, there exists an error control code that provides arbitrarily low BER.
For an AWGN channel:
$$C = W \log_2\!\left(1 + \frac{E_s}{N_0}\right) \text{ bits per second}$$
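As a quick sanity check on the formula, a few lines of Python (an illustrative sketch, not from the slides; the bandwidth and SNR values are made up):

    import math

    def awgn_capacity(bandwidth_hz, snr_linear):
        """Shannon capacity of an AWGN channel: C = W * log2(1 + SNR)."""
        return bandwidth_hz * math.log2(1 + snr_linear)

    # Hypothetical example: 1 MHz of bandwidth at a linear SNR of 7
    print(awgn_capacity(1e6, 7))  # -> 3000000.0 bits per second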
Coding Gain
Coding Gain: $P_{\mathrm{UNCODED}} - P_{\mathrm{CODED}}$
– The difference in power (in dB) required by the uncoded and coded systems to obtain a given BER.
NCCT: Almost 10 dB possible on an AWGN channel with binary signaling.
1993: NASA/ESA Deep Space Standard provides 7.7 dB.
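To make the dB bookkeeping concrete, here is a small Python sketch (illustrative only: the 7.7 dB figure above is the slide's, the 1.9 dB coded operating point is a hypothetical chosen to match it, and the rest is the standard uncoded-BPSK calculation):

    import math

    def q_function(x):
        """Gaussian tail probability Q(x)."""
        return 0.5 * math.erfc(x / math.sqrt(2))

    def bpsk_ber(ebno_db):
        """Uncoded BPSK bit error rate: Q(sqrt(2 * Eb/N0))."""
        ebno = 10 ** (ebno_db / 10)
        return q_function(math.sqrt(2 * ebno))

    # Uncoded BPSK needs roughly 9.6 dB for a BER of 1e-5.
    print(bpsk_ber(9.6))        # ~1e-5
    # A code reaching BER 1e-5 at 1.9 dB would give a coding gain of
    print(9.6 - 1.9, "dB")      # 7.7 dB, the deep-space standard figure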
Classical Error Control Coding
MAP Sequence Decoding Problem:
– Find X that maximizes p(X|Y).
– Derive estimate of U from estimate of X.
– General problem is NP-Hard – related to many optimization problems.
– Polynomial time solutions exist for special cases.
[Block diagram: U = (u_1, …, u_k) → Encoder → X → Noisy Channel → Y]
Class P Decoding Techniques
Hard decision: MAP decoding reduces to minimum distance decoding.
– Example: Berlekamp algorithm (RS codes)
Soft decision: Received signals are quantized.
– Example: Viterbi algorithm (convolutional codes)
These techniques do NOT minimize information error rate.
Binary Convolutional Codes
Memory is incorporated into the encoder in an obvious way.
The resulting code can be analyzed using a state diagram.
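As an illustration (a minimal sketch, not from the slides), the following Python implements a rate-1/2 feedforward convolutional encoder with memory 2 and the standard textbook generators (7, 5) in octal; the two-bit shift-register contents are exactly the states of the state diagram:

    def conv_encode(bits, g1=0b111, g2=0b101, memory=2):
        """Rate-1/2 feedforward convolutional encoder.

        The encoder state is the last `memory` input bits; each input
        bit produces two output bits, one per generator polynomial.
        """
        state = 0
        out = []
        for b in bits:
            reg = (b << memory) | state               # input bit + register
            out.append(bin(reg & g1).count("1") % 2)  # parity under g1
            out.append(bin(reg & g2).count("1") % 2)  # parity under g2
            state = reg >> 1                          # shift: drop oldest bit
        return out

    print(conv_encode([1, 0, 1, 1, 0, 0]))  # 12 coded bits for 6 info bits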
Trellis for a Convolutional Code
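The trellis search is easy to sketch in code. Below is a minimal hard-decision Viterbi decoder for the hypothetical (7, 5) code above (an illustrative sketch reusing conv_encode from the previous block, not the slides' own material); it walks the trellis stage by stage, keeping one survivor path per state:

    def viterbi_decode(received, g1=0b111, g2=0b101, memory=2):
        """Hard-decision Viterbi decoding of the rate-1/2 (7,5) code."""
        num_states = 1 << memory
        INF = float("inf")
        metric = [0.0] + [INF] * (num_states - 1)  # start in state 0
        paths = [[] for _ in range(num_states)]

        def branch(state, b):
            reg = (b << memory) | state
            o1 = bin(reg & g1).count("1") % 2
            o2 = bin(reg & g2).count("1") % 2
            return o1, o2, reg >> 1                # outputs, next state

        for j in range(0, len(received), 2):
            r1, r2 = received[j], received[j + 1]
            new_metric = [INF] * num_states
            new_paths = [None] * num_states
            for s in range(num_states):
                if metric[s] == INF:
                    continue
                for b in (0, 1):                   # two branches per state
                    o1, o2, nxt = branch(s, b)
                    m = metric[s] + (o1 != r1) + (o2 != r2)  # Hamming metric
                    if m < new_metric[nxt]:        # keep the better survivor
                        new_metric[nxt] = m
                        new_paths[nxt] = paths[s] + [b]
            metric, paths = new_metric, new_paths
        return paths[metric.index(min(metric))]

    coded = conv_encode([1, 0, 1, 1, 0, 0])
    coded[3] ^= 1                                  # inject one channel error
    print(viterbi_decode(coded))                   # -> [1, 0, 1, 1, 0, 0]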
Trees and Sequential Decoding
Convolutional code can be depicted as a tree.
Tree and metric define a metric space.
Sequential decoding is a local search of a metric space.
Search complexity is a polynomial function of memory order.
May not terminate in a finite amount of time.
Local search methodology to return...
Theme 2: Markov and Memory
Markov was, among many other things, a cryptanalyst.
– Interested in the structure of written text.
– Certain letters can only be followed by certain others.
Markov Chains:
– Let I be a countable set of states and let $\lambda$ be a probability measure on I.
– Let random variable S range over I and set $\lambda_i = p(S = i)$.
– Let $P = \{p_{ij}\}$ be a stochastic matrix with rows and columns indexed by I.
– $S = (S_n)_{n \ge 0}$ is a Markov chain with initial distribution $\lambda$ and transition matrix P if
  - $S_0$ has distribution $\lambda$
  - $p(S_{n+1} = j \mid S_0, S_1, S_2, \ldots, S_{n-1}, S_n = i) = p(S_{n+1} = j \mid S_n = i) = p_{ij}$
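A two-state sketch in Python makes the definition concrete (illustrative; the states and probabilities are invented):

    import random

    states = ["A", "B"]
    init = [0.5, 0.5]            # initial distribution (lambda)
    P = [[0.9, 0.1],             # row i of P: p(S_{n+1} = j | S_n = i)
         [0.4, 0.6]]

    def sample_chain(n):
        s = random.choices([0, 1], weights=init)[0]
        path = [states[s]]
        for _ in range(n - 1):
            # Markov property: the next state depends only on the current one
            s = random.choices([0, 1], weights=P[s])[0]
            path.append(states[s])
        return path

    print("".join(sample_chain(20)))  # e.g. "AAAABBAAABB..."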
Hidden Markov Models
HMM:
– Markov chain X = X_1, X_2, …
– Sequence of r.v.'s Y = Y_1, Y_2, … that are a probabilistic function f(·) of X.
Inference Problem: Observe Y and infer:
– Initial state of X
– State transition probabilities for X
– Probabilistic function f(·)
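Extending the chain sketch above (again purely illustrative), an HMM adds an emission step, and only Y is visible to the observer:

    import random

    P = [[0.9, 0.1], [0.4, 0.6]]     # hidden-state transition matrix
    emit = [[0.8, 0.2], [0.3, 0.7]]  # p(Y = y | X = x): the function f()

    def sample_hmm(n, x=0):
        ys = []
        for _ in range(n):
            ys.append(random.choices([0, 1], weights=emit[x])[0])  # emit Y
            x = random.choices([0, 1], weights=P[x])[0]            # step X
        return ys

    print(sample_hmm(15))  # observations only; the state sequence stays hidden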
Hidden Markov Models are Everywhere...
Duration of eruptions by Old Faithful
Movement of locusts (Locusta migratoria)
Suicide rate in Cape Town, SA
Progress of epidemics
Econometric models
Decoding of convolutional codes
Baum-Welch Algorithm
Lloyd Welch and Leonard Baum developed an iterative solution to the HMM inference problem (~1962).
The application-specific solution was classified for many years.
Published in general form:
– L. E. Baum and T. Petrie, "Statistical Inference for Probabilistic Functions of Finite State Markov Chains," Ann. Math. Stat., vol. 37, pp. 1554–1563, 1966.
BW Overview
Member of the class of algorithms now known as "Expectation-Maximization," or "EM," algorithms.
– Initial hypothesis $\theta^{(0)}$
– Series of estimates generated by the mapping $\theta^{(i)} = T(\theta^{(i-1)})$
– $P(\theta^{(0)}) \le P(\theta^{(1)}) \le P(\theta^{(2)}) \le \cdots \le P(\hat{\theta})$, where $\hat{\theta}$ is the maximum likelihood parameter estimate.
– $\lim_{i \to \infty} \theta^{(i)} = \hat{\theta}$
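The fixed-point structure is easy to render as a loop (a schematic sketch: `expectation_step` and `maximization_step` are hypothetical placeholders, not the Baum-Welch re-estimation formulas, and theta is treated as a scalar for brevity):

    def em(theta, data, expectation_step, maximization_step, tol=1e-9):
        """Generic EM iteration: theta_i = T(theta_{i-1}).

        Each pass can only increase the likelihood, so the estimates
        climb monotonically toward a (local) maximum.
        """
        while True:
            stats = expectation_step(theta, data)  # E: expected statistics
            new_theta = maximization_step(stats)   # M: re-estimate parameters
            if abs(new_theta - theta) < tol:       # fixed point reached
                return new_theta
            theta = new_theta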
Forward-Backward Algorithm: Exploiting the Markov Property
Goal: Derive probability measure $p(x_j, y)$.
BW algorithm recursively computes $\alpha$'s and $\beta$'s.
$$p(x_j, y) = p(x_j, y_{j^-}) \cdot p(y_j \mid x_j, y_{j^-}) \cdot p(y_{j^+} \mid x_j, y_j, y_{j^-})$$
$$= p(x_j, y_{j^-}) \cdot p(y_j \mid x_j) \cdot p(y_{j^+} \mid x_j)$$
$$= \underbrace{\alpha(x_j)}_{\text{past}} \cdot \underbrace{\gamma(x_j)}_{\text{present}} \cdot \underbrace{\beta(x_j)}_{\text{future}}$$
Forward and Backward Flow
Define the flow from $x_i$ to $x_j$ to be the probability that a random walk starting at $x_i$ will terminate at $x_j$.
$\alpha(x_j)$ is the forward flow to $x_j$ at time j.
$\beta(x_j)$ is the backward flow to $x_j$ at time j.
$$\alpha(x_j) = p(x_j, y_{j^-}) = \sum_{x_{j-1} \in X_{j-1}} \alpha(x_{j-1})\, Q(x_j \mid x_{j-1})\, \gamma(x_{j-1})$$
$$\beta(x_j) = p(y_{j^+} \mid x_j) = \sum_{x_{j+1} \in X_{j+1}} Q(x_{j+1} \mid x_j)\, \gamma(x_{j+1})\, \beta(x_{j+1})$$
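A compact Python sketch of the two recursions (illustrative; it reuses the two-state HMM parameters from the earlier sketch, writes `gamma[j][x]` for p(y_j | x_j = x), and omits the rescaling a real implementation needs on long sequences):

    def forward_backward(ys, init, P, emit):
        """Alpha/beta recursions: p(x_j, y) = alpha * gamma * beta."""
        n, S = len(ys), len(init)
        gamma = [[emit[x][ys[j]] for x in range(S)] for j in range(n)]

        alpha = [[0.0] * S for _ in range(n)]  # forward flow (past)
        alpha[0] = init[:]                     # alpha(x_0) = p(x_0)
        for j in range(1, n):
            for x in range(S):
                alpha[j][x] = sum(alpha[j-1][xp] * gamma[j-1][xp] * P[xp][x]
                                  for xp in range(S))

        beta = [[1.0] * S for _ in range(n)]   # backward flow (future)
        for j in range(n - 2, -1, -1):
            for x in range(S):
                beta[j][x] = sum(P[x][xn] * gamma[j+1][xn] * beta[j+1][xn]
                                 for xn in range(S))

        # p(x_j = x, y): past * present * future
        return [[alpha[j][x] * gamma[j][x] * beta[j][x] for x in range(S)]
                for j in range(n)]

    init = [0.5, 0.5]
    P = [[0.9, 0.1], [0.4, 0.6]]
    emit = [[0.8, 0.2], [0.3, 0.7]]
    post = forward_backward([0, 1, 1, 0], init, P, emit)
    print(post[1])  # unnormalized p(x_1, y); divide by its sum for the APP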
Earliest Reference to Backward-Forward Algorithm
Several of the woodsmen began to move slowly toward her and observing them closely, the little girl saw that they were turned backward, but really walking forward. “We have to go backward forward!” cried Dorothy. “Hurry up, before they catch us.”
– Ruth Plumly Thompson, The Lost King of Oz, pg. 120, The Reilly & Lee Co., 1925.
Generalization: Belief Propagation in Polytrees
Judea Pearl (1988)
Each node in a polytree separates the graph into two distinct subgraphs.
X d-separates upper and lower variables, implying conditional independence.
Spatial Recursion and Message Passing
Synthesis: BCJR
1974: Bahl, Cocke, Jelinek, and Raviv apply a portion of the BW algorithm to trellis decoding for convolutional and block codes.
– Forward and backward trellis flow: APP that a given branch is traversed.
– Info bit APP: sum of probabilities for branches associated with a particular bit value, as sketched below.
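In code form (a schematic sketch: `branch_app` is a hypothetical table mapping each trellis branch to its forward-backward APP, and `input_bit` maps a branch to the info bit that drives it):

    def bit_app(branch_app, input_bit):
        """BCJR output step: APP of each info bit value.

        Sums the APPs of all trellis branches whose input bit is a,
        for a in {0, 1}, then normalizes.
        """
        app = {0: 0.0, 1: 0.0}
        for branch, prob in branch_app.items():
            app[input_bit(branch)] += prob
        total = app[0] + app[1]
        return {a: p / total for a, p in app.items()}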
BW/BCJR
$$\alpha(u_j) \cdot \gamma(u_j) \cdot \beta(u_j)$$
Synthesis Crescendo: Turbo Coding
May 25, 1993: C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon Limit Error-Correcting Coding and Decoding: Turbo-Codes."
Two Key Elements:
– Parallel Concatenated Encoders
– Iterative Decoding
Parallel Concatenated Encoders
One “systematic” and two parity streams are generated from the information.
Recursive (IIR) convolutional encoders are used as “component” encoders.
Recursive Binary Convolutional Encoders
Impact of the Interleaver
Only a small number of low-weight input sequences are mapped to low-weight output sequences.
The interleaver ensures that if the output of one component encoder has low weight, the output of the other probably will not.
PCC emphasis: minimize the number of low-weight codewords, as opposed to maximizing the minimum weight (see the sketch below).
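A small experiment shows the mechanism (illustrative: the recursive encoder with feedback 1+D+D² and feedforward 1+D², and the uniform random interleaver, are stand-ins rather than the designs from the paper):

    import random

    def rsc_parity(bits):
        """Parity stream of a recursive systematic (IIR) encoder."""
        a1 = a2 = 0                  # feedback shift register
        parity = []
        for u in bits:
            a = u ^ a1 ^ a2          # recursive feedback: 1 + D + D^2
            parity.append(a ^ a2)    # feedforward tap:    1 + D^2
            a1, a2 = a, a1
        return parity

    k = 100
    u = [0] * k
    u[10] = u[13] = 1                # weight-2 input that re-zeroes encoder 1
    perm = list(range(k))
    random.shuffle(perm)             # uniform random interleaver
    u_perm = [u[i] for i in perm]

    print(sum(rsc_parity(u)))        # low parity weight from encoder 1
    print(sum(rsc_parity(u_perm)))   # usually far higher from encoder 2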
The PCE Decoding Problem
[Block diagram: U = (u_1, …, u_k) → CC1 Encoder → X_1 → Channel → Y_1 = (y_1^1, …, y_k^1); U → Channel → Y_s = (y_1^s, …, y_k^s); U → Interleaver → CC2 Encoder → X_2 → Channel → Y_2 = (y_1^2, …, y_k^2)]
$$\mathrm{BEL}_i(a) = p(u_i = a \mid y) = \underbrace{\lambda_i(a)}_{\text{systematic term}} \cdot \underbrace{\pi_i(a)}_{\text{a priori term}} \cdot \underbrace{\sum_{u:\,u_i = a} p(y_1 \mid x_1)\, p(y_2 \mid x_2) \prod_{\substack{j = 1 \\ j \ne i}}^{k} \lambda_j(u_j)\, \pi_j(u_j)}_{\text{extrinsic term}}$$
Turbo Decoding
BW/BCJR decoders are associated with each component encoder.
Decoders take turns estimating and exchanging distributions on the information bits, as sketched below.
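Schematically, the exchange looks like this (a sketch under assumptions: `bcjr` stands for a BW/BCJR component decoder returning bitwise APPs, while `extrinsic`, `interleave`, and `deinterleave` are hypothetical helpers; real decoders work with log-likelihood ratios and careful normalization):

    def turbo_decode(y_sys, y1, y2, bcjr, interleave, deinterleave, iters=8):
        """Two BW/BCJR decoders alternately refine the info-bit APPs.

        Each decoder's extrinsic output becomes the other's a priori input.
        """
        k = len(y_sys)
        prior = [[0.5, 0.5] for _ in range(k)]    # flat initial a priori
        for _ in range(iters):
            app1 = bcjr(y_sys, y1, prior)         # decoder 1
            ext1 = extrinsic(app1, y_sys, prior)  # hypothetical helper:
                                                  # strip systematic/a priori
            app2 = bcjr(interleave(y_sys), y2, interleave(ext1))  # decoder 2
            ext2 = extrinsic(app2, interleave(y_sys), interleave(ext1))
            prior = deinterleave(ext2)            # feed back to decoder 1
        return [0 if p[0] > p[1] else 1 for p in app1]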
Alternating Estimates of Information APP
Decoder 1: BW/BCJR derives
$$\mathrm{BEL}_i^{1}(a) = \alpha\, \underbrace{\lambda_i(a)}_{\text{systematic term}} \cdot \underbrace{\pi_i^{2}(a)}_{\text{updated term}} \cdot \underbrace{\sum_{u:\,u_i = a} p(y_1 \mid x_1) \prod_{\substack{j = 1 \\ j \ne i}}^{k} \lambda_j(u_j)\, \pi_j(u_j)}_{\text{extrinsic term}}$$
Decoder 2: BW/BCJR derives
$$\mathrm{BEL}_i^{2}(a) = \alpha\, \underbrace{\lambda_i(a)}_{\text{systematic term}} \cdot \underbrace{\pi_i^{1}(a)}_{\text{updated term}} \cdot \underbrace{\sum_{u:\,u_i = a} p(y_2 \mid x_2) \prod_{\substack{j = 1 \\ j \ne i}}^{k} \lambda_j(u_j)\, \pi_j(u_j)}_{\text{extrinsic term}}$$
Converging Estimates
Information exchanged by the decoders must not be strongly correlated with systematic info or earlier exchanges.
$$\pi_i^{(m)}(a) = \begin{cases} \dfrac{\alpha \Pr\{u_i = a \mid Y_s = y_s, Y_1 = y_1\}}{\lambda_i(a)\, \pi_i^{(m-1)}(a)} & \text{if } m \text{ is odd} \\[2ex] \dfrac{\alpha \Pr\{u_i = a \mid Y_s = y_s, Y_2 = y_2\}}{\lambda_i(a)\, \pi_i^{(m-1)}(a)} & \text{if } m \text{ is even} \end{cases}$$
Impact and Questions
Turbo coding provides coding gain near 10 dB.
– Within 0.3 dB of the Shannon limit.
– NASA/ESA DSN: 1 dB = $80M in 1996.
Issues:
– Sometimes turbo decoding fails to correct all of the errors in the received data. Why?
– Sometimes the component decoders do not converge. Why?
– Why does turbo decoding work at all?
Cross-Entropy Between the Component Decoders
Cross entropy, or the Kullback-Leibler distance, is a measure of the distance between two distributions.
Joachim Hagenauer et al. have suggested using a cross-entropy threshold as a stopping condition for turbo decoders.
$$D = \sum_{j=1}^{N} \sum_{a=0}^{1} \pi^{1}(u_j = a \mid Y)\, \log \frac{\pi^{1}(u_j = a \mid Y)}{\pi^{2}(u_j = a \mid Y)}$$
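In code (a sketch of the stopping test as described, not Hagenauer's exact formulation; `p1` and `p2` are the two decoders' bitwise distributions, assumed strictly positive):

    import math

    def cross_entropy(p1, p2):
        """Kullback-Leibler distance between the decoders' bit APPs."""
        return sum(p1[j][a] * math.log(p1[j][a] / p2[j][a])
                   for j in range(len(p1)) for a in (0, 1))

    def should_stop(p1, p2, threshold=1e-3):
        """Stop iterating once the two decoders essentially agree."""
        return cross_entropy(p1, p2) < threshold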
Correlating Decoder Errors with Cross-Entropy
Neural Networks do the Thinking
Neural networks can implement any piecewise-continuous function.
Goal: Emulation of indicator functions for turbo decoder error and convergence.
Two Experiments:
– FEDN: Predict eventual error and convergence at the beginning of the decoding process.
– DEDN: Detect error and convergence at the end of the decoding process.
Network Performance
Missed detection occurs when the number of errors is small.
The average weight of error events in NN-assisted turbo decoding is far less than that of CRC-assisted turbo decoding.
When coupled with a code combining protocol, NN-assisted turbo decoding is extremely reliable.
What Did the Networks Learn?
Examined weights generated during training.
Network monitors slope of cross entropy (rate of descent).
Conjecture:
– Turbo decoding is a local search algorithm that attempts to minimize cross-entropy cycles.
– Topology of search space is strongly determined by initial cross entropy.
Exploring the Conjecture
Turbo Simulated Annealing (Buckley, Hagenauer, Krishnamachari, Wicker)
– Nonconvergent turbo decoding is nudged out of local minimum cycles by randomization (heat).
Turbo Genetic Decoding (Krishnamachari, Wicker)
– Multiple processes are started in different places in the search space.
Turbo Coding: A Change in Error Control Methodology
"Classical" response to Shannon:
– Derive probability measure on transmitted sequence, not actual information.
– Explore optimal solutions to special cases of NP-Hard problem.
– Optimal, polynomial time decoding algorithms limit choice of codes.
"Modern": Exploit Markov property to obtain temporal/spatial recursion:
– Derive probability measure on information, not codeword.
– Explore suboptimal solutions to more difficult cases of NP-Hard problem.
– Iterative decoding
– Graph-theoretic interpretation of code space
– Variations on local search
The Future
Relation of cross entropy to impact of cycles in belief propagation.
Near-term abandonment of PCEs as unnecessarily restrictive.
Increased emphasis on low-density parity check codes and expander codes.
– Decoding algorithms that look like solutions to the K-SAT problem.
– Iteration between subgraphs.
– Increased emphasis on decoding as local search.