Crosswords and Information Theory

Peter Andreasen

December 17, 2000

Abstract

A gentle introduction to the wonders of the information theoretical concept of entropy through elementary calculation of the number of crosswords.
What is a crossword, really?

Most people have solved a crossword puzzle or played Scrabble. The existence of word-games like those is not to be taken for granted, though. As we are going to see, the existence of crosswords is entirely at the mercy of the underlying language. In fact, there is a connection between the information theoretic concept "entropy" and the possibility of creating crossword puzzles!

Before we proceed, we need to get a few definitions straight. First, what is a crossword? Let us take a look at one:

    g e m □
    a r e □
    m a t h
    e □ e □

What we see are rows and columns of words (single letters are accepted as words) separated by white squares.[1]
Now, the words in a crossword need not be English as in the example above. We might want to create a Danish crossword or we might even want to have the columns and rows be quotes from Shakespeare's sonnets. To be able to handle such complex rules for the creation of crosswords we make the following definition.
[1] It is in the white squares you will normally find the hints needed to solve the puzzle, and in most crosswords the topmost row and the leftmost column are filled with these hints. For simplicity we will make no assumptions about the placement of the white squares.
Definition 1. A language L is a set of sequences of letters from an alphabet A (say, the letters 'a' through 'z' and the symbol '□'). A crossword of size n is a matrix with the dimensions n × n where all of the rows are sequences (of length n) from L and all of the columns are sequences (of length n) from L.
So if we want to make a really sophisticated crossword, we may let L be all the possible quotes from Shakespeare. In that case we should use an alphabet which includes the letters as well as space and the various punctuation symbols. If we wanted to make a 'classic' crossword, we would have L be equal to any sequence you can make by taking words from a dictionary and gluing them together with one or more □'s in between. In this case the alphabet A would just be the letters and the special symbol □.
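To make Definition 1 concrete, here is a minimal sketch in Python. It is my own illustration, not part of the paper: the function names, the membership test for the 'classic' language and the tiny word set are all assumptions of the sketch, chosen to match the 4 × 4 example above.

```python
BLANK = '□'  # the special blank symbol in the alphabet A

def is_valid_classic(seq, words):
    """Membership test for the 'classic' language: every maximal run of
    letters between blanks must be a dictionary word."""
    return all(chunk in words for chunk in seq.split(BLANK) if chunk)

def is_crossword(grid, is_valid):
    """Definition 1: an n x n grid is a crossword iff every row and
    every column is a valid sequence of the language."""
    n = len(grid)
    columns = [''.join(row[j] for row in grid) for j in range(n)]
    return (all(len(row) == n for row in grid)
            and all(is_valid(seq) for seq in list(grid) + columns))

# The 4 x 4 example from the text: rows gem□, are□, math, e□e□ and
# columns game, era□, mete, □□h□ (single letters count as words).
words = {'gem', 'are', 'math', 'game', 'era', 'mete', 'e', 'h'}
grid = ['gem□', 'are□', 'math', 'e□e□']
print(is_crossword(grid, lambda s: is_valid_classic(s, words)))  # True
```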
How many are there?

It is obvious that very small crosswords are easily constructed, especially crosswords which have only one row or one column. It is also easy to create a few very big but very dull crosswords: if you keep alternating the rows between 'I□I□I□' and '□I□I□I', you certainly get a valid (and as big as you like) 'Classic English' crossword. So we want to not only consider the existence of big crosswords, but also check if there are many different ones. We are going to calculate the number of big crosswords now.
Assume we have chosen an alphabet A and a language L over which the crosswords must be made. We use the notation |A| for the number of letters and symbols in the alphabet. Let us introduce the following number as well,

$$\nu_L(n) = \text{number of sequences from } L \text{ of length } n.$$
So for constructing a square crossword of size n × n over the language L, there are ν_L(n) possible choices for the first row. We will now use a small trick and for a moment employ a bit of probability theory: if we picked an absolutely random sequence of n letters from A, what is the chance that we got a 'valid' row, that is, a sequence from L? The answer is

$$\frac{\nu_L(n)}{|A|^n}$$

because there are ν_L(n) valid sequences and |A|^n possible sequences. An example: suppose we wanted to create a normal English crossword. In my dictionary, there is 1 word (namely 'I') of length 1 and there are 49 words of length 2. Valid sequences of length 2 are '□□', '□I', 'I□', and then the 49 words of length 2. A total of 52 sequences, that is, ν_L(2) = 52. The total number of possible sequences of length 2 is |A|^2 = 27 · 27 = 729 (note that even though we only have 26 letters, the size of A is 27, because we need the symbol □ as well). And thus the probability of getting a valid sequence of length 2 would be 52/729 ≈ 0.07, in this example.
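This counting can be automated. The following sketch is mine, not the paper's: it counts ν_L(n) for the 'classic' language by dynamic programming, given only w[k], the number of dictionary words of length k. The function name and the recurrence are assumptions of the sketch.

```python
def nu_classic(n, w):
    """Count nu_L(n) for the 'classic' language: words glued together
    with one or more blanks, blanks allowed at either end."""
    a = [0] * (n + 1)  # a[m]: valid length-m sequences ending in a blank
    b = [0] * (n + 1)  # b[m]: valid length-m sequences ending in a word
    a[0] = 1           # the empty prefix may be followed directly by a word
    for m in range(1, n + 1):
        a[m] = a[m - 1] + b[m - 1]              # append a blank, or ...
        b[m] = sum(w.get(k, 0) * a[m - k]       # ... end with a word of
                   for k in range(1, m + 1))    #     length k
    return a[n] + b[n]

# The example above: 1 word of length 1 ('I') and 49 words of length 2.
w = {1: 1, 2: 49}
nu2 = nu_classic(2, w)        # 52, as counted by hand in the text
print(nu2, round(nu2 / 27**2, 2))  # 52 0.07
```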
However, that was the probability of just one valid row. What about the rest? The probability of all n rows being valid equals the above probability multiplied with itself n times[2],

$$\left( \frac{\nu_L(n)}{|A|^n} \right)^n = \frac{\nu_L(n)^n}{|A|^{n^2}}.$$

Now for the columns the situation is identical. And because the columns are as high as the rows are wide, the result is the same: the probability of all n columns being valid (that is, from L) equals

$$\frac{\nu_L(n)^n}{|A|^{n^2}}.$$

Now we may calculate the probability of a randomly selected matrix of n × n letters from A being in fact a crossword: we want both its rows and its columns to be valid, so we multiply:

$$\left( \frac{\nu_L(n)^n}{|A|^{n^2}} \right)^2 = \frac{\nu_L(n)^{2n}}{|A|^{2n^2}}.$$

[2] This is basic probability theory. It is comparable to when we say that the probability of a coin landing heads up equals 1/2 and then proceed to calculate the probability of two heads in a row as 1/2 · 1/2 = 1/4. We multiply the probabilities when we want the probability of both events.
We now return to our original question: how many (big) crosswords are there? Well, we know the probability of a randomly selected matrix of n × n letters being a crossword, and there are a total of |A|^(n·n) = |A|^(n²) possible n × n matrices, so we may write[3]

$$N_n = |A|^{n^2} \cdot \frac{\nu_L(n)^{2n}}{|A|^{2n^2}} = \frac{\nu_L(n)^{2n}}{|A|^{n^2}}.$$

This makes N_n our symbol for the number of crosswords of size n × n.

[3] This is another application of basic probability theory: the number of valid crosswords is calculated as the probability of a random matrix being valid times the total number of possible matrices.
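Plugging the toy numbers from before into this formula is a one-liner; the sketch below is mine, not the paper's. Note that, as footnote [3] says, N_n is a probability times a number of grids, so strictly it is an average count, which the paper reads as the number of crosswords.

```python
from fractions import Fraction

def crossword_count(nu_n, A, n):
    """N_n = nu_L(n)^(2n) / |A|^(n^2), evaluated exactly."""
    return Fraction(nu_n ** (2 * n), A ** (n * n))

# With nu_L(2) = 52 and |A| = 27 from the earlier example:
print(float(crossword_count(52, 27, 2)))  # about 13.8 valid 2 x 2 grids
```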
Explosive numbers

To get to the core of the matter, we need to do a bit of mathematical wizardry, so now is the time to wear your pointed hat! First, we apply the logarithm[4] to N_n:

$$\log N_n = 2n \log \nu_L(n) - n^2 \log |A| = 2n^2 \left( \frac{\log \nu_L(n)}{n} - \frac{\log |A|}{2} \right) = 2n^2 \log |A| \left( \frac{\log_{|A|} \nu_L(n)}{n} - \frac{1}{2} \right)$$

[4] Recall that taking the logarithm of a product yields a sum (log ab = log a + log b), the logarithm of a fraction yields a difference (log a/b = log a − log b) and the logarithm of a power turns into a product (log a^b = b log a).
The special symbol log_|A| is simply the logarithm to the base |A|; that is, |A|^(log_|A| x) = x. Recall that we are interested in the number N_n when n grows large. In the expression above, the first factor, 2n² log |A|, just grows towards infinity as n does the same. The fraction

$$\gamma_n = \frac{\log_{|A|} \nu_L(n)}{n}$$

is more interesting (so we give it a name, γ_n). The value of ν_L(n) must be between 0 and |A|^n (that should be clear from the definition of ν_L(n)). So (assuming ν_L(n) > 0) we see that log_|A| ν_L(n) is between 0 and n. Thus, when n grows large, the value of γ_n stays between 0 and 1. Let us assume that γ_n in fact converges to some number γ_L between 0 and 1. We may now conclude that if γ_L < 1/2, then log N_n tends to minus infinity, that is, N_n shrinks towards zero and big crosswords essentially do not exist; while if γ_L > 1/2, then N_n grows beyond all bounds: there are many big crosswords.

Nothing stops us from playing the same game with d-dimensional (d ≥ 3) crosswords.
The probability of one dimension (think: row) of the crosswords being valid equals

$$\left( \frac{\nu_L(n)}{|A|^n} \right)^{n^{d-1}} = \frac{\nu_L(n)^{n^{d-1}}}{|A|^{n^d}}.$$

This is almost the same result as before, but note the exponent n^(d−1). In the case d = 3, where we might imagine the crossword as a cube made up of 'sticks' of sequences from L, the exponent corresponds to the fact that in each dimension there are n² sticks. The probability of all dimensions (think: rows and columns) being valid equals

$$\left( \frac{\nu_L(n)^{n^{d-1}}}{|A|^{n^d}} \right)^d = \frac{\nu_L(n)^{d n^{d-1}}}{|A|^{d n^d}}.$$
Again, this should come as no shock. The total number of possible crosswords (think: any matrix) is multiplied with the probability and we find:

$$N_n^{(d)} = |A|^{n^d} \cdot \frac{\nu_L(n)^{d n^{d-1}}}{|A|^{d n^d}} = \frac{\nu_L(n)^{d n^{d-1}}}{|A|^{n^d (d-1)}}.$$

Applying the logarithm yields

$$\log N_n^{(d)} = d n^{d-1} \log \nu_L(n) - n^d (d-1) \log |A|$$

and reorganizing the terms,

$$\log N_n^{(d)} = d n^d \log |A| \left( \frac{\log_{|A|} \nu_L(n)}{n} - \frac{d-1}{d} \right). \qquad (1)$$
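A quick numerical check of formula (1), a sketch of mine rather than anything from the paper: evaluating both forms of log N_n^(d) gives the same value, and the sign flips exactly when γ_n crosses (d−1)/d.

```python
import math

def log_N(d, n, nu_n, A):
    """Direct form: d n^(d-1) log nu_L(n) - n^d (d-1) log |A|."""
    return d * n**(d - 1) * math.log(nu_n) - n**d * (d - 1) * math.log(A)

def log_N_factored(d, n, nu_n, A):
    """Reorganized form (1): d n^d log|A| (gamma_n - (d-1)/d)."""
    gamma_n = math.log(nu_n, A) / n
    return d * n**d * math.log(A) * (gamma_n - (d - 1) / d)

# Same numbers both ways; positive iff gamma_n exceeds (d-1)/d.
# With nu_L(2) = 52 and |A| = 27, d = 2 is (barely) positive, d = 3 is not.
for d in (2, 3):
    print(d, round(log_N(d, 2, 52, 27), 3),
             round(log_N_factored(d, 2, 52, 27), 3))
```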
The fraction in the above expression is recognized from before: it is γ_n, and we recall that γ_L denotes its limiting value as n becomes very big. We find that if, say, d = 3, the value of γ_L must be at least 2/3 if we want to have many, big crosswords. As the dimension of the crosswords grows, the language L must have a larger and larger γ-value to sustain the notion of many crosswords.
It seems like γ_L expresses something fundamental about the language L. So information theorists have a name for that value:

Definition 2. Let L be a language. The entropy of L is defined as

$$\tilde{H}(L) = \lim_{n \to \infty} \frac{\log_{|A|} \nu_L(n)}{n}.$$
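Definition 2 can be probed numerically. The sketch below is my own illustration, not the paper's: it reuses the dynamic-programming counter from earlier to watch γ_n drift towards its limit for the toy dictionary with one word of length 1 and 49 words of length 2.

```python
import math

def nu_classic(n, w):
    """Count nu_L(n) for the 'classic' language (same DP as earlier)."""
    a, b = [1] + [0] * n, [0] * (n + 1)
    for m in range(1, n + 1):
        a[m] = a[m - 1] + b[m - 1]
        b[m] = sum(w.get(k, 0) * a[m - k] for k in range(1, m + 1))
    return a[n] + b[n]

A, w = 27, {1: 1, 2: 49}
for n in (2, 8, 32, 128):
    gamma_n = math.log(nu_classic(n, w), A) / n
    print(n, round(gamma_n, 4))   # drifts down towards roughly 0.43
```

For this impoverished toy dictionary the values settle near 0.43, below the 1/2 needed for two dimensions, so by the analysis above it cannot sustain many big ordinary crosswords.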
We recognize the entropy as the same thing as we know as γ_L. The little tilde above H̃ is there to remind us that this is a special kind of entropy: the theory leading up to this definition is not as concise and rigid as many information theorists would want. But we should not feel that nothing has been accomplished: our entropy captures some very deep aspects of the concept.
We may wonder what happens if, say, H̃(L) = 1/4. What kind of crosswords are possible? Why, crosswords of dimension up to d = 4/3, of course! If d = 4/3 we have (d−1)/d = 1/4. How to visualize a crossword in 1.333 dimensions is probably better left as an exercise to the reader!
A recapitulation

Let us briefly examine what we have learnt so far: we introduced the concept of language, which is nothing but a set of sequences of letters. We have then made a satisfactory definition of what is a crossword over a language. Using elementary combinatorics and probability theory we have calculated the number of valid crosswords of size n × n (or, in the case of other dimensions, size n^d). This number depends on the size of the alphabet, |A|, as well as the special function ν_L(n). We then observed that there are essentially two different cases: in the first case (γ_L < (d−1)/d), the number of crosswords shrinks towards zero as the size grows; in the second case (γ_L > (d−1)/d), the same number becomes infinitely big as the size grows.
This calls for a reformulation of our initial question: while we opened this paper asking about the existence of crosswords, we are now tempted to ask: "Given a language L, what is the greatest dimension d for which there are many (big) crosswords?". This move encourages us to consider non-integer values of d, and thus we have definitely left the realm of ordinary crossword puzzles, and maybe the real world as well! The answer to the new question is related to the entropy, as we have just seen. In fact, combining the definition of entropy and formula (1) shows

$$\tilde{H}(L) = \frac{d_0 - 1}{d_0} \quad \text{and} \quad d_0 = \frac{1}{1 - \tilde{H}(L)},$$

where d_0 is exactly the largest dimension where it is possible to create (many) crosswords over L.
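As a worked illustration (mine, not the paper's), the greatest workable dimension follows directly from the entropy:

```python
def d0(H):
    """Largest crossword dimension d0 = 1 / (1 - H~(L))."""
    return float('inf') if H >= 1 else 1 / (1 - H)

print(d0(0.25))  # 1.333... = 4/3, the example from the text
print(d0(0.43))  # about 1.75, the toy dictionary estimated earlier
print(d0(1.0))   # inf: entropy 1 allows any dimension (see below)
```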
Note how d_0 may be arbitrarily big, even ∞ if the entropy equals 1. How is that for a crossword puzzle! Actually, if H̃(L) = 1 it is quite trivial to create crossword puzzles (in any dimension). An example of such a language L is that which is made up of every integer. The alphabet A is just the digits, and ν_L(n) = |A|^n = 10^n (because any sequence of length n which is made up from digits is a valid number), so clearly H̃(L) = 1.
We have arrived at the concept of entropy by a quite unusual method. Aside from (hopefully) some pedagogical advantages there are other reasons for picking this approach: we now have an entropy concept defined on any language or, which is the same, any set of sequences made up of letters from A. This is not true for the traditional entropy, which is introduced by the concept of information sources (also known as stochastic processes, and based on a quite technical probability-theoretic framework). In addition, while our entropy, in its current form, does not handle random languages (e.g. the language of all possible sequences of 0's and 1's created by flipping a coin), it is possible to refine our definitions to cover these important cases as well, and indeed to generalize the probability-theoretic entropy.
Entropy

Something should probably be said about why entropy is so central to information theory. But where to start and where to end! We will look at only two aspects, one of a somewhat philosophical nature and the other of a very practical nature.

Entropy is often said to be a measure of how 'complex' or even 'chaotic' things are. This corresponds nicely with the observations given above: a language made up of every possible integer is devoid of any form or structure. Anything goes. It is impossible to distinguish between a sequence from the language and a sequence of completely random digits. This language, as explained above, has the entropy 1. On the other hand, a language made up of sequences of only one letter, say 'a', is completely structured. No room for choices. The function ν_L(n) is constantly 1 regardless of the value of n. This corresponds to the case where H̃(L) = 0.
The other use of entropy which we will touch upon is data compression. We mention the following theorem in sketch form:

Theorem 1 (Shannon). Let L be a language over A. There exists an encoding such that any sequence x ∈ L of length n may be encoded into a sequence of no more than n(H̃(L) + ε) letters. This holds for any positive number ε, however small, provided the length n of x is large enough.
This formulation only captures the essence of Shannon's theorem. What is important is the order in which things happen: first we choose the value ε as small as we want it. This determines how "close" to the entropy we want our encoding to be. Then the theorem tells us that there exists a number N and a code so that any sequence x ∈ L which is at least N letters long can be encoded into just |x|(H̃(L) + ε) letters. So if the entropy of L is 1/2 we can compress long sequences from L by a factor of 2.
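Shannon's theorem is existential, but for our combinatorial entropy a concrete scheme comes close: enumerative coding, where a sequence is replaced by its index among all valid sequences, written in base |A|. The sketch below is mine, not Shannon's construction; the toy language (no two 'b's in a row) and all names are assumptions of the illustration.

```python
import itertools, math

def make_codec(valid_sequences, alphabet):
    """Enumerative code: send the index of x among all valid sequences,
    using ceil(log_|A| nu_L(n)) letters -- about n * H~(L) of them."""
    table = sorted(valid_sequences)
    index = {s: i for i, s in enumerate(table)}
    q = len(alphabet)
    width = max(1, math.ceil(math.log(len(table), q)))

    def encode(x):
        i, out = index[x], []
        for _ in range(width):          # write the index in base q
            out.append(alphabet[i % q])
            i //= q
        return ''.join(reversed(out))

    def decode(code):
        i = 0
        for ch in code:
            i = i * q + alphabet.index(ch)
        return table[i]

    return encode, decode, width

# Toy language over A = 'ab': sequences with no two 'b's in a row.
n = 10
L_n = [''.join(t) for t in itertools.product('ab', repeat=n)
       if 'bb' not in ''.join(t)]
encode, decode, width = make_codec(L_n, 'ab')
print(len(L_n), width)                                # 144 sequences, 8 letters
print(decode(encode('ababababab')) == 'ababababab')   # True: lossless
```

Ten letters become eight, a rate of 0.8; for this toy language the entropy is roughly 0.69 (base 2), and the gap between 0.8 and 0.69 is the ε of the theorem, shrinking as n grows.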
This concludes our tour. The connection between the complexity of a language and the ability to create crosswords may not come as a surprise. But that this connection leads directly to entropy, the cornerstone of information theory, is, at the very least, rather neat.
Notes

This section contains some notes about the history of the results. It is probably most interesting to readers already familiar with the concepts in this paper. The idea of linking crossword puzzles and entropy is, in fact, as old as [Shannon, 1948], from which we quote the last paragraph of section 7:

    The redundancy of a language is related to the existence of crossword puzzles. If the redundancy is zero any sequence of letters is a reasonable text in the language and any two-dimensional array of letters forms a crossword puzzle. If the redundancy is too high the language imposes too many constraints for large crossword puzzles to be possible. A more detailed analysis shows that if we assume the constraints imposed by the language are of a rather chaotic and random nature, large crossword puzzles are just possible when the redundancy is 50%. If the redundancy is 33%, three-dimensional crossword puzzles should be possible, etc.
For a more detailed discussion as well as a bit of history on the result see the last part of [Immink et al., 1998]. The entropy H̃ introduced in this paper is related to the Hartley entropy and Hausdorff dimensions of 'nice' subsets of A^∞ (considered as subsets of [0, 1]). Or, if one considers arbitrary subsets of A^∞, the entropy H̃ might be interpreted as a form of the box counting dimension, see e.g. [Falconer, 1990]. The connection between entropy and Hausdorff dimension is described in [Billingsley, 1965] and interesting results in this direction can be found in [Ryabko, 1986].
References

Billingsley, P. Ergodic Theory and Information. John Wiley & Sons, 1965.

Falconer, K. Fractal Geometry: Mathematical Foundations and Applications. John Wiley & Sons, 1990.

Immink, K.A.S., P.H. Siegel and J.K. Wolf. Codes for digital recorders. IEEE Trans. Inform. Theory, 44(6):2260–2299, 1998.

Ryabko, B. Y. Noiseless coding of combinatorial sources, Hausdorff dimension, and Kolmogorov complexity. Problems of Inform. Trans., 22(3):170–179, 1986.

Shannon, C.E. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, 1948.