# Information Theory part 11: Claude Shannon: A Mathematical Theory of Communication

developing his theories related to cryptography
and therefore was well aware that human
communication was a mix of randomness and
statistical dependencies. Letters in our
messages were obviously dependent on previous
letters to some extent. And in 1949, he published
a groundbreaking paper, a mathematical theory
of communication. And in it, he uses Markov
models as the basis for how we can think
about communication. And he starts with
a toy example. Imagine you encounter a
bunch of texts written in an alphabet of
A, B, and C. Perhaps you know nothing
together while B and Cs do not. He then shows that you
could design a machine to generate similar looking
text using a Markov chain. And he starts off with a
zeroth-order approximation, which means we just
independently select each symbol, A, B, or C at
random, and form a sequence. However, notice
that this sequence doesn’t look like the original. He shows then you
could do a bit better with a first-order approximation
where the letters are chosen independently,
but according to the probability
of each letter in the original sequence. So this is slightly better,
as As are now more likely, but it still doesn’t
capture much structure. The next step is key. A second-order
order approximation takes into account each pair
of letters which can occur. And in this case, we
need three states. The first state
represents all pairs which began with A, the second,
all pairs that begin with B, and the third state, all
pairs that begin with C. And notice now that the A
cup as many AA pairs, which makes sense since the
conditional probability of an A after an A is higher in
our original message. Now we can generate a sequence
using this second-order model easily, as follows. We start anywhere
and pick a tile, and we write down or
output the first letter and move to the cup defined
by the second letter. Then we pick a new tile
and repeat this process indefinitely. Notice that this
sequence is starting to look very similar to
the original message, because this model is capturing
the conditional dependencies between letters. And if we want to
do even better, we could move to a third
order approximation which takes into account groups of
three letters, or trigrams. And in this case, we
would need nine states. But next, Shannon applies
this exact same logic to actual English
texts, using statistics that were known for letters,
pairs, and trigrams, et cetera. And he shows the
same progression from zeroth-order,
random letters, to first-order, second-order,
and third-order sequences. He then goes on and
tries this same thing using words instead of letters. And he writes “the resemblance
to ordinary English text increases quite
noticeably at each depth.” Indeed, these machines were
producing meaningless text, though they contained
approximately the same statistical structure
you’d see in actual English. Shannon then proceeds to
define a quantitative measure of information, as he realizes
that the amount of information in some message must be tied up
in the design of the machine, which could be used to generate
similar looking sequences. Which brings us to his
concept of entropy.