# Information Theory part 11: Claude Shannon: A Mathematical Theory of Communication

Now, Shannon had just finished developing his theories related to cryptography, and so he was well aware that human communication is a mix of randomness and statistical dependencies: letters in our messages depend, to some extent, on the letters that came before them. In 1948 he published a groundbreaking paper, "A Mathematical Theory of Communication," and in it he uses Markov models as the basis for how we can think about communication. And he starts with

a toy example. Imagine you encounter a bunch of text written in an alphabet of A, B, and C. Perhaps you know nothing about this language, but you notice that A's seem to clump together while B's and C's do not. He then shows that you could design a machine to generate similar-looking text using a Markov chain. He starts off with a zeroth-order approximation, which means we just independently select each symbol, A, B, or C, at random and form a sequence. However, notice that this sequence doesn't look like the original. He then shows that you

could do a bit better with a first-order approximation, where the letters are chosen independently but according to the probability of each letter in the original sequence. This is slightly better, as A's are now more likely, but it still doesn't capture much structure. The next step is key. A second-order approximation takes into account each pair

of letters which can occur. In this case, we need three states: the first state represents all pairs which begin with A, the second, all pairs which begin with B, and the third, all pairs which begin with C. Notice now that the A cup has many AA pairs, which makes sense, since the conditional probability of an A after an A is higher in our original message. Now we can generate a sequence using this second-order model easily, as follows: we start anywhere and pick a tile, write down (or output) the first letter, and move to the cup defined by the second letter. Then we pick a new tile and repeat this process indefinitely. Notice that this sequence is starting to look very similar to the original message, because the model is capturing the conditional dependencies between letters.
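The three approximations can be sketched as a short simulation. Here is a minimal sketch in Python; the probabilities are made-up numbers chosen so that A's clump, not Shannon's figures:

```python
import random

# Made-up statistics for a toy language over A, B, C in which A's clump.
letter_probs = {"A": 0.5, "B": 0.25, "C": 0.25}
# Conditional probabilities P(next letter | current letter):
# an A is much more likely to follow an A.
pair_probs = {
    "A": {"A": 0.7, "B": 0.15, "C": 0.15},
    "B": {"A": 0.4, "B": 0.3,  "C": 0.3},
    "C": {"A": 0.4, "B": 0.3,  "C": 0.3},
}

def zeroth_order(n):
    # Each symbol selected independently and uniformly at random.
    return "".join(random.choice("ABC") for _ in range(n))

def first_order(n):
    # Symbols still independent, but weighted by overall letter frequency.
    letters = list(letter_probs)
    weights = list(letter_probs.values())
    return "".join(random.choices(letters, weights, k=n))

def second_order(n):
    # Markov chain: the state is the last letter written; each state's
    # "cup" of tiles is the conditional distribution over the next letter.
    out = [random.choice("ABC")]
    for _ in range(n - 1):
        cup = pair_probs[out[-1]]
        out.append(random.choices(list(cup), list(cup.values()))[0])
    return "".join(out)

print(zeroth_order(30))
print(first_order(30))   # A's are more frequent
print(second_order(30))  # A's clump together, as in the original
```

The `second_order` function is exactly the tile-and-cup procedure described above: sample a tile from the current cup, output a letter, and move to the cup named by that letter.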

If we want to do even better, we can move to a third-order approximation, which takes into account groups of three letters, or trigrams; in this case, we would need nine states. Next, Shannon applies this exact same logic to actual English text, using statistics that were known for letters, pairs, trigrams, et cetera, and he shows the same progression from zeroth-order random letters to first-order, second-order, and third-order sequences. He then goes on and

tries this same thing using words instead of letters, and he writes, "The resemblance to ordinary English text increases quite noticeably at each of the above steps." Indeed, these machines were producing meaningless text, yet text which contained approximately the same statistical structure you'd see in actual English. Shannon then proceeds to define a quantitative measure of information, as he realizes that the amount of information in a message must be tied up in the design of the machine that could be used to generate similar-looking sequences. Which brings us to his concept of entropy.
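Returning to the word-level machine mentioned above, the same idea can be sketched with a word-pair table. This is a minimal illustration over a tiny made-up corpus, nothing like the published word tables Shannon actually worked from:

```python
import random
from collections import defaultdict

# Tiny made-up corpus standing in for real English statistics.
corpus = ("the cat sat on the mat and the dog sat on the log "
          "and the cat saw the dog").split()

# Build the word-pair table: for each word, every word observed after it.
# Repeats are kept, so sampling uniformly from the list reproduces the
# conditional frequencies in the corpus.
following = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev].append(nxt)

def generate(n):
    # Word-level second-order approximation: each word is drawn from the
    # distribution of words that followed the previous word in the corpus.
    out = [random.choice(corpus)]
    for _ in range(n - 1):
        nexts = following.get(out[-1])
        if not nexts:            # dead end (final corpus word): restart
            nexts = corpus
        out.append(random.choice(nexts))
    return " ".join(out)

print(generate(12))
```

The output is meaningless, but because each adjacent word pair occurs somewhere in the corpus, it reads far more like the source than words drawn independently would.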