# A mathematical theory of communication | Computer Science | Khan Academy

just finished developing his theories related to
cryptography and therefore was well aware that human
communication was a mix of randomness and
statistical dependencies. Letters in our messages were obviously dependent on previous
letters to some extent. In 1949, he published
a groundbreaking paper, “A Mathematical Theory of Communication”. In it, he uses Markov models as the basis for how we can think about communication. He starts with a toy example. Imagine you encounter a bunch of text written in an alphabet of A, B, and C. Perhaps you know nothing
seem to clump together, while Bs and Cs do not. He then shows that you
could design a machine to generate similar-looking text, using a Markov chain. He starts off with a
zeroth-order approximation, which means we just independently select each symbol A, B, or C at
random, and form a sequence However, notice that this sequence doesn’t look like the original. He shows then you could do a bit better with a first-order approximation, where the letters are
chosen independently, but according to the
probability of each letter in the original sequence. This is slightly better
as As are now more likely, but it still doesn’t
capture much structure. The next step is key. A second-order approximation
takes into account each pair of letters which can occur. In this case, we need three states. The first state represents
all pairs which begin with A, the second all pairs that begin with B, and the third state all
pairs that begin with C. Notice now that the A
cup has many AA pairs, which makes sense, since
the conditional probability of an A after an A is higher
in our original message. We can generate a sequence using this second-order
model easily as follows. We start anywhere and pick a tile, and we write down our
output the first letter, and move to the cup defined
by the second letter. Then we pick a new tile, and repeat this process indefinitely Notice that this sequence
is starting to look very similar to the original message, because this model is capturing the conditional dependencies
between letters. If we wanted to do even better, we could move to a
third-order approximation, which takes into account groups of three letters, or “trigrams”. In this case, we would need nine states. But next, Shannon applies
this exact same logic to actual English text, using statistics that were known for letters,
pairs, and trigrams, etc. He shows the same
progression from zeroth-order random letters to
first-order, second-order and third-order sequences. He then goes on and tries
the same thing using words instead of letters, and he writes “the resemblance to ordinary English text “increases quite
noticeably at each depth.” Indeed, these machines were producing meaningless text, though they contained approximately the same
statistical structure you’d see in actual English. Shannon then proceeds to define a quantitative measure of information, as he realizes that the
amount of information in some message must be tied up in the design of the
machine which could be used to generate similar-looking sequences. Which brings us to his concept of entropy.