Redundancy in the English Language

Whenever we communicate, rules everywhere restrict our freedom to choose the next letter and the next pineapple. Because these rules render certain patterns more likely and certain patterns almost impossible, languages like English come well short of complete uncertainty and maximal information: the sequence “th” has already occurred 6,431 times in this book, the sequence “tk” just this once. From the perspective of the information theorist, our languages are hugely predictable, almost boring.

To prove it, Shannon set up an ingenious, if informal, experiment in garbled text: he showed how, by playing with stochastic processes, we can construct something resembling the English language from scratch. Shannon began with complete randomness. He opened a book of random numbers, put his finger on one of the entries, and wrote down the corresponding character from a 27-symbol “alphabet” (26 letters, plus a space). He called it “zero-order approximation.” Here’s what happened:

XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.

There are equal odds for each character, and no character exerts a “pull” on any other. This is the printed equivalent of static. This is what our language would look like if it were perfectly uncertain and thus perfectly informative.
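In modern terms, Shannon’s book of random numbers is simply a uniform sampler over the 27 symbols. Here is a minimal sketch of the zero-order procedure in Python; the function name and the output length are illustrative choices of mine, not Shannon’s:

import random
import string

ALPHABET = string.ascii_uppercase + " "  # the 27-symbol "alphabet": 26 letters plus a space

def zero_order(length=60):
    # Every symbol is equally likely, and no symbol depends on its neighbors.
    return "".join(random.choice(ALPHABET) for _ in range(length))

print(zero_order())  # printed static, in the spirit of XFOML RXKHRJ...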

But we do have some certainty about English. For one, we know that some letters are likelier than others. A century before Shannon, Samuel Morse (inspired by some experimental rifling through a typesetter’s box of iron characters) had built his hunches about letter frequency into his telegraph code, assigning “E” an easy single dot and “Q” a more cumbersome dash-dash-dot-dash. Morse got it roughly right: by Shannon’s time, it was known that about 12 percent of English text is the letter “E,” and just 1 percent the letter “Q.” With a table of letter frequencies in one hand and his book of random numbers in the other, Shannon restacked the odds for each character. This is “first-order approximation”:

OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.
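The same independent draws, now weighted by letter frequency, give the first-order procedure. A sketch under stated assumptions: the frequency values below are rough, commonly cited figures rather than Shannon’s exact table, and the 18 percent weight for the space is a guess of my own:

import random

FREQ = {  # approximate relative frequencies of English letters, in percent
    "E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "I": 7.0, "N": 6.7,
    "S": 6.3, "H": 6.1, "R": 6.0, "D": 4.3, "L": 4.0, "C": 2.8,
    "U": 2.8, "M": 2.4, "W": 2.4, "F": 2.2, "G": 2.0, "Y": 2.0,
    "P": 1.9, "B": 1.5, "V": 1.0, "K": 0.8, "J": 0.15, "X": 0.15,
    "Q": 0.10, "Z": 0.07, " ": 18.0,
}

def first_order(length=60):
    # Draws are still independent, but "E" now turns up far more often than "Z".
    symbols, weights = zip(*FREQ.items())
    return "".join(random.choices(symbols, weights=weights, k=length))

print(first_order())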

More than that, though, we know that our freedom to insert any letter into a line of English text is also constrained by the character that’s just come before. “K” is common after “C,” but almost impossible after “T.” A “Q” demands a “U.” Shannon had tables of these two-letter “digram” frequencies, but rather than repeat the cumbersome process, he took a cruder tack, confident that his point was still made. To construct a text with reasonable digram frequencies, “one opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter is recorded, etc.” If all goes well, the text that results reflects the odds with which one character follows another in English. This is “second-order approximation”:

ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.

Out of nothing, a stochastic process has blindly created five English words (six, if we charitably supply an apostrophe and count ACHIN’). “Third-order approximation,” using the same method to search for trigrams, brings us even closer to passable English:

IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.
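Shannon’s page-flipping is, in effect, a way of sampling a Markov chain whose transition odds are read straight off a source text. The sketch below captures that idea, with two hedges: corpus.txt stands in for Shannon’s book (any long English text will do), and the function names are mine. A context of one character reproduces digram (“second-order”) statistics; a context of two, trigram (“third-order”) statistics:

import random
from collections import defaultdict

def build_chain(text, context_len):
    # Record, for every context of context_len characters, the characters
    # that follow it in the source; repeats preserve the observed odds.
    chain = defaultdict(list)
    for i in range(len(text) - context_len):
        chain[text[i:i + context_len]].append(text[i + context_len])
    return chain

def approximate(text, context_len, length=100):
    chain = build_chain(text, context_len)
    context = random.choice(list(chain))      # seed with a random context
    out = list(context)
    for _ in range(length - context_len):
        followers = chain.get(context)
        if not followers:                     # dead end: reseed and carry on
            context = random.choice(list(chain))
            followers = chain[context]
        nxt = random.choice(followers)
        out.append(nxt)
        context = context[1:] + nxt           # slide the context window
    return "".join(out)

source = open("corpus.txt").read().upper()    # stand-in for Shannon's book
print(approximate(source, context_len=1))     # digram ("second-order") text
print(approximate(source, context_len=2))     # trigram ("third-order") text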

Not only are certain two- and three-letter combinations likelier than others, but so are entire strings of letters—in other words, words. Here is “first-order word approximation,” using the frequencies of whole words:

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
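First-order word approximation needs no frequency table at all: drawing uniformly from the raw token list of a long text is already frequency-weighted, because common words appear in the list more often. A brief sketch, with corpus.txt again standing in for the source text:

import random

def first_order_words(text, length=30):
    # Each draw is independent; a word's odds equal its share of the tokens.
    words = text.upper().split()
    return " ".join(random.choice(words) for _ in range(length))

print(first_order_words(open("corpus.txt").read()))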

Even further, our choice of the next word is strongly governed by the word that has just gone before. Finally, then, Shannon turned to “second-order word approximation,” choosing a random word, flipping forward in his book until he found another instance, and then recording the word that appeared next:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

“The particular sequence of ten words ‘attack on an English writer that the character of this’ is not at all unreasonable,” Shannon observed with pride.
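Shannon’s flip-forward procedure can be carried out exhaustively rather than page by page: record every word that ever follows each word in the source, then walk the resulting chain. A sketch of second-order word approximation, with names of my own choosing:

import random
from collections import defaultdict

def second_order_words(text, length=35):
    words = text.upper().split()
    followers = defaultdict(list)
    for a, b in zip(words, words[1:]):
        followers[a].append(b)                # repeats preserve the odds
    current = random.choice(words[:-1])       # any word with a successor
    out = [current]
    for _ in range(length - 1):
        if not followers[current]:            # last word of the text: reseed
            current = random.choice(words[:-1])
        else:
            current = random.choice(followers[current])
        out.append(current)
    return " ".join(out)

source = open("corpus.txt").read()            # stand-in for Shannon's book
print(second_order_words(source))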

Notes:

Monte Carlo method for building words and sentences.

Folksonomies: grammar information science probability redundancy


A Mind at Play: How Claude Shannon Invented the Information Age
Soni, Jimmy (2017), A Mind at Play: How Claude Shannon Invented the Information Age. Retrieved on 2018-07-27.
Folksonomies: information science biography