The 75% Ghost: How Claude Shannon Predicted the Limits of AI in 1950
Why modern LLMs are finally solving a "guessing game" played by humans 75 years ago.
We spend a lot of time these days thinking about “next token prediction.” It’s the engine behind every LLM, from GPT-4 to the local Llama instance running on your laptop. But it’s fascinating to realize that the theoretical limit of this technique was mapped out 75 years ago, long before we had GPUs to crunch the numbers.
I was revisiting Claude Shannon’s 1950 paper, “Prediction and Entropy of Printed English”, and it really struck me how modern his thinking was. He essentially treated the human brain as a language model to calculate the information density of English.
If you’re building AI today, this paper is the grandfather of your loss function. Let's break down what he did and why it still matters.
The Problem: How dense is English?
Shannon defined entropy (H) as the average number of bits required to encode a letter of text in the most efficient way possible.
If every letter was random (uniform probability), entropy would be log2(26)≈4.7 bits per letter.
If you count letter frequencies (E is common, Z is rare), it drops to 4.14 bits.
If you look at pairs (digrams like “TH” or “QU”), it drops to 3.56 bits.
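The three estimates above can be sketched in a few lines. This is a minimal illustration, not Shannon's procedure: he worked from published frequency tables, while the hypothetical helper `letter_entropies` below counts letters in whatever sample text you pass in. A short sample will underestimate the digram figure, so expect numbers near 4.14 and 3.56 only on a large corpus.

```python
import math
from collections import Counter

def entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution implied by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def letter_entropies(text):
    """Return (uniform, unigram, digram-conditional) estimates in bits per letter."""
    # Assumption: keep only the 26 letters a-z, ignoring case, spaces and punctuation.
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    uniform = math.log2(26)                      # every letter equally likely
    unigram = entropy_bits(Counter(letters))     # uses single-letter frequencies
    pairs = Counter(zip(letters, letters[1:]))
    prevs = Counter(letters[:-1])
    # H(next | previous) = H(pair) - H(previous)
    digram = entropy_bits(pairs) - entropy_bits(prevs)
    return uniform, unigram, digram
```

Calling `letter_entropies(open("corpus.txt").read())` on a long English text (the file name is a placeholder) should show the same downward trend as the numbers above.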
But Shannon hit a wall. He didn't have the compute to calculate probabilities for 4-grams, 5-grams, or entire sentences. He knew that language has "long-range statistics"—influences extending over paragraphs and chapters—that reduce entropy further.
The Solution: The “Shannon Game”
Since he couldn’t build a neural network, Shannon used the most advanced language processor available in 1950: a human.
He devised an experiment where he would show a subject a sequence of text (N letters) and ask them to guess the next letter.
If the subject guessed right, they moved on.
If they guessed wrong, they guessed again until they got it right, and the number of guesses was recorded. (In a simpler first version of the experiment, they were just told the correct letter.)
In effect, he ran by hand the measurement that cross-entropy loss automates today. By recording how many guesses a human needed to reach the right letter, he could derive upper and lower bounds on the true entropy of the language.
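Shannon's bounds can be computed directly from the guess counts. Writing q_i for the fraction of letters that were guessed correctly on the i-th attempt, the upper bound is the entropy of that distribution and the lower bound is the sum of i * (q_i - q_(i+1)) * log2(i). The sketch below assumes the q values are already in non-increasing order, and the sample numbers in the usage note are made up for illustration, not taken from the paper.

```python
import math

def shannon_bounds(guess_freqs):
    """Return (lower, upper) entropy bounds in bits per letter.

    guess_freqs[i] is the fraction of letters guessed correctly on
    attempt i + 1. Assumption: the values are non-increasing and sum to 1.
    """
    q = list(guess_freqs)
    # Upper bound: entropy of the guess-number distribution.
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    # Lower bound: sum over i of i * (q_i - q_{i+1}) * log2(i), with q past the end = 0.
    q_next = q[1:] + [0.0]
    lower = sum(
        i * (qi - qn) * math.log2(i)
        for i, (qi, qn) in enumerate(zip(q, q_next), start=1)
    )
    return lower, upper
```

For example, `shannon_bounds([0.8, 0.1, 0.05, 0.05])` describes a hypothetical subject who is right first time on 80% of letters; the two returned values bracket the entropy implied by that performance.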
The Finding: 75% Redundancy
His results were staggering. When humans had access to 100 characters of context, the entropy of English dropped to roughly 1 bit per letter.
This implies that English is about 75% redundant. But what does that actually mean? It’s a mathematical statement about efficiency. If a purely random alphabet requires 4.7 bits per letter, and English only requires 1 bit, then ≈3.7 bits of every letter are "wasted" space, statistically speaking. They exist not to convey new information, but to provide structure (grammar, spelling rules) that helps us correct errors.
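As a worked check, redundancy is one minus the ratio of the actual entropy to the maximum possible entropy. Plugging in the rounded figures from this post (the symbols R, H and H_max are just labels chosen here):

```latex
R = 1 - \frac{H}{H_{\max}} = 1 - \frac{1}{\log_2 26} \approx 1 - \frac{1}{4.7} \approx 0.79
```

With these rounded inputs the result lands a little above three quarters. Since the 1-bit figure is itself only an estimate, "about 75%" is best read as a round number rather than a precise measurement.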
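In simple terms: with an ideal encoding, a book could be stored in roughly a quarter of the space, and a perfect model (or a very patient human) could theoretically reconstruct the whole thing perfectly. That is not the same as deleting three out of every four letters at random, which would usually destroy information.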
The “Identical Twin” & The Ideal Predictor
To explain why prediction allows for compression, Shannon used a brilliant thought experiment involving “Identical Twins”.
Imagine you have a mathematical twin who thinks exactly like you. You are the sender, and the twin is the receiver.
You look at the text. Instead of sending the next letter, say “E”, you send its position in your own ranked list of guesses.
If “E” is the most likely letter (your #1 guess), you just send the number 1.
Your twin receives the number 1. Since they think exactly like you, they make their #1 guess, which is also “E”.
If you are both "Ideal Predictors," you almost always exchange the number 1. Shannon showed that this "Reduced Text" (a stream of 1s, 2s, and 3s) has a vastly unequal probability distribution, which makes it incredibly easy to compress.
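The twin scheme can be sketched with any deterministic predictor that both sides share. The example below is a toy, not Shannon's setup: it assumes a 27-symbol alphabet (lowercase letters plus space) and uses a hypothetical predictor, `build_predictor`, that ranks letters by how often they followed the previous letter in a shared training text.

```python
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # assumption: 26 letters plus space

def build_predictor(training_text):
    """Return a function mapping the previous letter to a ranked guess list."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(training_text, training_text[1:]):
        counts[prev][nxt] += 1

    def ranked_guesses(prev):
        c = counts.get(prev, Counter())
        # Most frequent follower first; ties broken alphabetically so both
        # "twins" always produce the same ordering.
        return sorted(ALPHABET, key=lambda ch: (-c[ch], ch))

    return ranked_guesses

def encode(text, ranked_guesses):
    """Sender: replace each letter by its rank in the predictor's guess list."""
    ranks, prev = [], " "
    for ch in text:  # every ch must be in ALPHABET, else .index raises ValueError
        ranks.append(ranked_guesses(prev).index(ch) + 1)
        prev = ch
    return ranks

def decode(ranks, ranked_guesses):
    """Receiver: run the same predictor and pick the guess at each rank."""
    out, prev = [], " "
    for r in ranks:
        ch = ranked_guesses(prev)[r - 1]
        out.append(ch)
        prev = ch
    return "".join(out)
```

With a predictor trained on text similar to the message, the output of `encode` is dominated by small numbers, which is the skewed "reduced text" described above, and `decode` recovers the message exactly because both sides rank identically.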
This “ranking” mechanism maps closely onto what happens at the output of a modern Transformer.
Logits are the raw guess scores.
Softmax converts them to probabilities.
Sorting those probabilities gives the ranked list of guesses, and Top-K keeps only the K highest-ranked candidates.
Shannon essentially described compression-by-prediction, and the ranked-guess view of what comes out of torch.softmax, decades before we wrote the code for it.
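To make the analogy concrete, here is a small standard-library sketch of the same pipeline. The logits are invented for illustration and the helper names (`softmax`, `rank_of`, `top_k`) are hypothetical; a real model would produce one score per vocabulary token.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def rank_of(target, probs):
    """Shannon-style guess number: 1 means the model's first guess was right."""
    ordered = sorted(probs, key=probs.get, reverse=True)
    return ordered.index(target) + 1

def top_k(probs, k):
    """Keep only the k most probable candidates."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

# Hypothetical scores for the next letter.
logits = {"e": 2.0, "a": 1.0, "t": 0.5, "z": -1.0}
probs = softmax(logits)
bits_if_e = -math.log2(probs["e"])  # cost in bits if the true letter is "e"
```

Here `rank_of("e", probs)` plays the role of the number the sender transmits to the twin, and `bits_if_e` is the per-letter quantity that cross-entropy loss averages over a dataset.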
A Surprising Detail: Reverse Prediction
One detail from the paper that often gets overlooked is that Shannon also tested reverse prediction. He asked subjects to guess the previous letter given the next letters.
You might expect this to be much harder—we speak forwards, not backwards. But Shannon found the results were "only slightly poorer". The entropy of the language is almost symmetric.
This is a fascinating point for modern AI. While causal models like GPT dominate right now, Shannon’s result suggests that context on either side of a letter is about equally informative, so bidirectional approaches (like BERT) are just as statistically valid and may be useful for capturing the full “redundancy” of language.
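At the level of letter pairs, the symmetry can be checked directly. The sketch below compares the entropy of the next letter given the previous one with the entropy of the previous letter given the next one. It only looks one letter in each direction, so it is a much weaker test than Shannon's experiment with human subjects, and the function name is a hypothetical one chosen here.

```python
import math
from collections import Counter

def entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution implied by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def forward_backward_bigram_entropy(text):
    """Return (H(next | prev), H(prev | next)) in bits per letter."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])   # letters that have a successor
    seconds = Counter(text[1:])   # letters that have a predecessor
    h_pair = entropy_bits(pairs)
    forward = h_pair - entropy_bits(firsts)
    backward = h_pair - entropy_bits(seconds)
    return forward, backward
```

The two values differ only through the first and last letters of the sample, so on any long text they come out nearly identical.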
Why this matters for LLMs (The “Plateau”)
This connects directly to training. The cross-entropy loss we minimize measures the gap between the model’s probability distribution and the true distribution of English: it equals the true entropy plus a penalty (the KL divergence) for every way the model’s distribution is wrong.
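A small numeric example shows why the true entropy acts as a floor for the loss. Cross-entropy splits into the entropy of the true distribution plus the KL divergence from the true distribution to the model, and the KL term is never negative. Both distributions below are invented three-letter toys, not measurements of English.

```python
import math

def entropy(p):
    """Entropy of distribution p, in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    """Expected bits per symbol when data follows p but we code with q."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

def kl_divergence(p, q):
    """Extra bits paid for using q instead of the true p."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Hypothetical "true" and model distributions over the next letter.
true_dist = {"e": 0.5, "t": 0.25, "a": 0.25}
model_dist = {"e": 0.4, "t": 0.4, "a": 0.2}

h = entropy(true_dist)                      # the floor
ce = cross_entropy(true_dist, model_dist)   # what training minimizes
kl = kl_divergence(true_dist, model_dist)   # the model's avoidable error
```

Training can shrink `kl` toward zero, but `h` is a property of the data and stays put.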
We often say that when a model’s loss plateaus, we are “hitting the limit.” But there’s a nuance here.
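The Floor: The “Intrinsic Entropy” of the text is the hard mathematical floor, and Shannon’s ~1 bit per letter is an estimate of it (his bounds at 100 letters of context were roughly 0.6 to 1.3 bits). You cannot predict better than the inherent unpredictability of the data.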
The Plateau: When your validation loss flattens out, you haven’t necessarily hit Shannon’s 1-bit floor. You’ve likely just hit the capacity limit of your specific architecture.
We are constantly chasing Shannon’s limit, but a plateau usually means we need a better model, not that we’ve “solved” the language.
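One practical wrinkle when comparing a training curve to Shannon's number: frameworks typically report loss in nats per token, while Shannon worked in bits per letter. A rough conversion is sketched below. The function name and the figure of 4 characters per token are assumptions for illustration; the real ratio depends on your tokenizer and data.

```python
import math

def nats_per_token_to_bits_per_char(loss_nats, chars_per_token):
    """Convert a cross-entropy loss in nats/token to bits/character."""
    bits_per_token = loss_nats / math.log(2)  # 1 nat = 1 / ln(2) bits
    return bits_per_token / chars_per_token

# Hypothetical example: validation loss of 2.0 nats/token, ~4 characters per token.
estimate = nats_per_token_to_bits_per_char(2.0, 4.0)
```

Even after converting, the comparison is only approximate, because Shannon's subjects predicted a 27-symbol alphabet of letters and spaces, while modern corpora include case, punctuation, code and other languages.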
The Takeaway
When Shannon estimated that entropy drops to ~1 bit per letter with long context, he showed that language is highly structured but deceptively complex.
For us engineers, it’s a reminder: We aren’t just teaching models to “guess.” We are teaching them to recover the latent structure of human thought. Shannon used humans to prove the limit existed; we use Transformers to try and reach it.
Conclusion
Shannon’s 1950 paper provides the theoretical groundwork for the statistical modeling of language. By quantifying the redundancy of English at approximately 75% and defining the “Ideal Predictor,” he established the mathematical bounds that modern large language models operate within.
While Shannon relied on human subjects to estimate these values, today's Transformers optimize against the same information-theoretic limits. Understanding this historical context offers a clearer perspective on the mechanics of loss functions and the inherent constraints of next-token prediction.

