LLMs Are Lossy Compression for the Entire WWW

To grasp the proposed relationship between compression and understanding, imagine that you have a text file containing a million examples of addition, subtraction, multiplication, and division. Although any compression algorithm could reduce the size of this file, the way to achieve the greatest compression ratio would probably be to derive the principles of arithmetic and then write the code for a calculator program. Using a calculator, you could perfectly reconstruct not just the million examples in the file but any other example of arithmetic that you might encounter in the future. The same logic applies to the problem of compressing a slice of Wikipedia. If a compression program knows that force equals mass times acceleration, it can discard a lot of words when compressing the pages about physics because it will be able to reconstruct them. Likewise, the more the program knows about supply and demand, the more words it can discard when compressing the pages about economics, and so forth.
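To make the calculator analogy concrete, here is a minimal Python sketch. The corpus, its "a op b = result" line format, and the helper names are invented for illustration: a compressor that has derived the rules of arithmetic can discard every answer, because the rules regenerate them exactly on decompression.

```python
import gzip
import random

# Hypothetical corpus of worked arithmetic examples (invented for illustration).
random.seed(0)
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: round(a / b, 6)}

examples = []
for _ in range(100_000):
    a, b = random.randint(1, 999), random.randint(1, 999)
    op = random.choice("+-*/")
    examples.append(f"{a} {op} {b} = {OPS[op](a, b)}")
original = "\n".join(examples)

# "Rule-aware" compression: keep only the questions; the answers are implied
# by the rules of arithmetic.
questions = "\n".join(line.split(" = ")[0] for line in examples)

def decompress(questions_text):
    # Lossless reconstruction: reapply the rules to recompute every answer.
    out = []
    for q in questions_text.splitlines():
        a, op, b = q.split()
        out.append(f"{q} = {OPS[op](int(a), int(b))}")
    return "\n".join(out)

assert decompress(questions) == original  # perfect reconstruction

print("gzip of full file:     ", len(gzip.compress(original.encode())))
print("gzip of questions only:", len(gzip.compress(questions.encode())))
```

On this toy corpus the questions-only file is substantially smaller than the full file, yet decompression is exact, and the same rules answer arithmetic problems that never appeared in the original data.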

Large language models identify statistical regularities in text. Any analysis of the text of the Web will reveal that phrases like “supply is low” often appear in close proximity to phrases like “prices rise.” A chatbot that incorporates this correlation might, when asked a question about the effect of supply shortages, respond with an answer about prices increasing. If a large language model has compiled a vast number of correlations between economic terms—so many that it can offer plausible responses to a wide variety of questions—should we say that it actually understands economic theory? Models like ChatGPT aren’t eligible for the Hutter Prize for a variety of reasons, one of which is that they don’t reconstruct the original text precisely—i.e., they don’t perform lossless compression. But is it possible that their lossy compression nonetheless indicates real understanding of the sort that A.I. researchers are interested in?
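The phrase-correlation idea can be shown in a few lines. This toy sketch (the corpus and phrase list are invented; a real model learns regularities from vast amounts of text via learned parameters, not a literal phrase table) counts how often economic phrases co-occur in the same sentence and then "answers" a prompt with the phrase most associated with it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus (invented for illustration).
corpus = [
    "when supply is low prices rise",
    "analysts warn that supply is low and prices rise sharply",
    "prices fall when supply is high",
    "demand surges and prices rise",
    "supply is high so prices fall",
]

PHRASES = ["supply is low", "supply is high", "prices rise",
           "prices fall", "demand surges"]

# Count how often each pair of phrases appears in the same sentence.
cooccur = Counter()
for sentence in corpus:
    present = [p for p in PHRASES if p in sentence]
    for a, b in combinations(sorted(present), 2):
        cooccur[(a, b)] += 1

def respond(prompt_phrase):
    # Reply with the phrase that most often appears alongside the prompt.
    scores = {p: cooccur[tuple(sorted((prompt_phrase, p)))]
              for p in PHRASES if p != prompt_phrase}
    return max(scores, key=scores.get)

print(respond("supply is low"))  # -> "prices rise"
```

A model built this way can produce a plausible answer about prices when asked about supply shortages without anything we would ordinarily call a theory of economics, which is exactly the question the passage raises.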

Notes:

Folksonomies: ai llm large language model

Taxonomies:
/science/mathematics/arithmetic (0.978318)

Concepts:
Lossless compression (0.985698): dbpedia_resource
Lossy compression (0.933582): dbpedia_resource
Data compression (0.916028): dbpedia_resource
Arithmetic (0.885651): dbpedia_resource
Subtraction (0.878060): dbpedia_resource
Algorithm (0.868547): dbpedia_resource
Physics (0.864630): dbpedia_resource
Multiplication (0.858646): dbpedia_resource

ChatGPT Is a Blurry JPEG of the Web
Electronic/World Wide Web > Internet Article: Chiang, Ted (February 9, 2023), "ChatGPT Is a Blurry JPEG of the Web". Retrieved 2023-03-27.
  • Source Material [www.newyorker.com]
  • Folksonomies: ai chat-gpt large language models llm