In 1949, George Kingsley Zipf published the book "Human Behavior and the Principle of Least Effort," in which he developed what became known as "Zipf's law."
Zipf was one of the first to do research in word frequency. A linguist by profession, for some reason he decided to study statistical patterns of word frequency in texts. He ended up developing a law based on efficiency and effort that proved to be true across an astonishingly wide range of writings both in English and in other languages, and was even extended beyond language to other human activities.[*]
His law essentially says (if I may reduce it down to an essence) that for reasons of efficiency, some words are used often, in different combinations and with different meanings, while more specialized words become the long tail of speech and writing. Much of "Human Behavior" consists of detailed formulas and charts based on the few full texts that he had to work with. All show the same statistical curve: a few words have very high frequency, and the curve drops off rapidly until it reaches the proverbial long tail. The figures themselves could be seen as an interesting glitch in linguistic reality, but what made Zipf’s law powerful was his explanation of the reason for the consistency of the data.
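To make the shape of that curve concrete, here is a minimal Python sketch of the kind of counting involved. It is an illustration only, not a reconstruction of Zipf's own procedure: the function name `rank_frequency` and the toy sentence are hypothetical, and the tokenizer is deliberately naive (lowercase, then split on whitespace). The law is usually summarized as rank times frequency staying roughly constant, so the last column prints that product; a sentence this short cannot show the pattern, which only emerges over a full text.

```python
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Naive tokenization, assumed for illustration: lowercase + whitespace split.
    words = text.lower().split()
    counts = Counter(words)
    # most_common() orders by descending count; ties keep first-seen order.
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


# Hypothetical toy input; substitute the full text of a book to see the curve.
sample = "the cat and the dog and the bird saw the fish"
table = rank_frequency(sample)

for rank, word, count in table:
    # rank * count is the quantity that stays roughly constant under Zipf's law.
    print(rank, word, count, rank * count)
```

Even in this toy sample, one word ("the") dominates and five of the seven distinct words appear only once, which is the long tail in miniature.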
Zipf explained his theory using a workbench as a metaphor. On a well-organized workbench, the tools kept closest to the worker are those with the widest uses, such as the proverbial hammer that can be used for a great number of chores. In that close circle are also tools that can be combined to tackle tasks that neither could complete alone. As you move further out from the center, you find tools that are used less often and that become increasingly specialized in their functions. Presumably you eventually arrive at a tool that can do only one job, a tool that you almost never need.
His pioneering work made use, of all things, of a concordance of James Joyce's Ulysses, a text that contains many words that are not in my toolbox. Joyce was notably fanciful in his use of language, and yet the statistical measure that Zipf obtained from Ulysses was comparable to measures obtained from other texts, including some in the Chinese language.
Zipf showed, to my mind, that the more frequent a word is, the less it tells you about its meaning in the text in which it is found. A brick doesn't tell you what the house looks like.
Later research into term frequency failed to understand what Zipf was saying and took a high term frequency to indicate that a word's importance was proportional to the number of times it appears in a text. This is clearly not what he said, and a quick way to see that is to remember that the first text that he ran his theory on was James Joyce’s Ulysses, a text that defies prediction. (The current AI "predictive" language can produce the upper echelons of Zipf's frequency curve, but not the beauty of the Joycean long tail.)
It seems obvious that an individual string of characters unbroken by whitespace is not necessarily a unit of meaning. Think about the meanings of "ice" and "cream" and how these differ from the meaning of "ice cream". This is more of a problem in some languages than others. Some languages, like Italian, have a single word ("gelato") for that concept; others, like German, have a way of avoiding the whitespace split ("Eiscreme") by putting the elements "Eis" (ice) and "Creme" (cream) together into a single word. (I am beginning to think that the German method of creating composite words that form a single unit for a concept is superior to the English practice of using separate words for composite concepts.)
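A short Python sketch shows how the whitespace definition of "word" handles these three renderings of the same concept. The helper `whitespace_tokens` and the `phrases` dictionary are hypothetical names chosen for this illustration; the only assumption is that a "word" is whatever `str.split()` returns.

```python
def whitespace_tokens(text):
    """Split text into 'words' using whitespace as the only boundary."""
    return text.split()


# The same concept as written in three languages (examples from the text above).
phrases = {
    "English": "ice cream",
    "Italian": "gelato",
    "German": "Eiscreme",
}

# Count how many whitespace-delimited 'words' each rendering produces.
token_counts = {
    language: len(whitespace_tokens(phrase))
    for language, phrase in phrases.items()
}

for language, phrase in phrases.items():
    print(language, whitespace_tokens(phrase), token_counts[language])
```

The English phrase is counted as two words and the Italian and German as one each, although all three name a single thing. A frequency count built on this split would credit English with one occurrence each of "ice" and "cream" and none of "ice cream".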
The problem that I see is that while you can observe human activity, including writing, and turn that observation into numbers, it's rarely clear what the numbers represent besides, well, being numbers. This is even true in the area of linguistics that was Zipf's main bailiwick. We are so accustomed to the definition of "word" as any characters bookended with some whitespace character that we fail to think about the great complexity that language presents, and we forget that whitespace in writing has its purposes but does not define meaning.
-----
[*] Zipf's law has been used to describe the relative sizes of cities in a country [1], gene expression [2], company income distribution [3], and more.
[1] Xavier Gabaix, Zipf's Law for Cities: An Explanation, The Quarterly Journal of Economics, Volume 114, Issue 3, August 1999, Pages 739–767, https://doi.org/10.1162/003355399556133
[2] Chikara Furusawa and Kunihiko Kaneko, Zipf’s Law in Gene Expression, Physical Review Letters, Volume 90, Issue 8, 2003, 088102.