Zipf's law
In his 1949 book Human Behavior and the Principle of Least Effort,
G. K. Zipf proposed an empirical law on word frequencies in natural-language
speech and texts. Zipf's law states that while only a few words are used very
often, many or most are used rarely.
More precisely, the law states that when all words occurring in a sufficiently
comprehensive text are tabulated and ranked by their frequency, the product of
rank number and frequency is approximately constant.
In addition, the number of different words in the vocabulary is equal to the
frequency of the most common word (rank number 1).
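The tabulation described above can be sketched in a few lines of Python. This is only a minimal illustration, not Zipf's own procedure: the function name rank_frequency_table, the simple tokenizer, and the sample sentence are all assumptions made for the example.

```python
from collections import Counter
import re


def rank_frequency_table(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent word first."""
    # Crude tokenizer (an assumption): lower-case runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending frequency; break ties alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]


# Hypothetical sample text; a real test needs a much larger corpus.
sample = "the cat and the dog and the bird"
for row in rank_frequency_table(sample):
    print(row)
```

A single sentence is far too short for the product in the last column to settle down. The regularity is claimed only for a sufficiently comprehensive text, so the function should be applied to a large corpus before any conclusion is drawn.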
Zipf's law may be stated
mathematically as:

$$f(k; s, N) = \frac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}}$$

where f(k; s, N) is the predicted relative frequency of the element of rank k,
N is the number of elements, k is their rank, and s
is the exponent characterizing the distribution. In the example
of the frequency of words in the English language, N is
the number of words in the English language and, if we use the
classic version of Zipf's law, the exponent s is 1.
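As a rough numerical illustration, the sketch below assumes the usual normalised form of the law, in which the frequency of the element of rank k is 1/k^s divided by the sum of 1/n^s over n = 1, ..., N. The function name zipf_frequency and the vocabulary size of 1000 are hypothetical choices made for the example.

```python
def zipf_frequency(k, s, N):
    """Predicted relative frequency of the element of rank k among N elements."""
    if N < 1 or not 1 <= k <= N:
        raise ValueError("require 1 <= k <= N")
    # Normalising constant: the sum of 1/n^s over all ranks n = 1..N.
    normaliser = sum(1.0 / n ** s for n in range(1, N + 1))
    return (1.0 / k ** s) / normaliser


N = 1000  # hypothetical vocabulary size
for k in (1, 2, 3, 10, 100):
    f = zipf_frequency(k, s=1, N=N)
    # With s = 1 the product k * f is the same for every rank.
    print(k, round(f, 5), round(k * f, 5))
```

With the classic exponent s = 1, the product of rank and predicted frequency is the same for every rank, which is the constant product of the verbal statement of the law. With other values of s the product varies with rank.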
Zipf also
provided a theoretical explanation for his law. He regarded the law as an
expression of a competition between two economic principles: "the economy
of the speaker", which tends towards a reduction of the number of words in a
language, and "the economy of the listener", which tends towards the use of a
new word for each new linguistic act that the speaker wishes to perform. In all
persons speaking a language fluently the two are in balance, and Zipf's law is an
indication that this balance has been reached. The balance is absent, however,
in, for example, immigrants who are in the process of learning a new language.
In other words: Zipf's basic idea was
that there are two opposing forces that guide the evolution of language:
unification and
diversification. From the speaker’s point of view, it is desirable in terms of
effort minimization
to communicate all meaning via a single word or sound. For the listener, it is
desirable to have a different word associated with each separate meaning.
Language evolves, Zipf suggests, in a way that optimizes the cost of
communicative transactions between speakers and listeners.
Zipf's law is
also an expression of more universal regularities. Zipf himself found that his law held for
the populations of cities when plotted as a function of
rank (the most populous city is ranked number one, etc.). Fedorowicz (1982)
expresses the view that Zipf's law can be applied to phenomena as
different as distributions of income, the size of companies and biological genera
and species. Zipf's law is believed to be equivalent to distributions in laws
formulated by Yule, Lotka, Pareto, Bradford and Price. Zipf's law is often
assumed to be related to other bibliometric laws (cf., for example, Chen & Leimkuhler, 1986; Kunz, 1988).
Zipf's law has
been influential in Library and Information Science (LIS) in, for example,
examinations of whether information
retrieval languages are in accordance with it (cf., for example, Blair, 1990; Egghe, 1991; Fedorowicz, 1982; Ohly,
1982; Wyllys, 1981).
Literature:
Blair, D. C. (1990). Language and Representation in Information Retrieval.
Amsterdam: Elsevier.
Brookes, B. C. (1968). The derivation and application of the Bradford-Zipf distribution.
Journal of Documentation, 24(4), 247-265.
Brookes, B. C. (1969). The complete Bradford-Zipf 'Bibliograph'.
Journal of Documentation, 25(1), 58-60.
Buckland, M. K. & Hindle, A. (1969). Library Zipf. Journal of Documentation, 25(1),
52-56.
Chen, Y.-S. & Leimkuhler, F. F. (1986). A relationship between Lotka's law,
Bradford's law, and Zipf's law. Journal of the American Society for Information
Science, 37(5), 307-314.
Egghe, L. (1991). The exact place of Zipf's and Pareto's law amongst the classical
information laws. Scientometrics, 20(1), 93-106.
Fairthorne, R. A.
(1969). Empirical hyperbolic distributions (Bradford-Zipf-Mandelbrot)
for bibliometric description and prediction. Journal of Documentation, 25(4),
319-343.
Fedorowicz, J.
(1982). The theoretical foundation of Zipf's law and its application
to the bibliographic database environment. Journal of the American Society for
Information Science, 33(5), 285-293.
Gabaix, X. (1999). Zipf's law for cities: An explanation.
Quarterly Journal of Economics, 114(3), 739-767.
Available at:
http://econ-www.mit.edu/faculty/download_pdf.php?id=530 (Retrieved
2007-08-16).
Gelbukh, A. & Sidorov, G. (2001). Zipf and Heaps Laws’ Coefficients Depend on
Language. Proc. CICLing-2001, Conference on Intelligent Text Processing and
Computational Linguistics, February 18–24, 2001, Mexico City. Lecture Notes in
Computer Science N 2004, ISSN 0302-9743, ISBN 3-540-41687-0, Springer-Verlag, pp.
332–335. Retrieved 2007-08-16 from:
http://www.gelbukh.com/CV/Publications/2001/CICLing-2001-Zipf.htm
Kali, R. (2003). The city as a giant component: a random graph
approach to Zipf's law. Applied Economics Letters,
10(11), 717-720.
Kunz, M. (1988). Lotka and Zipf: paper dragons with fuzzy tails.
Scientometrics, 13(5-6), 289-297.
Nicholls, P. T.
(1987). Estimation of Zipf parameters. Journal of the American
Society for Information Science, 38(6), 443-445.
Ohly, H. P. (1982). A procedure for comparing documentation language applications:
the transformed Zipf curve. International Classification, 9(3), 125-128.
Wyllys, R. E.
(1981). Empirical and theoretical bases of Zipf's Law. Library
Trends, 30(1), 53-64.
Zipf, G. K. (1932). Selected Studies of the Principle
of Relative Frequency in Language. Cambridge, Massachusetts: Harvard
University Press.
Zipf, G. K. (1949). Human Behavior and the Principle
of Least Effort. Reading, MA:
Addison-Wesley.
See also:
Bibliometrics
Birger Hjørland
Last edited:
18-08-2007