Tf-idf (term frequency–inverse document frequency)

Tf–idf is a term weight widely used in information retrieval to score how important a term is to a document within a collection. Its inverse document frequency (IDF) component goes back to a 1972 paper by Karen Spärck Jones, published in the Journal of Documentation, which defined the term weighting scheme now known by that name.


"The original exchange in 1972 was part of the stimulus for the development (via a short paper [1] in 1974) for the Robertson/Spärck Jones relevance weighting model of 1976 [2]. However, the circle was not fully closed until the Croft/Harper paper of 1979 [3] which showed IDF as an approximation to RSJ relevance weighting, together with a much later paper [4] which clarified the difference between the Croft/Harper approximation and the original formula. A short technical report [5] summarises the text retrieval methods developed in this framework, and a comprehensive paper [6] covers the combination of IDF weighting with other weighting factors and reports extensive experimental results." (Robertson, 2005)



F. Sebastiani writes:


"One popular class of statistical term weighting functions is tf * idf (see e.g. Salton & Buckley, 1988) where two intuitions are at play:

  1. the more frequently t_k occurs in d_j, the more important for d_j is it (the term frequency intuition);

  2. the more documents t_k occurs in, the less discriminating is it, i.e. the smaller its contribution is in characterizing the semantics of a document in which it occurs (the inverse document frequency intuition)." (Sebastiani, 2003)
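
The two intuitions quoted above can be made concrete with a small sketch. It assumes one common variant of the weight, not necessarily the one used by the authors cited here: tf is the raw count of a term in a document, and idf is log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name tf_idf and the toy corpus are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant (one of several in use):
      tf  = raw count of the term in the document
      idf = log(N / df), N = number of documents,
            df = number of documents containing the term
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document it occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy corpus, already tokenized (illustrative data only).
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["the", "cat", "chased", "the", "dog"],
]
w = tf_idf(docs)
```

In this toy corpus "the" occurs in every document, so its idf is log(3/3) = 0 and its weight vanishes even though it is the most frequent term in the first document. "mat" occurs in only one document and so receives a larger weight than "cat", which occurs in two.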


The idf measure is also known as statistical specificity.




