Stemming

In natural language processing (NLP), stemming techniques are used to group together words that derive from a common root but appear in a variety of forms, depending on their function in a sentence or on variations in meaning.

 

Anderson & Pérez-Carballo provide the following example:

 

index

indexes

indexer

indexable

indices

 

"Stemming was developed to automatically remove certain common suffixes, or word endings (and sometimes prefixes, like 're' or 're-' as in 're-indexing') in order to increase the count for important words, and also in order to find word occurrences when the word form in the text does not match the word form in the search statement" (Anderson & Pérez-Carballo, 2001, 260).
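
The two effects named in the quotation (higher counts for important words, and matching across word forms) can be illustrated with a minimal sketch. The rule table `TOY_RULES` and the function `toy_stem` are hypothetical and were written only to cover the five example words above; they are not the rules of any published stemmer.

```python
from collections import Counter

# Hypothetical toy rules (suffix, replacement), checked in order.
# Illustrative only: these are NOT the Lovins or Porter rules.
TOY_RULES = [
    ("ices", "ex"),  # indices   -> index
    ("able", ""),    # indexable -> index
    ("es", ""),      # indexes   -> index
    ("er", ""),      # indexer   -> index
]

def toy_stem(word):
    """Apply the first matching rule, keeping at least 3 letters before the suffix."""
    word = word.lower()
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

text = "index indexes indexer indexable indices"

# Effect 1: all five forms are counted under one stem.
counts = Counter(toy_stem(w) for w in text.split())
print(counts)  # Counter({'index': 5})

# Effect 2: a query form matches a different form in the text.
print(toy_stem("indices") == toy_stem("indexes"))  # True
```

A rule table this small will mis-stem most other vocabulary; it only shows the principle of conflating word forms before counting and matching.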

 

Several stemmers have been developed. The most primitive simply removes the final "s" that marks the plural in English. Such a stemmer, however, also removes the "s" in words like "business" and "mathematics", and it is insufficient for words like "tomatoes", where the plural ending is "es". For this reason, much more sophisticated stemmers have been proposed, for example the Lovins stemmer and the Porter stemmer, named after their creators (cf. Lovins, 1968; Porter, 1980).
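
The shortcomings of the most primitive approach can be reproduced in a few lines. The function name `strip_s` is an assumption made for this sketch; it implements only the "remove a final s" rule described above and nothing from the Lovins or Porter algorithms.

```python
def strip_s(word):
    """Most primitive stemmer: drop a single trailing 's' if there is one."""
    return word[:-1] if word.endswith("s") else word

for w in ["stemmers", "business", "mathematics", "tomatoes"]:
    print(w, "->", strip_s(w))

# stemmers -> stemmer        (the intended effect)
# business -> busines        (an 's' that is not a plural marker is removed)
# mathematics -> mathematic  (same problem)
# tomatoes -> tomatoe        (removing 's' is not enough; the ending is 'es')
```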

 

Anderson & Pérez-Carballo (2001) cite diverging evidence on the question of whether stemming, on average, improves retrieval performance.

 

Lemmatisation is closely related to stemming. See further the entry Word (Lifeboat for KO).

 

 

 

Literature:

 

Anderson, J. D. & Pérez-Carballo, J. (2001). The nature of indexing: How humans and machines analyze messages and texts for retrieval. Part II: Machine indexing, and the allocation of human versus machine effort. Information Processing & Management, 37, 255-277.

 

Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42(1), 7-15.

 

Harman, D. (1994). Automatic indexing. IN: Fidel, R.; Hahn, T. B.; Rasmussen, E. M. & Smith, P. J. (Eds.). Challenges in indexing electronic texts and images. (pp. 247-264). Medford, NJ: Learned Information.

 

Krovetz, R. (1993). Viewing morphology as an inference process. IN: Korfhage, R.; Rasmussen, E. & Willett, P. (Eds.). Proceedings of the 16th annual international ACM-SIGIR conference on research and development in information retrieval (pp. 191-202); 27 June-1 July 1993; Pittsburgh, PA. New York: Association for Computing Machinery. (Also available as UMass technical report TR-93-35).

 

Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1), 23-31.
 

Porter, M. F. (1980). An algorithm for suffix-stripping. Program, 14, 130-137. Available: http://www.tartarus.org/martin/PorterStemmer/def.txt

 

Wikipedia. The free encyclopedia. (2006). Stemmer. http://en.wikipedia.org/wiki/Stemmer

 

 

See also: Truncation; Word

 

 

 

 

 

 

 

 

Birger Hjørland

Last edited: 01-05-2006
