Text categorization

"Text categorization is a machine learning approach, in which also information retrieval methods are applied. It involves manually categorizing a number of documents to pre-defined categories (which normally lack devices for the control of polysemy, synonymy and homonymy). By learning the characteristics of those documents the automated categorization of new documents takes place. Text categorization is known as supervised learning, since the process is 'supervised' by learning categories' characteristics from manually categorized documents" (Golub, 2005, p. 52).
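
The supervised process described in the quotation can be illustrated with a small sketch. It is not taken from Golub or from any of the works listed below; it assumes a simple multinomial Naive Bayes model with Laplace smoothing, a whitespace tokenizer, and a hypothetical four-document training set with the invented categories "medicine" and "sport". Real systems typically use far larger training collections and more careful text processing.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately naive assumption: lowercase and split on whitespace.
    return text.lower().split()


def train(labelled_docs):
    """Learn category characteristics from manually categorized (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of training documents
    vocabulary = set()
    for text, category in labelled_docs:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def categorize(text, model):
    """Assign a new document to the category with the highest log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[category][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical, manually categorized training documents.
training = [
    ("the patient received a new drug treatment", "medicine"),
    ("clinical trial of the heart drug", "medicine"),
    ("the team won the football match", "sport"),
    ("the striker scored in the final match", "sport"),
]
model = train(training)

print(categorize("a trial of the new treatment", model))  # medicine
print(categorize("the final football score", model))      # sport
```

The two print statements categorize documents that were not part of the training set. Note that the sketch treats each word form as a separate feature, so, like the pre-defined categories mentioned in the quotation, it has no device for handling polysemy, synonymy or homonymy.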

Literature:

Aphinyanaphongs, Y.; Tsamardinos, I.; Statnikov, A.; Hardin, D. & Aliferis, C. F. (2005). Text categorization models for high-quality article retrieval in internal medicine. Journal of the American Medical Informatics Association, 12(2), 207-216.

Golub, K. (2005). Automated subject classification of textual web pages, for browsing. Lund: Lund University, Department of Information Technology.

Ko, Y.; Park, J. & Seo, J. (2004). Improving text categorization using the importance of sentences. Information Processing & Management, 40(1), 65-79.

Kwon, O. W. & Lee, J. H. (2003). Text categorization based on k-nearest neighbor approach for Web site classification. Information Processing & Management, 39(1), 25-44.

Lam, W.; Ruiz, M. & Srinivasan, P. (1999). Automatic text categorization and its application to text retrieval. IEEE Transactions on Knowledge and Data Engineering, 11(6), 865-879.

Moens, M. F. & Dumortier, J. (2000). Text categorization: the assignment of subject descriptors to magazine articles. Information Processing & Management, 36(6), 841-861.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.
http://www.math.unipd.it/~fabseb60/Publications/ACMCS02.pdf

Stamatatos, E.; Kokkinakis, G. & Fakotakis, N. (2000). Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4), 471-495.

Yang, Y. M. & Wilbur, J. (1996). Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science, 47(5), 357-369.

See also: Cluster & clustering

Birger Hjørland

Last edited: 03-05-2006
