Text categorization
"Text categorization is a machine learning approach, in which also information retrieval methods are applied. It involves manually categorizing a number of documents to pre-defined categories (which normally lack devices for the control of polysemy, synonymy and homonymy). By learning the characteristics of those documents the automated categorization of new documents takes place. Text categorization is known as supervised learning, since the process is 'supervised' by learning categories' characteristics from manually categorized documents". (Golub, p. 52).
Literature:
Aphinyanaphongs, Y.; Tsamardinos, I.; Statnikov, A.; Hardin, D. & Aliferis, C. F. (2005). Text categorization models for high-quality article retrieval in internal medicine. Journal of the American Medical Informatics Association, 12(2), 207-216.
Golub, K. (2005). Automated subject classification of textual web pages, for browsing. Lund: Lund University, Department of Information Technology.
Ko, Y.; Park, J. & Seo, J. (2004). Improving text categorization using the importance of sentences. Information Processing & Management, 40(1), 65-79.
Kwon, O. W. & Lee, J. H. (2003). Text categorization based on k-nearest neighbor approach for Web site classification. Information Processing & Management, 39(1), 25-44.
Lam, W.; Ruiz, M. & Srinivasan, P. (1999). Automatic text categorization and its application to text retrieval. IEEE Transactions on Knowledge and Data Engineering, 11(6), 865-879.
Moens, M. F. & Dumortier, J. (2000). Text categorization: the assignment of subject descriptors to magazine articles. Information Processing & Management, 36(6), 841-861.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47. http://www.math.unipd.it/~fabseb60/Publications/ACMCS02.pdf
Stamatatos, E.; Kokkinakis, G. & Fakotakis, N. (2000). Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4), 471-495.
Yang, Y. M. & Wilbur, J. (1996). Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science, 47(5), 357-369.
See also: Cluster & clustering
Birger Hjørland
Last edited: 03-05-2006