Information retrieval evaluation

An important milestone in the evaluation of information retrieval systems was the Cranfield experiments, in which the measurement of recall and precision was first established (cf. Cleverdon, 1967). Many alternatives have been suggested, including fallout (the proportion of all nonrelevant documents in the collection that are retrieved). Recall and precision have, however, proved remarkably durable. A core problem in such experiments is connected to the concept of relevance.
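As a minimal sketch of how these three measures relate, the following Python function computes precision, recall and fallout from sets of document identifiers, taking fallout in its usual sense (nonrelevant documents retrieved divided by all nonrelevant documents in the collection). The function name, the document identifiers and the collection size are hypothetical and chosen only for illustration; binary relevance judgments are assumed.

```python
def retrieval_metrics(retrieved, relevant, collection_size):
    """Return (precision, recall, fallout) for one query.

    Assumes binary relevance and that every retrieved or relevant
    document belongs to a collection of `collection_size` documents.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)

    hits = len(retrieved & relevant)          # relevant documents retrieved
    false_hits = len(retrieved - relevant)    # nonrelevant documents retrieved
    nonrelevant_total = collection_size - len(relevant)

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    fallout = false_hits / nonrelevant_total if nonrelevant_total else 0.0
    return precision, recall, fallout


# Hypothetical query: 4 documents retrieved, 3 relevant, 10 in the collection.
precision, recall, fallout = retrieval_metrics(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
    collection_size=10,
)
print(precision, recall, fallout)
```

In this toy case two of the four retrieved documents are relevant (precision 2/4), two of the three relevant documents are found (recall 2/3), and two of the seven nonrelevant documents are retrieved (fallout 2/7).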

 

"The Cranfield model has not lost its appeal, but researchers now typically employ two or more measurement and evaluation methods in studies of interactive retrieval systems. These include transaction log analysis, questionnaires, interviews, video-based observation, and classic recall and precision analysis" (Hildreth, 2001).

 

The F-measure (Van Rijsbergen, 1979) combines recall and precision in a single effectiveness measure, the harmonic mean of precision and recall:

 

 F = 2 * (recall * precision) / (recall + precision)
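The formula can be sketched in Python as follows. The function name and the sample values are hypothetical; returning 0.0 when both inputs are zero is an assumption made here to avoid division by zero, not something the formula itself specifies.

```python
from statistics import harmonic_mean


def f_measure(recall, precision):
    """F = 2 * (recall * precision) / (recall + precision)."""
    if recall + precision == 0:
        return 0.0  # assumed convention for the undefined 0/0 case
    return 2 * (recall * precision) / (recall + precision)


# Hypothetical values: perfect recall, but only half the output is relevant.
f = f_measure(recall=1.0, precision=0.5)
print(f)                           # about 0.667
print(harmonic_mean([1.0, 0.5]))   # same value, computed as a harmonic mean
```

The arithmetic mean of these two values would be 0.75; the harmonic mean is lower because it is pulled toward the weaker of the two measures.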

 

"In addition to quality measures of output, several other factors have become the focus of evaluation efforts. Among these are ease of use, system browsability, system efficiency, and satisfaction with the system and the search experience as a whole" (Hildreth, 2001).

 

In TREC, evaluation is done by pooling the documents retrieved by the participating sites in a track. Each participating site submits the 1000 top-ranked documents retrieved by its system. The top n (e.g., 100) documents submitted by each participant are then pooled, and the pooled documents are judged by assessors engaged by the National Institute of Standards and Technology (NIST). Standard metrics of recall, precision and F-measure are usually used to evaluate the effectiveness of the systems participating in the experiments (cf. Voorhees, 2005).
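The pooling step can be illustrated with a small Python sketch. It is not TREC's actual tooling: the function name, the site names and the document identifiers are hypothetical, and the pool depth is set to 2 only to keep the example short.

```python
def build_pool(runs, depth):
    """Union of the top `depth` documents from each submitted ranking.

    `runs` maps a participant name to its ranked list of document ids,
    best first. Only documents in the returned pool would be judged.
    """
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


# Hypothetical submissions from three participating sites.
runs = {
    "site_a": ["d1", "d2", "d3", "d4"],
    "site_b": ["d2", "d5", "d1", "d6"],
    "site_c": ["d7", "d1", "d8", "d2"],
}
pool = build_pool(runs, depth=2)
print(sorted(pool))  # ['d1', 'd2', 'd5', 'd7']
```

Because the rankings overlap, the pool is smaller than the number of sites times the depth. Documents that no site ranked within the pool depth (here d3, d4, d6 and d8) are never shown to the assessors.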

 

 

 


 


Literature:

 

Cleverdon, C. (1967). The Cranfield tests on index language devices. Aslib Proceedings, 19(6), 173-194.
 

Griffith, B. C.; White, H. D.; Drott, M. C. & Saye, J. D. (1984). An Analysis of the National Library of Medicine's (NLM) Handling of the Medical Behavioral Sciences' (MBS) Literatures: Some Research Tests of Methods for Evaluating Bibliographic Databases. Philadelphia, Pennsylvania. 33 pp. (mimeo.)
 

Hildreth, C. R. (2001). Accounting for users' inflated assessments of on-line catalogue search performance and usefulness: an experimental study. Information Research, 6(2) Available at: http://InformationR.net/ir/6-2/paper101.html

 

Hjørland, B. (1977). Evaluering af Informationsgenfindingssystemer. Pp. 83-108 IN: Informationsvidenskab. 1978 kompendium for sektion II/ 2.del. Ved Birger Hjørland. København: Danmarks Biblioteksskole, 1977. 150 p. (unpublished compendium).
 

Klawiter-Pommer, J. H. T. & Hoffmann, W. D. (1976). Übersicht über die für den Leistungsvergleich mehrerer Literatur-Datenbasen wichtigsten Parameter: unique relevant references, recall, precision, miss-ratio, noise-ratio, fall-out-ratio, novelty, extension-ratio, serendipity, insufficiency. Nachrichten für Dokumentation, 27(3), 103-108.
 

Lancaster, F. W. (1968). Evaluation of the MEDLARS Demand Search Service. Washington: National Library of Medicine.
 

Lancaster, F. W. (1979). Information Retrieval Systems: characteristics, testing and evaluation. 2.ed., New York: Wiley-Interscience.
 

Lancaster, F. W. (1991). "Consistency of Indexing" & "Quality of Indexing" (pp. 60-73 + 74-85 IN: Lancaster, F. W.: Indexing and Abstracting in Theory and Practice. London: Library Association).

 

Saracevic, T. & Kantor, P. (1988). A study of information seeking and retrieving. II. Users, questions, and effectiveness. Journal of the American Society for Information Science, 39(3), 177-196.

 

Van Rijsbergen, C. J. (1979). Information Retrieval. 2nd edition. London: Butterworths. Available at: http://www.dcs.gla.ac.uk/Keith/Preface.html

 

Voorhees, E. M. (2005). Overview of TREC 2004. IN: Voorhees, E. & Buckland, L. (Eds.) Proceedings of the 13th Text Retrieval Conference (TREC 2004), November 16-19, 2004. Gaithersburg, MD.

 

Warner, J. (1992). Retrieval performance tests in relation to online bibliographic searching. In D. Shaw (Ed.), ASIS ‘92: Celebrating Change: Information Management on the Move (Proceedings of the 55th ASIS Annual Meeting, Pittsburgh, PA, 26-29 October 1992) (pp.231-241). Medford, NJ: Learned Information for the American Society for Information Science.

 


See also: Cranfield experiments; Evaluation in Knowledge Organization; Information systems evaluation; TREC

 

 

 

Birger Hjørland

Last edited: 14-10-2006


 

to be edited:

Alternatives to the traditional measures *"recall" and *"precision" have often been proposed, but these two measures have proved very durable. Some researchers thus work with the measure fallout, a measure of retrieval performance based on the ratio between the number of nonrelevant records retrieved and the total number of nonrelevant records in the database. A core problem in evaluation is connected to the concept of *relevance.

The first experiments in the field, the so-called "Cranfield" experiments (Cleverdon, 1967), resulted in the fatalistic view that within an IR system there is an unavoidable trade-off between "recall" and "precision", such that improving one parameter simultaneously worsens the other. This hypothesis was expressed as a law of information retrieval: "the law of the inverse relationship between recall and precision".

More recent studies (see, e.g., Saracevic & Kantor, 1988) have not confirmed this "law", but have instead found a positive correlation between higher "recall" and higher "precision".

Klawiter-Pommer (1976) emphasizes that databases should provide users with reliable, complete, clear and timely information. In the evaluation of databases, the ability to produce unique relevant references, recall and precision are highlighted as the three most important measures. In addition, other properties such as *serendipity are discussed.