TF-IDF (Term Frequency-Inverse Document Frequency) goes back to Luhn (1957), who proposed weighting terms by their frequency, and was refined by Spärck Jones (1972), who added the inverse document frequency component. It indicates how important a word is to a document within a collection of texts: a term scores high when it occurs often in that document but in few other documents of the collection. Today there are many variants and applications of the tf-idf measure. A prominent example is the TfidfVectorizer in the Python library scikit-learn ("sklearn"), which offers many useful parameters. The tf-idf measure implemented in our framework is based on this implementation.
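To make the weighting concrete, the following is a minimal pure-Python sketch, not the framework's actual code. It assumes the smoothed idf, ln((1 + n) / (1 + df)) + 1, and the L2 normalisation that scikit-learn documents as defaults for its TfidfVectorizer. The function name `tfidf`, the sample sentences, and the whitespace tokenisation are illustrative assumptions; the tokenisation in particular is simpler than scikit-learn's.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Simplifying assumption: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        raw = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalise each document vector; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weights.append({t: w / norm for t, w in raw.items()})
    return weights


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
weights = tfidf(docs)
```

In the first sample sentence, "mat" occurs in only one document and "cat" in two, so "mat" receives the higher weight although both appear once. Because this variant uses raw counts, a word that is repeated within a document, such as "the", can still outweigh a rarer word.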

Bibliography

Havrlant, Lukáš, and Vladik Kreinovich, ‘A Simple Probabilistic Explanation of Term Frequency-Inverse Document Frequency (Tf-Idf) Heuristic (and Variations Motivated by This Explanation)’, International Journal of General Systems, 46.1 (2017), 27–36 <https://doi.org/10.1080/03081079.2017.1291635>
Chen, Kewen, Zuping Zhang, Jun Long, and Hao Zhang, ‘Turning from TF-IDF to TF-IGM for Term Weighting in Text Classification’, Expert Systems with Applications, 66 (2016), 245–60 <https://doi.org/10.1016/j.eswa.2016.09.009>
Albitar, Shereen, Sébastien Fournier, and Bernard Espinasse, ‘An Effective TF/IDF-Based Text-to-Text Semantic Similarity Measure for Text Classification’, in Web Information Systems Engineering – WISE 2014, ed. by Boualem Benatallah, Azer Bestavros, Yannis Manolopoulos, Athena Vakali, and Yanchun Zhang (Cham: Springer International Publishing, 2014), 105–14 <https://doi.org/10.1007/978-3-319-11749-2_8>
Zhang, Wen, Taketoshi Yoshida, and Xijin Tang, ‘A Comparative Study of TF*IDF, LSI and Multi-Words for Text Classification’, Expert Systems with Applications, 38.3 (2011), 2758–65 <https://doi.org/10.1016/j.eswa.2010.08.066>
Wu, Ho Chung, Robert Wing Pong Luk, Kam Fai Wong, and Kui Lam Kwok, ‘Interpreting TF-IDF Term Weights as Making Relevance Decisions’, ACM Transactions on Information Systems, 26.3 (2008), 1–37 <https://doi.org/10.1145/1361684.1361686>
Achananuparp, Palakorn, Xiaohua Hu, and Xiajiong Shen, ‘The Evaluation of Sentence Similarity Measures’, in Data Warehousing and Knowledge Discovery, ed. by Il-Yeol Song, Johann Eder, and Tho Manh Nguyen (Berlin, Heidelberg: Springer Berlin Heidelberg, 2008), vol. 5182, 305–16 <https://doi.org/10.1007/978-3-540-85836-2_29>
Yun-tao, Zhang, Gong Ling, and Wang Yong-cheng, ‘An Improved TF-IDF Approach for Text Classification’, Journal of Zhejiang University-SCIENCE A, 6.1 (2005), 49–55 <https://doi.org/10.1007/BF02842477>
Spärck Jones, Karen, ‘IDF Term Weighting and IR Research Lessons’, Journal of Documentation, 2004 <https://doi.org/10.1108/00220410410560591>
Robertson, Stephen, ‘Understanding Inverse Document Frequency: On Theoretical Arguments for IDF’, Journal of Documentation, 60.5 (2004), 503–20 <https://doi.org/10.1108/00220410410560582>
Ramos, Juan Enrique, ‘Using TF-IDF to Determine Word Relevance in Document Queries’, 2003
Church, Kenneth, and William Gale, ‘Inverse Document Frequency (IDF): A Measure of Deviations from Poisson’, in Third Workshop on Very Large Corpora, 1995 <https://www.aclweb.org/anthology/W95-0110> [accessed 12 June 2020]
Robertson, S. E., and S. Walker, ‘Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval’, in SIGIR ’94, ed. by W. Bruce Croft and C. J. van Rijsbergen (Springer London, 1994), 232–41
Spärck Jones, Karen, ‘A Statistical Interpretation of Term Specificity and Its Application in Retrieval’, Journal of Documentation, 28 (1972), 11–21
Luhn, Hans Peter, ‘A Statistical Approach to Mechanized Encoding and Searching of Literary Information’, IBM Journal of Research and Development, 1.4 (1957), 309–17