TF-IDF (term frequency-inverse document frequency) combines two ideas: term-frequency weighting, proposed by Luhn (1957), and inverse document frequency, introduced by Spärck Jones (1972). It weighs how important a word is to a document within a collection of texts: the weight grows with how often the word occurs in the document and shrinks with the number of documents in the collection that contain it. Today, there are many variants and applications of the tf-idf measure. One prominent example is the TfidfVectorizer class in the Python library scikit-learn (“sklearn”), which offers many useful parameters. The tf-idf measure implemented in our framework is based on this implementation.
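
To make the weighting concrete, the following is a minimal sketch in plain Python. It assumes the default settings documented for scikit-learn's TfidfVectorizer (smoothed idf, computed as ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each document vector); whether our framework uses exactly these defaults is an assumption. The function name `tfidf`, the whitespace tokenisation, and the toy corpus are purely illustrative and are not taken from the framework or from scikit-learn.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (illustrative helper)."""
    # Assumption: naive lowercase + whitespace tokenisation. scikit-learn's
    # default tokeniser behaves differently (e.g. it drops one-letter tokens).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        raw = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalise so that every non-empty document has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weights.append({t: w / norm for t, w in raw.items()})
    return weights


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
weights = tfidf(corpus)
```

In this toy corpus, "cat" and "sat" each occur once in the first document, but "cat" receives the higher weight because it appears in only one document while "sat" appears in two.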

References

Havrlant, Lukáš, and Vladik Kreinovich, ‘A Simple Probabilistic Explanation of Term Frequency-Inverse Document Frequency (Tf-Idf) Heuristic (and Variations Motivated by This Explanation)’, International Journal of General Systems, 46.1 (2017), 27–36 <https://doi.org/10.1080/03081079.2017.1291635>
Chen, Kewen, Zuping Zhang, Jun Long, and Hao Zhang, ‘Turning from TF-IDF to TF-IGM for Term Weighting in Text Classification’, Expert Systems with Applications, 66 (2016), 245–60 <https://doi.org/10.1016/j.eswa.2016.09.009>
Albitar, Shereen, Sébastien Fournier, and Bernard Espinasse, ‘An Effective TF/IDF-Based Text-to-Text Semantic Similarity Measure for Text Classification’, in Web Information Systems Engineering – WISE 2014, ed. by Boualem Benatallah, Azer Bestavros, Yannis Manolopoulos, Athena Vakali, and Yanchun Zhang (Cham: Springer International Publishing, 2014), 105–14 <https://doi.org/10.1007/978-3-319-11749-2_8>
Zhang, Wen, Taketoshi Yoshida, and Xijin Tang, ‘A Comparative Study of TF*IDF, LSI and Multi-Words for Text Classification’, Expert Systems with Applications, 38.3 (2011), 2758–65 <https://doi.org/10.1016/j.eswa.2010.08.066>
Wu, Ho Chung, Robert Wing Pong Luk, Kam Fai Wong, and Kui Lam Kwok, ‘Interpreting TF-IDF Term Weights as Making Relevance Decisions’, ACM Transactions on Information Systems, 26.3 (2008), 1–37 <https://doi.org/10.1145/1361684.1361686>
Achananuparp, Palakorn, Xiaohua Hu, and Xiajiong Shen, ‘The Evaluation of Sentence Similarity Measures’, in Data Warehousing and Knowledge Discovery, ed. by Il-Yeol Song, Johann Eder, and Tho Manh Nguyen (Berlin, Heidelberg: Springer Berlin Heidelberg, 2008), vol. 5182, 305–16 <https://doi.org/10.1007/978-3-540-85836-2_29>
Yun-tao, Zhang, Gong Ling, and Wang Yong-cheng, ‘An Improved TF-IDF Approach for Text Classification’, Journal of Zhejiang University-SCIENCE A, 6.1 (2005), 49–55 <https://doi.org/10.1007/BF02842477>
Spärck Jones, Karen, ‘IDF Term Weighting and IR Research Lessons’, Journal of Documentation, 2004 <https://doi.org/10.1108/00220410410560591>
Robertson, Stephen, ‘Understanding Inverse Document Frequency: On Theoretical Arguments for IDF’, Journal of Documentation, 60.5 (2004), 503–20 <https://doi.org/10.1108/00220410410560582>
Ramos, Juan Enrique, ‘Using TF-IDF to Determine Word Relevance in Document Queries’, 2003
Church, Kenneth, and William Gale, ‘Inverse Document Frequency (IDF): A Measure of Deviations from Poisson’, in Third Workshop on Very Large Corpora, 1995 <https://www.aclweb.org/anthology/W95-0110> [accessed 12 June 2020]
Robertson, S. E., and S. Walker, ‘Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval’, in SIGIR ’94, ed. by W. Bruce Croft and C. J. van Rijsbergen (Springer London, 1994), 232–41
Spärck Jones, Karen, ‘A Statistical Interpretation of Term Specificity and Its Application in Retrieval’, Journal of Documentation, 28 (1972), 11–21
Luhn, Hans Peter, ‘A Statistical Approach to Mechanized Encoding and Searching of Literary Information’, IBM Journal of Research and Development, 1.4 (1957), 309–17