One of the main goals of our project is to reach a deeper understanding of the statistical measures that have been introduced and adopted for analyzing large amounts of textual data from a contrastive perspective. They are usually referred to as ‘keyness measures’, as they operate on the lexical level and are used to extract “key” terms or phrases. We prefer the term ‘measures of distinctiveness’, as it better emphasizes that this kind of analysis extracts characteristic words on the basis of a comparison.
We want to share our knowledge with everyone interested in comparative analysis. On this page you will find an overview of the measures of distinctiveness implemented in our framework, information about their key statistical characteristics, and useful references.
Name | Type of measure | References | Evaluated in |
TF-IDF | Dispersion-based | Luhn 1957, Spärck Jones 1972 | Salton & Buckley 1988 |
Ratio of relative frequencies | Frequency-based | Damerau 1993 | Gries 2010 |
Chi-squared test | Frequency-based | Dunning 1993 | Lijffijt et al. 2014 |
Log-likelihood ratio test | Frequency-based | Dunning 1993 | Egbert & Biber 2019, Paquot & Bestgen 2009, Lijffijt et al. 2014 |
Welch’s t-test | Distribution-based | Welch 1947 | Paquot & Bestgen 2009 (t-test), Lijffijt et al. 2014 |
Wilcoxon rank sum test | Dispersion-based (to some extent) | Wilcoxon 1945, Mann & Whitney 1947 | Paquot & Bestgen 2009, Lijffijt et al. 2014 |
Burrows’ Zeta | Dispersion-based | Burrows 2007, Craig & Kinney 2009 | Schöch et al. 2018 |
Logarithmic Zeta | Dispersion-based | Schöch et al. 2018 | Schöch et al. 2021, Du et al. 2021 |
Eta | Dispersion-based | Du et al. 2021, based on Gries 2008 | Du et al. 2021 |
Overview of the measures of distinctiveness implemented in our framework
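
To make the difference between frequency-based and dispersion-based measures more concrete, the following minimal Python sketch computes three of the measures listed above for a single word: the ratio of relative frequencies, the log-likelihood ratio (G2) on a 2x2 contingency table, and Zeta in its common formulation as the difference between the proportions of text segments that contain the word. This is an illustration under stated assumptions, not code taken from our framework: the function names, the input format (raw counts, and segments represented as sets of tokens) and the toy data are all hypothetical.

```python
import math


def relative_frequency_ratio(count_target, size_target, count_comparison, size_comparison):
    """Relative frequency in the target corpus divided by that in the comparison corpus."""
    if count_comparison == 0:
        # Assumption of this sketch: no smoothing is applied, so the ratio is
        # unbounded (or undefined) when the word is absent from the comparison corpus.
        return math.inf if count_target > 0 else math.nan
    rel_target = count_target / size_target
    rel_comparison = count_comparison / size_comparison
    return rel_target / rel_comparison


def log_likelihood_ratio(count_target, size_target, count_comparison, size_comparison):
    """G2 statistic for the 2x2 table (word vs. all other tokens, target vs. comparison)."""
    observed = [
        [count_target, count_comparison],
        [size_target - count_target, size_comparison - count_comparison],
    ]
    column_totals = [size_target, size_comparison]
    row_totals = [sum(row) for row in observed]
    total = size_target + size_comparison

    g2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * column_totals[j] / total
            if obs > 0:  # empty cells contribute nothing to the sum
                g2 += obs * math.log(obs / expected)
    return 2 * g2


def zeta(word, target_segments, comparison_segments):
    """Difference between the shares of segments containing the word in each corpus."""
    def document_proportion(segments):
        return sum(word in segment for segment in segments) / len(segments)

    return document_proportion(target_segments) - document_proportion(comparison_segments)


# Invented toy data: each segment is the set of word types it contains.
target_segments = [{"whale", "sea"}, {"whale"}, {"ship"}, {"whale", "ship"}]
comparison_segments = [{"ship"}, {"sea"}, {"whale"}, {"land"}]

print(relative_frequency_ratio(30, 1000, 10, 1000))
print(log_likelihood_ratio(30, 1000, 10, 1000))
print(zeta("whale", target_segments, comparison_segments))
```

The first two functions only see total counts, so a word that is very frequent in a single text can still score highly. The Zeta sketch ignores how often the word occurs within a segment and only asks in how many segments it appears, which is what makes it sensitive to dispersion.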