One of the main goals of our project is to reach a deeper understanding of the statistical measures that have been introduced and adopted for analyzing large amounts of textual data from a contrastive perspective. They are usually referred to as ‘keyness measures’, as they operate on the lexical level and are used for extracting “key” terms or phrases. We prefer the term ‘measures of distinctiveness’, as it better emphasizes that this kind of analysis extracts characteristic words on the basis of a comparison.

We want to share our knowledge with everyone interested in comparative analysis. On this page you will find an overview of the measures of distinctiveness implemented in our framework, information about their key statistical characteristics, and useful references.

| Name | Type of measure | References | Evaluated in |
|---|---|---|---|
| TF-IDF | Dispersion-based | Luhn 1957, Spärck Jones 1972 | Salton & Buckley 1988 |
| Ratio of relative frequencies | Frequency-based | Damerau 1993 | Gries 2010 |
| Chi-squared test | Frequency-based | Dunning 1993 | Lijffijt et al. 2014 |
| Log-likelihood ratio test | Frequency-based | Dunning 1993 | Egbert & Biber 2019, Paquot & Bestgen 2009, Lijffijt et al. 2014 |
| Welch’s t-test | Distribution-based | Welch 1947 | Paquot & Bestgen 2009 (t-test), Lijffijt et al. 2014 |
| Wilcoxon rank sum test | Dispersion-based (to some extent) | Wilcoxon 1945, Mann & Whitney 1947 | Paquot & Bestgen 2009, Lijffijt et al. 2014 |
| Burrows’ Zeta | Dispersion-based | Burrows 2007, Craig & Kinney 2009 | Schöch et al. 2018 |
| Logarithmic Zeta | Dispersion-based | Schöch et al. 2018 | Schöch et al. 2021, Du et al. 2021 |
| Eta | Dispersion-based | Du et al. 2021, based on Gries 2008 | Du et al. 2021 |

Overview of the measures of distinctiveness implemented in our framework
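
To make the difference between a frequency-based and a dispersion-based measure more concrete, the following minimal Python sketch computes the ratio of relative frequencies and a simple Zeta score (document proportion in the target corpus minus document proportion in the comparison corpus). It is an illustration under stated assumptions, not the implementation used in our framework: the function names, the `smoothing` parameter, and the toy data are hypothetical, and each token list stands in for one equal-sized text segment.

```python
from collections import Counter


def relative_frequency_ratio(target_docs, comparison_docs, smoothing=1e-9):
    """Frequency-based: relative frequency in target / relative frequency in comparison.

    `smoothing` is a hypothetical small constant that avoids division by zero
    for words that do not occur in the comparison corpus.
    """
    target_counts = Counter(w for doc in target_docs for w in doc)
    comparison_counts = Counter(w for doc in comparison_docs for w in doc)
    target_total = sum(target_counts.values())
    comparison_total = sum(comparison_counts.values())
    vocab = set(target_counts) | set(comparison_counts)
    return {
        w: (target_counts[w] / target_total + smoothing)
        / (comparison_counts[w] / comparison_total + smoothing)
        for w in vocab
    }


def zeta(target_docs, comparison_docs):
    """Dispersion-based: share of target segments containing the word
    minus share of comparison segments containing it (range -1 to 1)."""

    def doc_proportions(docs):
        # Count each word at most once per segment.
        counts = Counter(w for doc in docs for w in set(doc))
        return {w: c / len(docs) for w, c in counts.items()}

    target_dp = doc_proportions(target_docs)
    comparison_dp = doc_proportions(comparison_docs)
    vocab = set(target_dp) | set(comparison_dp)
    return {w: target_dp.get(w, 0.0) - comparison_dp.get(w, 0.0) for w in vocab}


# Toy data: each inner list is one (already tokenized) segment.
target = [["sea", "ship", "the"], ["sea", "the", "wind"]]
comparison = [["the", "city", "street"], ["the", "city", "sea"]]

ratios = relative_frequency_ratio(target, comparison)
zetas = zeta(target, comparison)
for word in sorted(zetas, key=zetas.get, reverse=True):
    print(f"{word:8s} zeta={zetas[word]:+.2f} ratio={ratios[word]:.2f}")
```

In this sketch, a word scores highly on the ratio when it is much more frequent overall in the target corpus, whereas it scores highly on Zeta when it occurs in many target segments but few comparison segments, regardless of how often it appears within each segment.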