Measures of distinctiveness – Zeta and Company

One of the main goals of our project is to reach a deeper understanding of the statistical measures that have been introduced and adopted for analyzing large amounts of textual data from a contrastive perspective. They are usually referred to as ‘keyness measures’, as they operate on the lexical level and are used for extracting “key” terms or phrases. We prefer the term ‘measures of distinctiveness’, as it better emphasizes that this kind of analysis extracts characteristic words on the basis of a comparison.

We want to share our knowledge with everyone interested in comparative analysis. On this page you will find an overview of the measures of distinctiveness implemented in our framework, information about their key statistical characteristics, and useful references.

Name	Type of measure	References	Evaluated in
TF-IDF	Dispersion-based	Luhn 1957, Spärck Jones 1972	Salton & Buckley 1988
Ratio of relative frequencies	Frequency-based	Damerau 1993	Gries 2010
Chi-squared test	Frequency-based	Dunning 1993	Lijffijt et al. 2014
Log-likelihood ratio test	Frequency-based	Dunning 1993	Egbert & Biber 2019, Paquot & Bestgen 2009, Lijffijt et al. 2014
Welch’s t-test	Distribution-based	Welch 1947	Paquot & Bestgen 2009 (t-test), Lijffijt et al. 2014
Wilcoxon rank sum test	Dispersion-based (to some extent)	Wilcoxon 1945, Mann & Whitney 1947	Paquot & Bestgen 2009, Lijffijt et al. 2014
Burrows’ Zeta	Dispersion-based	Burrows 2007, Craig & Kinney 2009	Schöch et al. 2018
Logarithmic Zeta	Dispersion-based	Schöch et al. 2018	Schöch et al. 2021, Du et al. 2021
Eta	Dispersion-based	Du et al. 2021, based on Gries 2008	Du et al. 2021

Overview of the measures of distinctiveness implemented in our framework
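
To make the difference between the frequency-based and dispersion-based types in the table more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used in our framework: Zeta is computed in its commonly used form as the difference between the proportions of segments containing a word in the target and comparison corpora, and the ratio of relative frequencies is computed without smoothing. All function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import List

Segment = List[str]  # one text segment, represented as a list of tokens


def document_proportion(word: str, segments: List[Segment]) -> float:
    """Share of segments in which the word occurs at least once."""
    if not segments:
        raise ValueError("need at least one segment")
    return sum(1 for seg in segments if word in seg) / len(segments)


def zeta(word: str, target: List[Segment], comparison: List[Segment]) -> float:
    """Dispersion-based: difference of document proportions (range -1 to 1).

    Assumption: this is the simple, non-logarithmic variant.
    """
    return document_proportion(word, target) - document_proportion(word, comparison)


def relative_frequency(word: str, segments: List[Segment]) -> float:
    """Occurrences of the word divided by the total number of tokens."""
    counts = Counter(tok for seg in segments for tok in seg)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("corpus contains no tokens")
    return counts[word] / total


def relative_frequency_ratio(
    word: str, target: List[Segment], comparison: List[Segment]
) -> float:
    """Frequency-based: relative frequency in target over comparison.

    Simplification: returns infinity if the word is absent from the
    comparison corpus; real implementations usually apply smoothing.
    """
    denominator = relative_frequency(word, comparison)
    if denominator == 0:
        return float("inf")
    return relative_frequency(word, target) / denominator


# Toy corpora (invented for illustration), three segments each.
target_corpus = [
    ["the", "sea", "was", "calm"],
    ["the", "sea", "and", "the", "sky"],
    ["a", "ship", "at", "sea"],
]
comparison_corpus = [
    ["the", "city", "was", "loud"],
    ["a", "street", "in", "the", "city"],
    ["the", "sea", "of", "people"],
]

for w in ["sea", "city", "the"]:
    print(
        w,
        round(zeta(w, target_corpus, comparison_corpus), 3),
        round(relative_frequency_ratio(w, target_corpus, comparison_corpus), 3),
    )
```

In this toy example, "sea" occurs in all three target segments but in only one of the three comparison segments, so its Zeta score is 2/3. Zeta only registers whether a word is present in a segment, whereas the ratio of relative frequencies depends on the raw counts and is therefore sensitive to a word being repeated many times within a single segment.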