The Wilcoxon rank sum test, also known as the Mann-Whitney U test, does not make any assumption about the statistical distribution of words in a corpus (Wilcoxon 1945, Mann & Whitney 1947). It compares the sums of the rank orders of the texts in two text collections. The texts are ranked by the frequency of a target word, regardless of which of the two corpora each text belongs to (see Lijffijt 2014). In our implementation, the frequencies are summed per document segment; for this reason, we consider it to be a dispersion-based rather than a frequency-based measure.
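The ranking step described above can be illustrated with a minimal sketch in Python. It is not the implementation referred to in this section: the function name `rank_sum_u` and the example frequencies are hypothetical, and each input list is assumed to hold the frequency of one target word in each segment of one corpus. Tied frequencies receive the average of the ranks they span.

```python
from itertools import groupby


def rank_sum_u(target_freqs, comparison_freqs):
    """Return (U_target, U_comparison) for two lists of per-segment frequencies.

    Hypothetical helper for illustration only.
    """
    # Pool all segments, remembering which corpus each one came from
    # (0 = target corpus, 1 = comparison corpus), and sort by frequency.
    pooled = [(f, 0) for f in target_freqs] + [(f, 1) for f in comparison_freqs]
    pooled.sort(key=lambda pair: pair[0])

    rank_sums = [0.0, 0.0]
    position = 0  # number of segments already ranked
    for _, group in groupby(pooled, key=lambda pair: pair[0]):
        tied = list(group)
        # Segments with equal frequency share the mean of their ranks.
        mid_rank = position + (len(tied) + 1) / 2
        for _, corpus in tied:
            rank_sums[corpus] += mid_rank
        position += len(tied)

    n1, n2 = len(target_freqs), len(comparison_freqs)
    # Subtract the smallest possible rank sum for each group to obtain U.
    u_target = rank_sums[0] - n1 * (n1 + 1) / 2
    u_comparison = rank_sums[1] - n2 * (n2 + 1) / 2
    return u_target, u_comparison


# Made-up frequencies of one word in four segments of each corpus.
u_target, u_comparison = rank_sum_u([5, 7, 6, 9], [1, 2, 2, 4])
print(u_target, u_comparison)
```

In this made-up example every target segment has a higher frequency than every comparison segment, so the U value of the target corpus reaches its maximum (the product of the two group sizes) and that of the comparison corpus is zero. To obtain a p-value in addition to the statistic, a library routine such as `scipy.stats.mannwhitneyu` could be used instead of this hand-written helper.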
Lijffijt, Jefrey, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila, ‘Significance Testing of Word Frequencies in Corpora’, Digital Scholarship in the Humanities, 31.2 (2014), 374–97 <https://doi.org/10.1093/llc/fqu064>
Paquot, Magali, and Yves Bestgen, ‘Distinctive Words in Academic Writing: A Comparison of Three Statistical Tests for Keyword Extraction’, in Corpora: Pragmatics and Discourse, ed. by Andreas H. Jucker, Daniel Schreier, and Marianne Hundt (Brill | Rodopi, 2009) <https://doi.org/10.1163/9789042029101_014>
Woolson, R. F., ‘Wilcoxon Signed-Rank Test’, in Wiley Encyclopedia of Clinical Trials, ed. by Ralph B. D’Agostino, Lisa Sullivan, and Joseph Massaro (Hoboken, NJ, USA: John Wiley & Sons, Inc., 2008), p. eoct979 <https://doi.org/10.1002/9780471462422.eoct979>
Zimmerman, Donald W., and Bruno D. Zumbo, ‘Relative Power of the Wilcoxon Test, the Friedman Test, and Repeated-Measures ANOVA on Ranks’, The Journal of Experimental Education, 62.1 (1993), 75–86 <https://doi.org/10.1080/00220973.1993.9943832>
Mann, H. B., and D. R. Whitney, ‘On a Test of Whether One of Two Random Variables Is Stochastically Larger than the Other’, The Annals of Mathematical Statistics, 18.1 (1947), 50–60 <https://doi.org/10.1214/aoms/1177730491>