The chi-squared test and the log-likelihood ratio test are somewhat more sophisticated measures, since both rest on an underlying hypothesis test. They are widely used in corpus linguistics (CL) and implemented in several corpus analysis tools, such as WordSmith Tools (Scott 1997), Wmatrix (Rayson et al. 2009), and AntConc (Anthony 2005). One problem with these measures is that p-values tend to be very low across the board, because the null hypothesis that two text corpora have identical word frequencies is practically never true, and with large corpora even small differences come out as significant. The more important problem, however, is that the tests are designed for statistically independent events and therefore treat a corpus as a bag of words. They use only the total number of words in the corpus and do not take into account that words are unevenly distributed within a corpus (Lijffijt et al. 2014).
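To make the bag-of-words point concrete, the following is a minimal sketch of how both statistics can be computed for a single word from a 2×2 contingency table. The function names and the word counts are hypothetical and chosen only for illustration; this is not the implementation used by WordSmith Tools, Wmatrix, or AntConc. The sketch assumes the word occurs at least once in one of the corpora and relies on the fact that, for one degree of freedom, the chi-squared tail probability equals erfc(sqrt(x/2)).

```python
import math


def contingency(freq_a, size_a, freq_b, size_b):
    """2x2 table: rows are corpora, columns are (target word, all other words)."""
    return [[freq_a, size_a - freq_a], [freq_b, size_b - freq_b]]


def expected(table):
    """Expected counts if the word were equally frequent in both corpora."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    return [[rows[i] * cols[j] / total for j in range(2)] for i in range(2)]


def chi_squared(table):
    """Pearson's chi-squared statistic over the four cells."""
    exp = expected(table)
    return sum(
        (table[i][j] - exp[i][j]) ** 2 / exp[i][j]
        for i in range(2)
        for j in range(2)
    )


def log_likelihood(table):
    """Log-likelihood ratio statistic (G2); cells with a zero count contribute 0."""
    exp = expected(table)
    return 2 * sum(
        table[i][j] * math.log(table[i][j] / exp[i][j])
        for i in range(2)
        for j in range(2)
        if table[i][j] > 0
    )


def p_value_df1(statistic):
    """Tail probability of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical counts: a word occurring 120 times in one 100,000-word corpus
# and 80 times in another corpus of the same size.
small = contingency(120, 100_000, 80, 100_000)
# The same relative frequencies in corpora one hundred times larger.
large = contingency(12_000, 10_000_000, 8_000, 10_000_000)

for label, table in (("small", small), ("large", large)):
    g2 = log_likelihood(table)
    print(label, chi_squared(table), g2, p_value_df1(g2))
```

Only the word's total frequency and the corpus sizes enter the calculation, so the result is the same whether the occurrences are spread evenly across all texts or concentrated in a single one. Both statistics also grow in proportion to corpus size when the relative frequencies are held fixed, which is one way to see why p-values become very small for large corpora.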


Peters, Christine, ‘Text Mining, Travel Writing, and the Semantics of the Global. An AntConc Analysis of Alexander von Humboldt’s Reise in Die Aequinoktial-Gegenden Des Neuen Kontinents’, in Digital Methods in the Humanities: Challenges, Ideas, Perspectives (Bielefeld: Bielefeld University Press, 2021), pp. 185–215
Stefanowitsch, Anatol, ‘Text [Keyword Analysis]’, in Corpus Linguistics: A Guide to the Methodology, Textbooks in Language Sciences, 7 (LangSci Press, 2020), pp. 353–96
Pojanapunya, Punjaporn, and Richard Watson Todd, ‘Log-Likelihood and Odds Ratio: Keyness Statistics for Different Purposes of Keyword Analysis’, Corpus Linguistics and Linguistic Theory, 14.1 (2018), 133–67
Froehlich, Heather, ‘Corpus Analysis with Antconc’, Programming Historian, 2015 [accessed 15 February 2021]
Lijffijt, Jefrey, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila, ‘Significance Testing of Word Frequencies in Corpora’, Digital Scholarship in the Humanities, 31.2 (2014), 374–97
Brezina, Vaclav, and Miriam Meyerhoff, ‘Significant or Random?: A Critical Review of Sociolinguistic Generalisations Based on Large Corpora’, International Journal of Corpus Linguistics, 19.1 (2014), 1–28
Paquot, Magali, and Yves Bestgen, ‘Distinctive Words in Academic Writing: A Comparison of Three Statistical Tests for Keyword Extraction’, in Corpora: Pragmatics and Discourse, ed. by Andreas H. Jucker, Daniel Schreier, and Marianne Hundt (Brill | Rodopi, 2009)
Chen, Francine R., Thorsten H. Brants, and Annie E. Zaenen, ‘Systems and Methods for Sentence Based Interactive Topic-Based Text Summarization’, 2008 [accessed 17 September 2019]
Rayson, Paul, ‘From Key Words to Key Semantic Domains’, International Journal of Corpus Linguistics, 13.4 (2008), 519–49
Anthony, Laurence, ‘AntConc: Design and Development of a Freeware Corpus Analysis Toolkit for the Technical Writing Classroom’, 2005, pp. 729–37
Rayson, Paul, ‘Wmatrix: A Web-Based Corpus Processing Environment’ (Lancaster, UK: Computing Department, Lancaster University, 2005)
Kilgarriff, Adam, ‘Language Is Never, Ever, Ever, Random’, Corpus Linguistics and Linguistic Theory, 1.2 (2005), 263–76
Gamon, Michael, ‘Linguistic Correlates of Style: Authorship Classification with Deep Linguistic Analysis Features’, in Proceedings of the 20th International Conference on Computational Linguistics, COLING ’04 (Stroudsburg, PA, USA: Association for Computational Linguistics, 2004)
Rayson, Paul, and Roger Garside, ‘Comparing Corpora Using Frequency Profiling’, in Proceedings of the Workshop on Comparing Corpora - Volume 9, WCC ’00 (Stroudsburg, PA, USA: Association for Computational Linguistics, 2000), pp. 1–6
Scott, Mike, ‘PC Analysis of Key Words and Key Key Words’, System, 25.2 (1997), 233–45
Dunning, Ted, ‘Accurate Methods for the Statistics of Surprise and Coincidence’, Computational Linguistics, 19.1 (1993), 14
Cressie, Noel A. C., and Timothy R. C. Read, ‘Pearsons-X2 and the Loglikelihood Ratio Statistic-G2: A Comparative Review’, 1989
Woolf, Barnet, ‘The Log-Likelihood Ratio Test (the G-Test)’, Annals of Human Genetics, 21.4 (1957), 397–409