The chi-squared test and the log-likelihood ratio test are somewhat more sophisticated measures: both are statistical tests built on an underlying hypothesis test. They are widely used in CL and are implemented in several corpus analysis tools, such as WordSmith Tools (Scott 1997), Wmatrix (Rayson et al. 2009), and AntConc (Anthony 2005). One problem with these measures is that p-values tend to be very low across the board: no two text corpora are ever exactly alike, and with corpora of realistic size even small differences in frequency come out as significant. The more important problem, however, is that they are designed to compare statistically independent events and therefore treat a corpus as a bag of words. The tests use only the total number of words in the corpus and do not take into account that words are unevenly distributed across the texts within a corpus (Lijffijt et al. 2014).
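
As a rough illustration of the two tests and of the corpus-size effect described above, the following minimal sketch computes both statistics for a single word in two corpora, using the common 2×2 contingency-table formulation (target word versus all other words). All function names and all counts are hypothetical and chosen for illustration only; this is not necessarily how the tools named above implement the tests. The p-value assumes one degree of freedom, which holds for a 2×2 table.

```python
import math


def contingency_table(freq_a, size_a, freq_b, size_b):
    """Rows are corpora A and B; columns are (target word, all other words)."""
    return [[freq_a, size_a - freq_a], [freq_b, size_b - freq_b]]


def expected_counts(table):
    """Counts expected if the word were equally frequent in both corpora."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(table):
    """Pearson's chi-squared statistic: sum of (O - E)^2 / E over all cells."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))


def log_likelihood(table):
    """Log-likelihood ratio statistic G2: 2 * sum of O * ln(O / E)."""
    expected = expected_counts(table)
    return 2 * sum(o * math.log(o / e)
                   for obs_row, exp_row in zip(table, expected)
                   for o, e in zip(obs_row, exp_row)
                   if o > 0)  # a cell with zero observations contributes nothing


def p_value(statistic):
    """Upper-tail probability of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical counts: the same relative frequencies (0.12% vs 0.08%),
# once in two small corpora and once in two corpora 1,000 times larger.
small = contingency_table(12, 10_000, 8, 10_000)
large = contingency_table(12_000, 10_000_000, 8_000, 10_000_000)

for label, table in (("small corpora", small), ("large corpora", large)):
    x2 = chi_squared(table)
    g2 = log_likelihood(table)
    print(f"{label}: X2 = {x2:.3f}, G2 = {g2:.3f}, p(X2) = {p_value(x2):.3g}")
```

Because both statistics grow in proportion to corpus size when relative frequencies are held constant, the identical difference in relative frequency is non-significant in the small pair and overwhelmingly significant in the large pair. Note also that only the corpus totals enter the computation; how the occurrences are spread over individual texts is ignored, which is the bag-of-words limitation described above.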

References

McGillivray, Barbara, and Gábor Mihály Tóth, ‘Frequency’, in Applying Language Technology in Humanities Research: Design, Application, and the Underlying Logic, ed. by Barbara McGillivray and Gábor Mihály Tóth (Cham: Springer International Publishing, 2020), pp. 35–46 <https://doi.org/10.1007/978-3-030-46493-6_3>
Froehlich, Heather, ‘Corpus Analysis with Antconc’, Programming Historian, 2015 <https://programminghistorian.org/en/lessons/corpus-analysis-with-antconc> [accessed 15 February 2021]
Savoy, Jacques, ‘Comparative Evaluation of Term Selection Functions for Authorship Attribution’, Literary and Linguistic Computing, 30.2 (2015), 246–61 <https://doi.org/10.1093/llc/fqt047>
Gries, Stefan Th., ‘The Most Under-Used Statistical Method in Corpus Linguistics: Multi-Level (and Mixed-Effects) Models’, Corpora, 10.1 (2015), 95–125 <https://doi.org/10.3366/cor.2015.0068>
Bestgen, Yves, ‘Inadequacy of the Chi-Squared Test to Examine Vocabulary Differences between Corpora’, Literary and Linguistic Computing, 29.2 (2014), 164–70 <https://doi.org/10.1093/llc/fqt020>
Lijffijt, Jefrey, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila, ‘Significance Testing of Word Frequencies in Corpora’, Digital Scholarship in the Humanities, 31.2 (2014), 374–97 <https://doi.org/10.1093/llc/fqu064>
Parsons, Kathryn, Agata McCormac, and Marcus Butavicius, Human Dimensions of Corpora Comparison: An Analysis of Kilgarriff’s (2001) Approach (Edinburgh, Australia: Defence Science and Technology Organisation, Command, Control, Communications and Intelligence Division, April 2009) <https://apps.dtic.mil/docs/citations/ADA506585> [accessed 17 September 2019]
Lüdeling, Anke, and Merja Kytö, eds., ‘Statistical Methods for Corpus Exploitation’, in Handbooks of Linguistics and Communication Science (Berlin, New York: Mouton de Gruyter, 2009) <https://doi.org/10.1515/9783110213881.2.777>
Oakes, Michael P., and Malcolm Farrow, ‘Use of the Chi-Squared Test to Examine Vocabulary Differences in English Language Corpora Representing Seven Different Countries’, Literary and Linguistic Computing, 22.1 (2007), 85–99 <https://doi.org/10.1093/llc/fql044>
Anthony, Laurence, ‘AntConc: Design and Development of a Freeware Corpus Analysis Toolkit for the Technical Writing Classroom’, 2005, pp. 729–37 <https://doi.org/10.1109/IPCC.2005.1494244>
Lancaster, H. O., and E. Seneta, ‘Chi-Square Distribution’, in Encyclopedia of Biostatistics, ed. by Peter Armitage and Theodore Colton (Chichester, UK: John Wiley & Sons, Ltd, 2005), p. b2a15018 <https://doi.org/10.1002/0470011815.b2a15018>
Rayson, Paul, ‘Wmatrix: A Web-Based Corpus Processing Environment’ (Lancaster, UK: Computing Department, Lancaster University, 2005)
Cavaglià, Gabriela, ‘Measuring Corpus Homogeneity Using a Range of Measures for Inter-Document Distance’, ResearchGate, 2002 <https://www.researchgate.net/publication/267784878_ITRI-02-08_Measuring_corpus_homogeneity_using_a_range_of_measures_for_inter-document_distance_Measuring_corpus_homogeneity_using_a_range_of_measures_for_inter-document_distance> [accessed 17 September 2019]
Kilgarriff, Adam, ‘Comparing Corpora’, International Journal of Corpus Linguistics, 6.1 (2001), 97–133 <https://doi.org/10.1075/ijcl.6.1.05kil>
Scott, Mike, ‘PC Analysis of Key Words and Key Key Words’, System, 25.2 (1997), 233–45 <https://doi.org/10.1016/S0346-251X(97)00011-0>
Kilgarriff, Adam, ‘Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora’, in Fifth Workshop on Very Large Corpora, 1997 <https://www.aclweb.org/anthology/W97-0122> [accessed 6 September 2019]
Cressie, Noel A. C., and Timothy R. C. Read, ‘Pearson’s X² and the Loglikelihood Ratio Statistic G²: A Comparative Review’, 1989 <https://doi.org/10.2307/1403582>
Plackett, R. L., ‘Karl Pearson and the Chi-Squared Test’, International Statistical Review / Revue Internationale de Statistique, 51.1 (1983), 59 <https://doi.org/10.2307/1402731>
Brinegar, Claude S., ‘Mark Twain and the Quintus Curtius Snodgrass Letters: A Statistical Test of Authorship’, Journal of the American Statistical Association, 58.301 (1963), 85–96 <https://doi.org/10.1080/01621459.1963.10500834>