The chi-squared test and the log-likelihood ratio test are somewhat more complex statistical tests of distribution, each grounded in a hypothesis test. These measures are frequently used in CL and are implemented in several corpus analysis tools, such as WordSmith Tools (Scott 1997), Wmatrix (Rayson et al. 2009), and AntConc (Anthony 2005). One problem with these measures is that p-values are generally very low, because two text corpora can never be identical. The more important problem, however, is that they are designed to compare statistically independent events and therefore treat corpora as a ‘bag of words’. These tests use only the total number of words in the corpus and do not take into account the uneven distribution of words within a corpus (Lijffijt 2014).
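
To make the two statistics concrete, the following minimal Python sketch computes Pearson's chi-squared and the log-likelihood ratio (G²) for a single word from a 2×2 contingency table (the word versus all other words, in two corpora). The names `keyness_statistics` and `p_value_df1` and the counts in the usage lines are illustrative assumptions; they are not taken from any of the tools named above, which may differ in details such as continuity corrections.

```python
import math


def keyness_statistics(freq_a, size_a, freq_b, size_b):
    """Return (chi2, g2) for one word compared across two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total number of word tokens in corpus A and corpus B.
    """
    # 2x2 table: rows are corpora, columns are "the word" / "all other words".
    observed = [
        [freq_a, size_a - freq_a],
        [freq_b, size_b - freq_b],
    ]
    total = size_a + size_b
    row_sums = [size_a, size_b]
    col_sums = [freq_a + freq_b, total - freq_a - freq_b]
    if min(row_sums) <= 0 or min(col_sums) <= 0:
        raise ValueError("every row and column of the table must be non-empty")

    chi2 = 0.0
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the null hypothesis of equal relative frequency.
            expected = row_sums[i] * col_sums[j] / total
            o = observed[i][j]
            chi2 += (o - expected) ** 2 / expected
            if o > 0:  # a cell with zero observations contributes nothing to G2
                g2 += 2 * o * math.log(o / expected)
    return chi2, g2


def p_value_df1(statistic):
    """Upper-tail p-value of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical counts: 150 hits in 100,000 tokens versus 50 hits in 100,000 tokens.
chi2, g2 = keyness_statistics(150, 100_000, 50, 100_000)
print(chi2, g2, p_value_df1(chi2), p_value_df1(g2))
```

Because the function receives only the total frequency of the word and the corpus sizes, a word concentrated in a single text gets exactly the same score as one spread evenly across all texts. This is the bag-of-words limitation described above.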

Bibliography

McGillivray, Barbara, and Gábor Mihály Tóth, ‘Frequency’, in Applying Language Technology in Humanities Research: Design, Application, and the Underlying Logic, ed. by Barbara McGillivray and Gábor Mihály Tóth (Springer International Publishing, 2020), pp. 35–46, doi:10.1007/978-3-030-46493-6_3
Froehlich, Heather, ‘Corpus Analysis with Antconc’, Programming Historian, 2015 <https://programminghistorian.org/en/lessons/corpus-analysis-with-antconc> [accessed 15 February 2021]
Savoy, Jacques, ‘Comparative Evaluation of Term Selection Functions for Authorship Attribution’, Literary and Linguistic Computing, 30.2 (2015), pp. 246–61, http://doi.org/10.1093/llc/fqt047
Gries, Stefan Th., ‘The Most Under-Used Statistical Method in Corpus Linguistics: Multi-Level (and Mixed-Effects) Models’, Corpora, 10.1 (2015), pp. 95–125, http://doi.org/10.3366/cor.2015.0068
Bestgen, Yves, ‘Inadequacy of the Chi-Squared Test to Examine Vocabulary Differences between Corpora’, Literary and Linguistic Computing, 29.2 (2014), pp. 164–70, http://doi.org/10.1093/llc/fqt020
Lijffijt, Jefrey, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila, ‘Significance Testing of Word Frequencies in Corpora’, Digital Scholarship in the Humanities, 31.2 (2014), pp. 374–97, http://doi.org/10.1093/llc/fqu064
Parsons, Kathryn, Agata McCormac, and Marcus Butavicius, Human Dimensions of Corpora Comparison: An Analysis of Kilgarriff’s (2001) Approach (Defence Science and Technology Organisation Edinburgh (Australia), Command Control Communications and Intelligence Div, April 2009) <https://apps.dtic.mil/docs/citations/ADA506585> [accessed 17 September 2019]
Lüdeling, Anke, and Merja Kytö, eds., ‘Statistical Methods for Corpus Exploitation’, in Handbooks of Linguistics and Communication Science (Mouton de Gruyter, 2009), doi:10.1515/9783110213881.2.777
Oakes, Michael P., and Malcolm Farrow, ‘Use of the Chi-Squared Test to Examine Vocabulary Differences in English Language Corpora Representing Seven Different Countries’, Literary and Linguistic Computing, 22.1 (2007), pp. 85–99, http://doi.org/10.1093/llc/fql044
Anthony, Laurence, ‘AntConc: Design and Development of a Freeware Corpus Analysis Toolkit for the Technical Writing Classroom’, 2005, pp. 729–37, http://doi.org/10.1109/IPCC.2005.1494244
Lancaster, H. O., and E. Seneta, ‘Chi-Square Distribution’, in Encyclopedia of Biostatistics, ed. by Peter Armitage and Theodore Colton (John Wiley & Sons, Ltd, 2005), p. b2a15018, doi:10.1002/0470011815.b2a15018
Rayson, Paul, ‘Wmatrix: A Web-Based Corpus Processing Environment.’ (Computing Department, Lancaster University, 2005)
Cavaglià, Gabriela, ‘Measuring Corpus Homogeneity Using a Range of Measures for Inter-Document Distance’, ResearchGate, 2002 <https://www.researchgate.net/publication/267784878_ITRI-02-08_Measuring_corpus_homogeneity_using_a_range_of_measures_for_inter-document_distance_Measuring_corpus_homogeneity_using_a_range_of_measures_for_inter-document_distance> [accessed 17 September 2019]
Kilgarriff, Adam, ‘Comparing Corpora’, International Journal of Corpus Linguistics, 6.1 (2001), pp. 97–133, http://doi.org/10.1075/ijcl.6.1.05kil
Scott, Mike, ‘PC Analysis of Key Words and Key Key Words’, System, 25.2 (1997), pp. 233–45, http://doi.org/10.1016/S0346-251X(97)00011-0
Kilgarriff, Adam, ‘Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora’, in Fifth Workshop on Very Large Corpora, 1997 <https://www.aclweb.org/anthology/W97-0122> [accessed 6 September 2019]
Cressie, Noel A. C., and Timothy R. C. Read, ‘Pearson’s X² and the Loglikelihood Ratio Statistic G²: A Comparative Review’, 1989, http://doi.org/10.2307/1403582
Plackett, R. L., ‘Karl Pearson and the Chi-Squared Test’, International Statistical Review / Revue Internationale de Statistique, 51.1 (1983), p. 59, http://doi.org/10.2307/1402731
Brinegar, Claude S., ‘Mark Twain and the Quintus Curtius Snodgrass Letters: A Statistical Test of Authorship’, Journal of the American Statistical Association, 58.301 (1963), pp. 85–96, http://doi.org/10.1080/01621459.1963.10500834