The chi-squared test and the log-likelihood ratio test are somewhat more sophisticated statistical tests, each based on an underlying hypothesis test. These measures are widely used in CL and are implemented in several corpus analysis tools, such as WordSmith Tools (Scott 1997), Wmatrix (Rayson et al. 2009), and AntConc (Anthony 2005). One problem with these measures is that p-values tend to be very low across the board, because two text corpora can never be exactly equal. The more important problem, however, is that they are designed to compare statistically independent events and therefore treat a corpus as a bag of words. These tests use only the total number of words in the corpus and do not account for the uneven distribution of words within a corpus (Lijffijt 2014).
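To make the bag-of-words assumption concrete, the following minimal sketch computes both statistics for a single word from a 2x2 contingency table (occurrences of the word versus all other tokens, in each of two corpora). The function names and the example counts are hypothetical and chosen only for illustration; individual tools may use different variants or corrections of these formulas. Note that the only inputs are four totals, so any information about how the word is spread over the individual texts of a corpus is lost.

```python
import math


def contingency_tables(freq_a, size_a, freq_b, size_b):
    """Observed and expected 2x2 tables for one word in two corpora.

    Rows are corpora; columns are (the word, all other tokens).
    Hypothetical helper, not taken from any of the cited tools.
    """
    observed = [[freq_a, size_a - freq_a],
                [freq_b, size_b - freq_b]]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [freq_a + freq_b, total - freq_a - freq_b]
    # Expected counts under the null hypothesis of equal relative frequency.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    return observed, expected


def chi_squared(freq_a, size_a, freq_b, size_b):
    """Pearson chi-squared statistic (no continuity correction)."""
    observed, expected = contingency_tables(freq_a, size_a, freq_b, size_b)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood ratio statistic (G2) over all four cells."""
    observed, expected = contingency_tables(freq_a, size_a, freq_b, size_b)
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for o_row, e_row in zip(observed, expected)
                   for o, e in zip(o_row, e_row)
                   if o > 0)


def p_value_df1(statistic):
    """Upper-tail p-value of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


if __name__ == "__main__":
    # Illustrative counts: a word occurring 100 times in a 10,000-token corpus
    # and 50 times in another 10,000-token corpus.
    x2 = chi_squared(100, 10_000, 50, 10_000)
    g2 = log_likelihood(100, 10_000, 50, 10_000)
    print(f"chi-squared = {x2:.2f}, p = {p_value_df1(x2):.2g}")
    print(f"log-likelihood = {g2:.2f}, p = {p_value_df1(g2):.2g}")
```

Because both functions see only corpus-level totals, a word that occurs 100 times in a single text and a word that occurs once in each of 100 texts receive exactly the same score, which is the limitation described above.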


