The methodological and epistemological paradigm of comparison is deeply rooted in the Humanities. Whether in qualitative or quantitative research, comparison makes it possible to determine similarities and differences as well as affinities and contrasts; it thereby sharpens the analyst’s eye and adds specificity and meaningfulness to analyses. Against this backdrop, the purpose of the research described here is to enhance our understanding of, and propose improvements to, quantitative, comparative methods for analysing two or more collections of texts in the domain of Computational Literary Studies.
The focus will be on a key method in such comparative analyses: statistical measures of distinctiveness, which allow researchers to extract features (e.g. words or parts-of-speech) that are characteristic or ‘distinctive’ of a given group of texts when compared to another group of texts. In areas as diverse as Information Retrieval, Computational Linguistics and Computational Literary Studies, a wide range of such measures has been developed for the fundamental task of identifying distinctive features.
Three broad types of measures can be distinguished, each using quite distinct information for its calculations. The first type takes as input the observed relative frequencies of the features in each of the two groups of texts taken together and compares them (e.g. the log-likelihood test). The second type takes as input frequency distributions of the features built from each individual text in the two groups of texts (e.g. the t-test). The third type takes as input the dispersion of the features across the texts in each group, that is, it compares how evenly the features are distributed across each group of texts (e.g. Zeta).
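As a rough illustration of the three types, the following Python sketch computes one representative of each for a single feature on toy data. The function names, the toy texts and the chosen variants are assumptions made for illustration only: a two-cell log-likelihood ratio on pooled counts, Welch's t-test on per-text relative frequencies, and Zeta as a difference in document proportions computed on whole texts rather than on equal-sized segments. It is not the implementation developed in the project.

```python
import math
from statistics import mean, variance


def relative_frequency(text, feature):
    """Relative frequency of a feature in one tokenized text."""
    return text.count(feature) / len(text)


def log_likelihood(group_a, group_b, feature):
    """Type 1: compare pooled frequencies of the two groups (G2 statistic)."""
    count_a = sum(text.count(feature) for text in group_a)
    count_b = sum(text.count(feature) for text in group_b)
    size_a = sum(len(text) for text in group_a)
    size_b = sum(len(text) for text in group_b)
    total = count_a + count_b
    if total == 0:
        return 0.0
    # Expected counts if the feature were equally frequent in both groups.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def welch_t(group_a, group_b, feature):
    """Type 2: compare the per-text frequency distributions (Welch's t).

    Assumes at least two texts per group and non-zero overall variance.
    """
    freqs_a = [relative_frequency(text, feature) for text in group_a]
    freqs_b = [relative_frequency(text, feature) for text in group_b]
    standard_error = math.sqrt(
        variance(freqs_a) / len(freqs_a) + variance(freqs_b) / len(freqs_b)
    )
    return (mean(freqs_a) - mean(freqs_b)) / standard_error


def zeta(group_a, group_b, feature):
    """Type 3: compare dispersion, i.e. the share of texts containing the feature."""
    share_a = sum(feature in text for text in group_a) / len(group_a)
    share_b = sum(feature in text for text in group_b) / len(group_b)
    return share_a - share_b


# Hypothetical toy corpora: each text is a list of tokens.
crime = [
    ["the", "murder", "was", "dark"],
    ["a", "murder", "and", "a", "clue"],
    ["the", "clue", "was", "murder"],
]
romance = [
    ["the", "love", "was", "sweet"],
    ["a", "kiss", "and", "a", "murder"],
    ["the", "love", "was", "love"],
]

if __name__ == "__main__":
    for measure in (log_likelihood, welch_t, zeta):
        print(measure.__name__, round(measure(crime, romance, "murder"), 4))
```

On this toy data all three scores are positive for "murder", but each is derived from different information: pooled counts, per-text frequencies, and presence across texts. The log-likelihood score is unsigned, so the direction of the difference has to be read from the counts themselves.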
In order to reach a deeper understanding of measures of distinctiveness and propose improvements to their implementation and use, we will proceed in several steps. We will create and publish suitable benchmark corpora. We will analyse a significant range of existing measures of distinctiveness to determine and compare their statistical properties, formalize them in a joint conceptual model and, based on this model, implement them in a common framework. We will implement and use several evaluation strategies to assess and compare the performance of these measures. We will conduct an in-depth application study comparing several subgenres of the contemporary French novel (highbrow literary novels and popular, lowbrow novels such as crime fiction, romance and science fiction). Finally, we will disseminate the key results of the study in academic publications as well as in the form of an interactive, pedagogical web portal.