Project

The project is being conducted at the Chair for Digital Humanities and the Trier Center for Digital Humanities at Trier University, Germany, and is being funded by the DFG (German Research Foundation) in two phases from 2020 to 2026.

Second phase: Beyond Words. Semantic and multiword distinctive features for an investigation of literary subgenres (2024–2026)

Contrastive text analysis, where one group of texts is compared to another, is a widely used procedure in linguistics and literary studies, both in qualitative and quantitative research designs. Measures of ‘keyness’ or ‘distinctiveness’ have been developed, evaluated, and used in a range of related fields, in particular Information Retrieval, Corpus and Computational Linguistics, and Computational Literary Studies. The project proposed here builds directly on the insights, experience, and results from the ongoing Zeta and Company project that works on a systematic, methodological exploration of this quantitative contrastive paradigm.

In Beyond Words, the literary domain we focus on is again the French contemporary novel, with a special focus on the three popular subgenres of science fiction, crime fiction, and sentimental novels, but English-language literary and non-literary corpora are also taken into account. The overall objective of Beyond Words is to significantly narrow the gap between the (statistically speaking) distinctive features of specific groups of exemplars of these literary subgenres, on the one hand, and their (meaningful, interpretive) relationship to an ambitiously complex understanding of the characteristic properties of literary subgenres, on the other hand. Our strategy to achieve this objective relies on a three-pronged approach. First, rather than focusing on single word forms, we extract more complex and semantically richer linguistic features from the texts, which we believe are better able to capture meaningful characteristics of literary subgenres. Second, we create a conceptualization of the subgenres that is both explicit and flexible by building fine-grained, descriptive, prototypical subgenre profiles based on a broad consideration of the relevant research literature. Third, we maintain our focus on qualitative and quantitative strategies for evaluating the discriminatory power and the interpretability of the distinctive features we identify.

With this approach, we can contribute decisively to Computational Literary Studies on two levels: methodological innovation regarding feature extraction and measures of distinctiveness suitable for complex features, and a deepened understanding of what constitutes subgenres conceptually and how the particular subgenres in question can best be described.

First phase: Zeta and Company. Measures of Distinctiveness for Computational Literary Studies (2021–2023)

The methodological and epistemological paradigm of comparison is deeply rooted in the Humanities. Whether in qualitative or quantitative research, comparison makes it possible to determine similarities and differences as well as affinities and contrasts; it thereby sharpens the analyst’s eye and adds specificity and meaningfulness to analyses. Against this backdrop, the purpose of the research described here is to enhance our understanding of, and propose improvements to, quantitative, comparative methods for analysing two or more collections of texts in the domain of Computational Literary Studies.

The focus has been on a key method in such comparative analyses: using statistical measures of distinctiveness that allow researchers to extract features (e.g. words or parts of speech) that are characteristic or ‘distinctive’ of a given group of texts when compared to another group of texts. In areas as diverse as Information Retrieval, Computational Linguistics and Computational Literary Studies, a wide range of statistical measures of distinctiveness have been developed for the fundamental task of identifying distinctive features.

Three broad types of measures can be distinguished, each using quite distinct information for its calculations. The first type takes as input the observed relative frequencies of the features in each of the two groups of texts, with the texts of each group taken together, and compares them (e.g. the log-likelihood test). The second type takes as input frequency distributions of the features built from each individual text in the two groups of texts (e.g. the t-test). The third type takes as input the dispersion of the features across the texts in each group, that is, it compares how evenly the features are distributed across each group of texts (e.g. Zeta).
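As a rough illustration of these three types, the following Python sketch computes one simplified measure of each type for a single word in a toy corpus. The toy documents, the function names and the specific variants chosen here (a two-term log-likelihood ratio on pooled counts, Welch's t statistic on per-document relative frequencies, and Zeta computed as a difference in document proportions) are assumptions made for illustration; they are not the project's own implementation.

```python
import math
from statistics import mean, variance

# Hypothetical toy data: each document is a list of tokens.
target = [
    "the detective found the body".split(),
    "the detective asked the witness".split(),
    "a detective never sleeps".split(),
]
comparison = [
    "the ship left the planet".split(),
    "the detective boarded the ship".split(),
    "a planet far away".split(),
]

def log_likelihood(word, group_a, group_b):
    """Type 1: compare frequencies pooled over each group (two-term G2)."""
    a = sum(doc.count(word) for doc in group_a)
    b = sum(doc.count(word) for doc in group_b)
    n_a = sum(len(doc) for doc in group_a)
    n_b = sum(len(doc) for doc in group_b)
    # Expected counts if the word were equally frequent in both groups.
    e_a = n_a * (a + b) / (n_a + n_b)
    e_b = n_b * (a + b) / (n_a + n_b)
    g2 = 0.0
    for observed, expected in ((a, e_a), (b, e_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def welch_t(word, group_a, group_b):
    """Type 2: compare per-document relative frequencies (Welch's t)."""
    freqs_a = [doc.count(word) / len(doc) for doc in group_a]
    freqs_b = [doc.count(word) / len(doc) for doc in group_b]
    std_err = math.sqrt(
        variance(freqs_a) / len(freqs_a) + variance(freqs_b) / len(freqs_b)
    )
    if std_err == 0:
        return float("nan")  # undefined when neither group shows variation
    return (mean(freqs_a) - mean(freqs_b)) / std_err

def zeta(word, group_a, group_b):
    """Type 3: difference in the share of documents containing the word."""
    share_a = sum(word in doc for doc in group_a) / len(group_a)
    share_b = sum(word in doc for doc in group_b) / len(group_b)
    return share_a - share_b

for w in ("detective", "planet", "the"):
    print(w, log_likelihood(w, target, comparison),
          welch_t(w, target, comparison), zeta(w, target, comparison))
```

In this sketch, the first function sees only the totals per group, the second sees one value per document, and the third only records whether a word occurs in a document at all. This is why the three types can rank the same features quite differently.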

In order to reach a deeper understanding of measures of distinctiveness and propose improvements to their implementation and use, we have created and published suitable benchmark corpora. We have analysed a significant range of existing measures of distinctiveness to determine and compare their statistical properties and to formalize them in a joint conceptual model; based on this model, we have implemented them in a common framework. We have implemented and used several evaluation strategies to assess and compare the performance of these measures. We have also conducted an application study comparing several subgenres of the contemporary French novel (highbrow literary novels and popular, lowbrow novels such as crime fiction, romance and science fiction). Finally, we have disseminated the key results of the study in academic publications as well as in the form of a pedagogical web portal.