This CLS talk has been cancelled.
CLS Talks showcase research done within the Centre for Language Studies (CLS), with the aim of increasing awareness of ongoing research in the institute and of facilitating discussion and collaboration between researchers. In addition, several external speakers are invited to share their work.
The sessions take place every month on Thursdays at 16:00 and are open to all interested researchers.
Abstract
Comparative judgement (CJ), an assessment method in which judges are shown pairs of texts side by side and asked to choose which is “better”, has recently been introduced as a method for generating reliable and valid proficiency scores for texts in learner corpora (Paquot et al., 2022). Recent (small-scale) studies have shown this approach to be effective for evaluating argumentative essays of varying lengths, even when the texts cover a narrow proficiency span (e.g. CEFR B2–C1) or diverse essay prompts (Thwaites, Kollias, et al., 2024). They have also found that CJ assessments made by judges recruited through a crowdsourcing platform have similar validity and reliability to those made by linguists recruited through a community-driven approach (Thwaites, Paquot, et al., 2024).
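For readers unfamiliar with how a rank order emerges from pairwise judgements: CJ data are conventionally scored with a Bradley–Terry (or Rasch-type) model, in which each text receives a latent quality score and the probability that one text "wins" a comparison depends on the difference between the two scores. The sketch below is a minimal, illustrative Bradley–Terry fit in Python; the toy judgement data, function names, and fitting choices are assumptions for illustration, not the scoring pipeline used in the study.

```python
import numpy as np

def fit_bradley_terry(judgements, n_texts, n_iter=500, lr=0.1, ridge=0.01):
    """Estimate a latent quality score per text from pairwise judgements.

    judgements: list of (winner, loser) text-index pairs.
    Returns centred scores; a higher score means the text tended to win.
    """
    theta = np.zeros(n_texts)
    winners = np.array([w for w, _ in judgements])
    losers = np.array([l for _, l in judgements])
    for _ in range(n_iter):
        # Model: P(winner beats loser) = sigmoid(theta_winner - theta_loser)
        p = 1.0 / (1.0 + np.exp(theta[losers] - theta[winners]))
        grad = np.zeros(n_texts)
        np.add.at(grad, winners, 1.0 - p)     # winners pushed up
        np.add.at(grad, losers, -(1.0 - p))   # losers pushed down
        theta += lr * (grad - ridge * theta)  # small ridge keeps estimates finite
    return theta - theta.mean()  # centre for identifiability

# Toy data: text 0 beats texts 1 and 2; text 1 beats text 2.
scores = fit_bradley_terry([(0, 1), (0, 2), (1, 2)], n_texts=3)
print(np.argsort(-scores))  # rank order, best text first -> [0 1 2]
```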
This presentation reports on an ongoing large-scale study investigating the extent to which rubric-based judges and CJ raters focus on the same linguistic features when assessing texts. A CJ task was created in which professional raters (N=66) assessed a representative sample of 1,300 texts from the ICLE corpus. Text-based measures representing the main rubric constructs (e.g., lexical complexity, cohesion) were then calculated on a subset of these texts (N=222) which had previously been manually error-annotated and assessed against the CEFR rubric in the context of another project (Thewissen, 2013).
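As a rough illustration of what such text-based measures look like in practice, the hypothetical snippet below computes two simple indices on a toy sentence. These are deliberately crude stand-ins; the study would rely on established operationalisations (e.g. more length-robust diversity indices than a raw type–token ratio).

```python
import re

def type_token_ratio(text):
    """Lexical diversity: unique word types / total word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_word_length(text):
    """A crude proxy for lexical sophistication."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(map(len, tokens)) / len(tokens) if tokens else 0.0

essay = "Arguably, globalisation has reshaped how learners write essays."
print(type_token_ratio(essay), mean_word_length(essay))
```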
The results showed that the rank order produced by the expert judges was highly reliable (SSR = .823) and that 30% of the variance in the rank order could be explained by the CEFR level of the texts. Higher-ranked texts were also found to contain fewer errors and to show higher levels of lexical sophistication, lexical diversity, syntactic complexity, and cohesion. Taken together, this suggests that comparative judgement can be used to evaluate L2 texts efficiently and that the resulting rank order is a reliable and valid representation of the proficiency level of the texts.
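The Scale Separation Reliability (SSR) statistic quoted above is commonly computed in the CJ literature as the proportion of observed score variance that is not attributable to estimation error. A hedged sketch, with made-up inputs, assuming that standard formula (the study's exact computation may differ):

```python
import numpy as np

def ssr(scores, standard_errors):
    """SSR = (observed score variance - mean squared SE) / observed variance."""
    obs_var = np.var(scores, ddof=1)
    err_var = np.mean(np.square(standard_errors))
    return (obs_var - err_var) / obs_var

scores = np.array([-1.2, -0.4, 0.1, 0.6, 1.3])  # illustrative score estimates
ses = np.array([0.3, 0.25, 0.3, 0.28, 0.35])    # illustrative standard errors
print(round(ssr(scores, ses), 3))                # ~0.902 on this toy input
```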