Support for AI-centred Education and Research at the Faculty of Arts

Date of news: 12 July 2022

Multimodal Communication Studies (MCS) is an AI-centred centre of gravity within the contributions of the Faculty of Arts to Radboud AI. During the first half of 2019, a series of four brainstorming sessions was organised with a variety of Faculty members. On the basis of the ideas collected, we specified the educational ICT building blocks in the form of pipelines, each consisting of a series of fundamental steps. The building blocks were specified so that they could be used as broadly as possible, across multiple educational modules and courses, and so that they could also serve as lab environments for research purposes. It was decided to develop two types of educational building blocks: one for the alignment of speech and text, and one for the alignment of sign language with other modalities such as text and audio.

Speech-text alignment

Speech-text alignment is a variant of automatic speech recognition (ASR) called forced alignment (FA). ASR converts speech in an audio file to text; FA additionally requires a text file with the correct transcription as input. Based on both inputs, the FA tool aligns audio and text by adding time stamps for the onset and offset of each spoken word and/or sound to the text file. Such alignments often form the basis for training ASR systems, but they are also very useful for subtitling audio (and video) recordings and for opening up transcribed audio archives for further exploration. By requesting alignment at the phone level, speech therapists can study specific disorders that manifest themselves in the (defective) realisation of phones. To work with FA (and interpret its outcomes), students need a deeper knowledge of how FA works and of which linguistic knowledge matters for it. This makes the speech-text alignment module relevant and educationally interesting for both AI and linguistics students.
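
To make this concrete, the sketch below shows one way FA output could feed a subtitling workflow. The word timings and the grouping rule are invented for illustration; real aligners emit formats such as Praat TextGrids rather than Python lists.

    # Hypothetical word-level FA output: (word, onset_sec, offset_sec).
    alignment = [
        ("welkom", 0.35, 0.82),
        ("bij", 0.82, 1.01),
        ("deze", 1.01, 1.34),
        ("les", 1.34, 1.80),
    ]

    def srt_time(seconds):
        """Format seconds as an SRT timestamp, e.g. 00:00:01,340."""
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    # Group words into one subtitle per roughly 2 seconds of speech.
    subtitles, words, start = [], [], alignment[0][1]
    for word, onset, offset in alignment:
        words.append(word)
        if offset - start >= 2.0:
            subtitles.append((start, offset, " ".join(words)))
            words, start = [], offset
    if words:
        subtitles.append((start, alignment[-1][2], " ".join(words)))

    for i, (t0, t1, text) in enumerate(subtitles, 1):
        print(f"{i}\n{srt_time(t0)} --> {srt_time(t1)}\n{text}\n")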

This module was designed and developed with and by students of Giphouse, a student-run company at Radboud University in which students develop software and practise entrepreneurship. The module was set up as a web portal following a stepwise walkthrough concept:

  1. Select project
  2. Upload files
  3. Forced Alignment
  4. Grapheme to Phoneme
  5. Check dictionary
  6. Overview

At each step, students learn which elements to consider. An important part of the learning content is how to treat words in the transcriptions that are not in the lexicon of the aligner. The aligner can only deal with words for which it has a pronunciation (a sequence of sound symbols) in its lexicon; any other word is regarded as out of vocabulary (OOV).
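
A minimal sketch of that OOV check, with a made-up two-word lexicon (real aligner lexicons map each word to its phone sequence in much the same way):

    # Hypothetical pronunciation lexicon: word -> sequence of sound symbols.
    lexicon = {
        "hallo": ["h", "A", "l", "o:"],
        "wereld": ["w", "e:", "r", "@", "l", "t"],
    }

    transcription = "hallo mooie wereld".split()

    # Words without a lexicon entry cannot be time-aligned; they must go
    # through the grapheme-to-phoneme step or be added to the dictionary.
    oov = [w for w in transcription if w.lower() not in lexicon]
    print("Out-of-vocabulary words needing a G2P step:", oov)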

The web portal is available at https://equestria.cls.ru.nl/ and can be used by anyone interested. Creating an account is required to safeguard the materials and to make sure that users can continue where they left off.

A guest researcher from Reykjavik University (Carlos Mena) was hosted at CLST in March-April 2022. He worked on a comparison of the Radboud-made FA algorithm underlying the portal (built with Kaldi) with a more generic, well-known FA implementation (the Montreal Forced Aligner).

Sign language alignment

For the sign language alignment modules, a completely different approach was chosen: a student developer was hired to build a pipeline in a Jupyter notebook environment on Google Colab. The material is composed of four modules:

  1. Analysing video: pictures, pixels and matrices, techniques for video segmentation
  2. Analysing sign language: a walkthrough for detecting resting positions of a signer
  3. Discovering individual signs: clustering algorithms to find some of the most frequent signs
  4. Aligning subtitles for people who do not understand NGT (Nederlandse Gebarentaal, Sign Language of the Netherlands)

The four online modules are aimed at a first exploration of how sign language videos can be analysed automatically. They can be used as self-study materials or in class. Just as for audio recordings of speech, automated analysis of sign videos can serve multiple functions: it can perform basic analysis for creating metadata about files, it can support or even replace time-consuming manual annotation for certain linguistic analyses, or it can contribute to higher-level language technologies such as machine translation. The four modules are not aimed at teaching how to write software code, but at helping students gain insight into the whole process of automated analysis, from understanding the nature of video data to incorporating machine learning techniques in the analysis. Along the way, students see snippets of Python code that may help them become more familiar with the structure of the language. Throughout the modules, students are invited to experiment with the parameter settings of simulators and to answer questions.
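
To give a flavour of the first two modules, the sketch below reads video frames as pixel matrices with OpenCV and flags low-motion stretches via simple frame differencing, one possible heuristic for spotting a signer's resting positions. The file name and threshold are invented, and the actual course modules may well use different techniques.

    import cv2
    import numpy as np

    # Hypothetical input file; any short sign language clip would do.
    cap = cv2.VideoCapture("sign_clip.mp4")
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0

    prev, motion = None, []
    while True:
        ok, frame = cap.read()  # frame: height x width x 3 pixel matrix
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            # Mean absolute pixel difference between consecutive frames;
            # low values suggest the signer is barely moving.
            motion.append(np.abs(gray.astype(int) - prev.astype(int)).mean())
        prev = gray
    cap.release()

    # Frames whose motion falls well below the clip's median are candidate
    # resting positions (the 0.3 factor is an arbitrary choice).
    threshold = 0.3 * np.median(motion)
    resting = [i / fps for i, m in enumerate(motion) if m < threshold]
    print("Candidate resting moments (seconds):", resting[:10])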

Application in teaching and research

The Equestria speech-text alignment module has been used in various courses in the curriculum of linguistics students: the master's course Research Methods in Language and Speech Pathology and the bachelor's courses Acoustic Phonetics and Linglab (where students have to set up and carry out their own research project). The Jupyter Colab sign language alignment modules have been used in the bachelor's course AI in Action as of May 2022; its students are second-year bachelor's students in linguistics, communication, science and history at the Faculty of Arts. The modules have also been important for jumpstarting the research of master's students working at the intersection of AI and sign language. They have been used in the following projects:

  • “Supporting Sign Language Learning With a Visual Dictionary” by Mark Wijkhuizen (https://github.com/MarkWijkhuizen/Supporting-Sign-Language-Learning-With-a-Visual-Dictionary). This MSc thesis introduces a visually searchable dictionary that allows sign language learners to perform a sign in front of a webcam and retrieve the most similar signs, and explores the linguistic and technological challenges of such a visually searchable NGT dictionary. The thesis was finished in Fall 2021, and a new MSc student will pick up this line of investigation in May 2022, with a focus on improving sign detection; a minimal sketch of the underlying retrieval idea follows this list.
  • “Mind the Linguistic Gap: The Importance of the Linguistic Properties and Data Representation of Sign Language in a Sign Spotting Task” (working title) by Javier Martínez Rodríguez. This MSc thesis investigates the extent to which a deep learning algorithm naturally learns linguistically-relevant features of sign language and how changes to the data might support it in doing so. The thesis will be completed in Fall 2022.
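
The sketch below illustrates only the retrieval idea behind such a visually searchable dictionary: represent every dictionary sign and the webcam query as feature vectors, then rank signs by cosine similarity. The random vectors are stand-ins for real pose or video embeddings, and the thesis itself may rank signs quite differently.

    import numpy as np

    # Random stand-ins for sign embeddings; a real system would compute
    # these from video, e.g. from estimated hand and body poses.
    rng = np.random.default_rng(0)
    dictionary = {f"sign_{i}": rng.normal(size=128) for i in range(500)}
    query = rng.normal(size=128)  # embedding of the learner's webcam sign

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Rank all dictionary signs by similarity to the query.
    ranked = sorted(dictionary, key=lambda s: cosine(dictionary[s], query),
                    reverse=True)
    print("Most similar signs:", ranked[:5])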

Currently, we are developing plans to extend the modules further and to add various levels tailored to the backgrounds of the students and researchers who need to use them (in both AI and multimodal communication studies).