Nijmegen Corpora

CLS researchers have compiled multiple language corpora and datasets for specific research questions. In an effort to share resources, many of these corpora and datasets are listed here. External links to additional documentation or source material are provided when available.

Corpus/dataset Summary

Baldey
A database of auditory lexical decision

Language: Dutch
Contact: Mirjam Ernestus

Auditory lexical decision experiment in which 5,541 Dutch content words and pseudo-words were presented to 20 native speakers.

Data file includes response times, accuracy rates, phonological information, morphological information and auditory stimuli with Praat text grids providing phonemic alignments.

Interested? Click here

BigListenNL

Language: Dutch
Contact: Odette Scharenborg

A database of 3000 frequent mono- and bisyllabic Dutch words uttered in isolation by four different speakers (two female). For the same project, web software was developed to embed these words in noise on the fly, with a choice of five different types of noise and adjustable SNRs.

CELEX

Language: Dutch
Contact: Mirjam Ernestus

Web-based version of the Center for Lexical Information (CELEX) database, containing Dutch word frequencies.

Interested? More information can be found here:

CGN
Corpus Gesproken Nederlands

Language: Dutch
Researcher: Nelleke Oostdijk

The Spoken Dutch Corpus (CGN) contains some 900 hours of spoken standard Dutch from adult speakers in Flanders and the Netherlands.

Researchers at Radboud University can download the CGN from the Radboud Software Center.

Other researchers can obtain the corpus here.

CogSci2016

Language: Dutch, English
Contact: Stefan Frank

Self-paced reading data on Dutch sentences (Dutch native speakers) and English sentences (Dutch and German native speakers).

Interested? Click here

Corpus NGT
Corpus Nederlandse Gebarentaal

Language: NGT (Dutch Sign Language)

The Corpus NGT is a collection of interaction data of deaf signers using Sign Language of the Netherlands (NGT). Data consist of recordings with multiple synchronized video cameras, accompanied by gloss and translation annotations for a growing subset of the data (annotation to be continued from 2022 to 2024).

Data type: Video, EAF (Elan)

Interested? Contact: Onno Crasborn

DBD/TCULT

Dutch Bilingualism Database / Talen en Culturen in Utrechtse Lombok en Transvaal

The DBD comprises data (over 1,500 sessions) originating from Dutch, Sranan, Sarnami, Papiamentu, Arabic, Berber and Turkish speakers. At the basis of the collection is the research project TCULT, in which intercultural language contacts in the Dutch city of Utrecht were studied. DBD established a first curation of the TCULT data and added many more bilingual data sets.

DELNN

Dutch English Lombard Native Non-Native

Languages: native Dutch, native English, non-native English

Contact: Mirjam Ernestus

The DELNN corpus contains plain (speech produced in quiet) and Lombard (speech produced in noise) speech from 9 native American-English speakers reading English stimuli and 30 native Dutch speakers reading English and Dutch stimuli.

Interested? Click here

ECHO
European Cultural Heritage Online: Sign Language Case Study

Languages: NGT (Sign Language of the Netherlands), BSL (British Sign Language), SSL (Swedish Sign Language)

This is a small but richly annotated corpus of three European sign languages. It contains linguistically annotated video files of Sign Language of the Netherlands (Nederlandse Gebarentaal), British Sign Language, and Swedish Sign Language; data include fable stories, dialogues, small lexicons, and some poetry.

Data type: Video, EAF (Elan)

Interested? Contact: Onno Crasborn

FAME!
Fame Speech Corpus

Language: Dutch, Frisian
Contact: Henk van den Heuvel

Consists of 203 audio segments, each approximately 5 minutes long, extracted from various radio programs covering a time span of almost 50 years (1966-2015), adding a longitudinal dimension to the database.
The content of the recordings is very diverse, including radio programs about culture, history, literature, sports, nature, agriculture, politics, society and languages.

Interested? Click here

Global Signbank

Language: NGT

Global Signbank is a lexical database that holds datasets from sign languages from all over the world. The original dataset is for Sign Language of the Netherlands (NGT); later additions include Kata Kolok (Bali, Indonesia), Norwegian Sign Language (NTS), and Israeli Sign Language (ISL). Shadow copies of lexical datasets from VGT (Flemish Sign Language), LSFB (French Belgian Sign Language), and ASL (American Sign Language) are also archived, as well as datasets based on a variety of publications on Gestuno and International Sign.

Contact: Onno Crasborn

Nanny

A multimodal corpus of speech to infant and adult listeners

Language: Dutch

Contact: Mirjam Ernestus

An audio and video corpus of speech addressed to 28 11-month-olds. The corpus allows comparisons between adult speech directed toward infants, familiar adults, and unfamiliar adult addressees, as well as comparisons of caregivers' word-teaching strategies across word classes.

Data type: Video and audio recordings + orthographic transcriptions.

Location: Journal of the Acoustical Society of America 134, EL534-EL540.

NEHOL
Negerhollands Database

This database contains a rich digitized corpus of historical as well as almost contemporary texts (1742-1936) in Negerhollands. A considerable and representative part of the database is annotated, which enables the user to make advanced search queries.

Interested? Click here

NCCCz
Nijmegen Corpus of Casual Czech

Language: Czech
Contact: Mirjam Ernestus

Around 30 hours of high-quality recordings featuring 60 Czech speakers from Prague conversing among friends.

Data type: Video and audio recordings + orthographic transcriptions.

Interested? Click here

NCCFr
Nijmegen Corpus of Casual French

Language: French
Contact: Mirjam Ernestus

35 hours of high-quality recordings featuring 46 French speakers conversing among friends.

Data type: Video and audio recordings + orthographic transcriptions.

Interested? Click here

NCCSp
Nijmegen Corpus of Casual Spanish

Language: Spanish
Contact: Mirjam Ernestus

Around 30 hours of high-quality recordings featuring 52 Spanish speakers from Madrid conversing among friends.

Data type: Video and audio recordings + orthographic transcriptions.

Interested? Click here

NCSE

Nijmegen Corpus of Spanish English

Language: English
Contact: Mirjam Ernestus

Around 38 hours of high-quality recordings featuring 34 Spanish speakers from Madrid talking in English to two Dutch confederates, in an informal and in a formal setting.

Data type: Video and audio recordings + orthographic transcriptions.

Interested? Click here

ND-corpus
Nijmegen-Dijkstra corpus

Language: Dutch
Contact: Nienke Dijkstra

One-day audio recordings of ca. 50 Dutch (typically developing) infants at three ages (8, 12, and 16 months), for a total of 16 hours * 50 infants * 3 ages = 2400 hours (infant speech and infant-directed speech).

Data type: Audio recordings (.wav files) + questionnaires on language development and temperament.

Interested? Click here

Normative data on Dutch idiomatic expressions: Native speakers

Language: Dutch, English

In the context of the research programme ‘Idiomatic Second Language Acquisition’, we collected normative data on 374 Dutch idiomatic expressions from 390 native speakers. In an online test, we asked participants to judge various dimensions of idiomatic expressions on a five-point scale: Frequency, Usage, Familiarity, Imageability, and Transparency. In addition, we objectively assessed their knowledge of idiom meaning by means of a multiple-choice question. The dataset contains the aggregated results per expression for the five subjective dimensions and the objective measure of idiom knowledge.

Interested? Contact: Ferdy Hubers

Polyphone

Language: Dutch
Contact: Henk van den Heuvel

Contains speech from 5050 speakers from all regions of the Netherlands. Speakers had to read digits and sentences and answer questions over the telephone.

Interested? Click here

SoNaR

Language: Dutch
Contact: Nelleke Oostdijk

SoNaR is a 500-million-word reference corpus of contemporary (1954-2012) written Dutch, drawn from a wide variety of text types, including both texts from conventional media and texts from the new media.

Location: online version OpenSoNaR