CLS researchers have developed several software packages and tools. To share these resources, many of them are listed here. External links to additional documentation or source material are provided when available.

Software Description

Colibri Core
Efficient n-gram & skipgram modelling on text corpora

Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At its core is the tool colibri-patternmodeller, which allows you to build, view, manipulate and query pattern models.
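To illustrate what n-grams and fixed-size skipgrams are, the following is a minimal pure-Python sketch. It does not use the Colibri Core library or its API; the function names and the `GAP` marker are assumptions made for this example only, and it makes no attempt at the memory efficiency that Colibri Core aims for.

```python
from collections import Counter
from itertools import combinations

GAP = "{*}"  # hypothetical gap marker, chosen for this sketch only


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams(tokens, n):
    """Return fixed-size skipgrams of length n.

    A skipgram here is an n-gram in which one or more inner positions
    are replaced by a gap; the first and last token are always kept.
    """
    result = []
    inner = range(1, n - 1)  # positions that may become gaps
    for gram in ngrams(tokens, n):
        for k in range(1, len(inner) + 1):
            for gaps in combinations(inner, k):
                result.append(tuple(GAP if i in gaps else tok
                                    for i, tok in enumerate(gram)))
    return result


tokens = "to be or not to be".split()
bigram_counts = Counter(ngrams(tokens, 2))    # counts of contiguous bigrams
skip_counts = Counter(skipgrams(tokens, 3))   # counts of patterns like ("to", GAP, "or")
```

Counting patterns in a `Counter` is the simplest possible stand-in for a pattern model; a real pattern model would store patterns far more compactly.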

A practical XML-based Format for Linguistic Annotation

FoLiA is an XML-based annotation format, suitable for the representation of linguistically annotated language resources. FoLiA’s intended use is as a format for storing and/or exchanging language resources, including corpora.

An advanced Natural Language Processing suite for Dutch

Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on TiMBL, the Tilburg memory-based learning software package. Frog's current version tokenizes, tags, lemmatizes, and morphologically segments word tokens in Dutch text files, assigns a dependency graph to each sentence, identifies the base phrase chunks in the sentence, and attempts to find and label all named entities.

Language and Speech Software Portal

The portal provides access to software developed at CLST, the Language Machines research group, and partners.

Tilburg Memory-Based Tagger

MBT is a memory-based tagger-generator and tagger in one. The tagger-generator part can generate a sequence tagger on the basis of a training set of tagged sequences; the tagger part can tag new sequences. MBT is used by Frog for Dutch tagging.

Tools for the TiCC Software Stack

TiCCUtils is a generic utility library shared by several parts of the TiCC (Tilburg centre for Cognition and Communication) software stack, i.e. TiMBL and most software that builds on TiMBL.

Tilburg Memory-Based Learner

TiMBL is an open source software package implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification with feature weighting suitable for symbolic feature spaces, and IGTree, a decision-tree approximation of IB1-IG.
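The idea behind k-nearest neighbor classification with feature weighting on symbolic features can be sketched as follows. This is an illustrative toy version, not TiMBL's implementation or API: it assumes an overlap distance in which each mismatching feature contributes its information gain, and all function names are hypothetical. TiMBL itself may differ in details such as tie handling and the exact weighting scheme.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(instances, labels, feature):
    """Entropy reduction obtained by knowing the value of one feature."""
    by_value = {}
    for inst, label in zip(instances, labels):
        by_value.setdefault(inst[feature], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder


def classify(instances, labels, query, k=1):
    """Weighted-overlap k-NN: each mismatching feature adds its weight."""
    weights = [information_gain(instances, labels, f)
               for f in range(len(query))]

    def distance(inst):
        return sum(w for w, a, b in zip(weights, inst, query) if a != b)

    # Stable sort: ties keep their order in the training set.
    ranked = sorted(zip(instances, labels), key=lambda pair: distance(pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: feature 0 predicts the class, feature 1 is uninformative.
train = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
classes = ["N", "N", "V", "V"]
prediction = classify(train, classes, ("a", "z"))
```

Because feature 1 carries no information about the class in this toy data, its weight is zero and the unseen value `"z"` does not affect the distance.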

An advanced rule-based unicode-aware tokenizer

Ucto tokenizes text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation.
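The kind of preprocessing described here can be sketched with a single regular expression. This is a deliberately naive illustration and not Ucto's rule set or interface: the pattern, the set of sentence-final marks, and the `tokenize` function are assumptions for this example, and abbreviations such as "e.g." would be split incorrectly.

```python
import re

# A word (optionally with inner hyphens or apostrophes), or one punctuation mark.
# In Python 3, \w matches Unicode letters and digits for str patterns.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def tokenize(text, lowercase=False):
    """Split text into sentences, each a list of word and punctuation tokens."""
    sentences, current = [], []
    for match in TOKEN.finditer(text):
        token = match.group()
        current.append(token.lower() if lowercase else token)
        if token in SENTENCE_END:
            # Naive rule: every sentence-final mark closes the sentence.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)  # trailing text without final punctuation
    return sentences


sentences = tokenize("Hello, world! Is this Ucto-like?")
```

A rule-based tokenizer typically replaces the single pattern above with an ordered list of language-specific rules, which is what makes handling abbreviations, URLs and similar cases possible.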

Webdemo Description

Oral history annotation tool
Metadata editor for Oral History recordings

A web application to list, correct and add metadata specifications for Oral History (OH) recordings. Each OH metadata item contains administrative and descriptive information about a specific recording (typically an interview). The description of an OH metadata item is defined by the CLARIN Component Registry.

Corpus Editor for Syntactically Annotated Resources

A web application that offers a simple way to do syntactic research in annotated text corpora. It communicates with the CorpusStudioWeb back-end 'Crpp'. It has two main purposes: (1) browsing texts and (2) conducting syntactic searches with definable output per hit. Searches are translated to XQuery 'under the hood'.

Coreference Editor for Syntactically Annotated XML corpora

Both a semi-automatic coreference resolution system and a Windows program that allows batch conversion between different syntactic annotation formats.

Collection Bank

A web application to list, correct and add "Corpus Collections". Each corpus collection is a bundle of resources. The description of a corpus collection is defined by the CLARIN Component Registry.

Corpus search project handler - Windows

A Windows program that facilitates in-depth quantitative syntactic research for linguists.

Corpus search project handler - Web application

A web application that facilitates in-depth quantitative syntactic research for linguists.

Digital Literacy Instructor

Dig Lin is a Dutch Digital Literacy Instructor.

Dictionary of Brabantic dialects

e-WBD is a tool for searching the dictionaries of Brabantic dialects.

Dictionary of Limburgian dialects

e-WLD is a tool for searching the dictionaries of Limburgian dialects.

Dictionary of dialects in Gelderland

e-WGD is a tool for searching the dictionaries of dialects in Gelderland.

Dictionary of dialects in Achterhoek and Liemers

e-WALD is a tool for searching the dictionaries of dialects in Achterhoek and Liemers.

Add links to a named-entity enriched .folia.xml file

Performs named-entity linking for a .folia.xml file that has already been enriched with named entities. The program is a C# application that also runs under Mono on Linux.

Add a 'surfaced' syntax layer to a Dutch .folia.xml file

Given a .folia.xml file with POS-tagged text, this tool parses the text syntactically by calling Alpino and then adds the parse to the FoLiA file as a <syntax> layer. The Alpino parse is 'surfaced' so that all resulting constituents are continuous (on the surface and below it).

English spelling corrector

Fowlt is an online, free-to-use, context-sensitive English spelling checker. It follows the setup of the Dutch spelling checker Valkuil.net. In its stand-alone form, Fowlt is an application that takes plain text as input and returns FoLiA XML with information about the detected errors and possible corrections.

Lama Events
Calendar application

Lama Events is a calendar application listing events in the near future. The events are detected in and selected from the Dutch Twitter stream by a fully automatic procedure.

Limburgse Spelling

Limburgsespelling.nl: learn about Limburgian spelling and practise it.

Dutch-Frisian and Frisian-Dutch translator

Oersetter is an automatic translation system for Dutch-Frisian and Frisian-Dutch, developed at the University of Nijmegen, on behalf of, and in collaboration with, the Frisian Academy.

Word predictor

Soothsayer tries to predict what you're going to type, as you type. The goal of Soothsayer is to show that prediction works best when you "train" the software on text you have written previously.

Stemming 2017
Election results predictor

Stemming 2017 is an experiment to predict the results of the 2017 Dutch general elections, based on tweets.

Dutch spelling corrector

Valkuil is an online, free-to-use, context-sensitive Dutch spelling checker. In its stand-alone form, Valkuil is an application that takes plain text as input and returns FoLiA XML with information about the detected errors and possible corrections.