Supported file formats

Preferred metadata format for distribution through our CLARIN C Centre:

Preferred data formats for data distribution through ACE:

ACE is the CLARIN Knowledge Centre for Atypical Communication Expertise. Due to its collaboration with The Language Archive of the MPI for Psycholinguistics, ACE accepts all data formats listed here:

Preferred input formats for our webservices:

You will need to consult the webservice specification for webservices you intend to use, but in general, we have the following preferred input formats:

  • plain text (UTF-8 encoded) - (MIME: text/plain) - The most basic text format; it does not offer any further structural or mark-up semantics. Certain webservices may impose additional constraints, such as requiring one sentence or even one word per line (e.g. for a lexicon). Webservices may also attempt to infer structure such as paragraphs automatically from newlines. You should always use the UTF-8 (Unicode) character encoding; other encodings are often not supported. UNIX line endings are preferred by most tools.
  • FoLiA XML - (MIME: application/folia+xml) - An XML-based format for linguistic annotation, developed in the scope of the CLARIN-NL, CLARIAH-CORE and CLARIAH-PLUS projects and used as a format for language resource exchange and corpus representation. It is widely used in the Dutch and Flemish NLP community, and many of our NLP tools accept and/or produce FoLiA XML. See the FoLiA documentation for more information about this format.
  • WAV - (MIME: audio/x-wav) - Most of our audio services accept this format, preferably as 16 or 24 bit, 44.1 or 48 kHz, uncompressed PCM, with 1 or 2 channels. Our audio services will likely also accept MP3 and OGG, but these come with quality loss due to compression.
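
The preferences above for plain text (UTF-8, UNIX line endings) and WAV (uncompressed PCM, 16/24 bit, 44.1/48 kHz, 1 or 2 channels) can be checked locally before uploading. The following is a minimal sketch using only the Python standard library; the function names normalise_text and check_wav are hypothetical and not part of any of our webservices, and the source encoding of a text file is assumed to be known in advance.

```python
import wave


def normalise_text(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Decode raw bytes, convert line endings to UNIX style, re-encode as UTF-8.

    source_encoding is an assumption supplied by the caller; it is not detected.
    """
    text = raw.decode(source_encoding)
    # Convert Windows (CRLF) and old Mac (CR) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


def check_wav(path: str) -> list:
    """Return a list of deviations from the preferred WAV parameters.

    An empty list means the file matches the preferences described above.
    """
    try:
        with wave.open(path, "rb") as wav:
            sample_width = wav.getsampwidth()  # bytes per sample
            frame_rate = wav.getframerate()    # samples per second
            channels = wav.getnchannels()
    except wave.Error as exc:
        # The wave module only reads uncompressed PCM WAV files.
        return ["not readable as uncompressed PCM WAV: %s" % exc]

    problems = []
    if sample_width not in (2, 3):
        problems.append("sample width is %d bit, expected 16 or 24" % (sample_width * 8))
    if frame_rate not in (44100, 48000):
        problems.append("sample rate is %d Hz, expected 44100 or 48000" % frame_rate)
    if channels not in (1, 2):
        problems.append("%d channels, expected 1 or 2" % channels)
    return problems
```

For example, normalise_text(data, "latin-1") would convert a Latin-1 file with Windows line endings, and check_wav("recording.wav") would list any parameters that differ from the preferred ones. Individual webservices may still pose stricter or looser requirements than this sketch checks.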

Non-preferred input formats for our webservices:

  • Microsoft Word/Excel/etc. - For both NLP and data storage purposes, proprietary word-processing formats such as Microsoft Word (doc/docx) are always discouraged. Alternatives such as Open Document Text (ODT) are preferred. Nevertheless, we offer a webservice (Piereling) which can attempt to convert such documents.
  • PDF - PDF is ubiquitous but often very problematic as an *input* format for NLP purposes, as it is hard to extract the textual information; other formats are preferred. It may be a valid source of input for OCR, such as that provided by our webservice PICCL. It is also a valid output format for presentational purposes.
  • HTML - Like PDF, HTML is suitable for output, but hardly ever makes a good input format.