Supported file formats
Preferred metadata format for distribution through our CLARIN C Centre:
- CMDI - This format is explained here: https://www.clarin.eu/content/component-metadata
Preferred data formats for data distribution through ACE:
ACE is the CLARIN Knowledge Centre for Atypical Communication Expertise (see https://ace.ruhosting.nl/). Due to its collaboration with The Language Archive of the MPI for Psycholinguistics, ACE accepts all data formats listed here: https://archive.mpi.nl/tla/accepted-file-formats.
Preferred input formats for our webservices:
You will need to consult the specification of each webservice you intend to use, but in general, we have the following preferred input formats:
- plain text (UTF-8 encoded) - (MIME: text/plain) - The most basic text format; it does not offer any further structural or mark-up semantics. Certain webservices may pose additional constraints, such as requiring one sentence or even one word per line (e.g. for a lexicon). Webservices may also attempt to automatically infer structure, such as paragraphs, based on newlines. You should always use the UTF-8 (Unicode) character encoding; other encodings are often not supported. UNIX line endings are preferred by most tools.
- FoLiA XML - (MIME: application/folia+xml) - This is an XML-based format for linguistic annotation, developed in the scope of the CLARIN-NL, CLARIAH-CORE and CLARIAH-PLUS projects and used as a format for language resource exchange and corpus representation. It is widely used in the Dutch and Flemish NLP community, and many of our NLP tools accept and/or produce FoLiA XML. See https://proycon.github.io/folia for more information about this format.
- WAV - (MIME: audio/x-wav) - Most of our audio services accept this format, preferably as 16 or 24 bit, 44.1 or 48 kHz, uncompressed PCM, with 1 or 2 channels. Our audio services will likely also accept MP3 and OGG, but these come with quality loss due to compression.
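The recommendations for plain text and WAV above can be checked locally before uploading. The following is a minimal sketch using only the Python standard library; the function names normalize_text and check_wav are hypothetical helpers written for this illustration, not part of any of our webservices, and the accepted values simply mirror the preferences listed above. Individual webservices may be stricter or more lenient.

```python
import wave
from pathlib import Path


def normalize_text(src: Path, dst: Path, src_encoding: str = "utf-8") -> None:
    """Re-encode a text file as UTF-8 with UNIX (LF) line endings.

    Hypothetical helper: src_encoding must be supplied by the caller,
    no attempt is made to detect the encoding automatically.
    """
    # Read raw bytes so Python does not silently translate line endings.
    text = src.read_bytes().decode(src_encoding)
    # Drop a leading byte-order mark if one is present.
    if text.startswith("\ufeff"):
        text = text[1:]
    # Convert Windows (CRLF) and classic Mac (CR) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    dst.write_bytes(text.encode("utf-8"))


def check_wav(path: Path) -> list:
    """Return a list of deviations from the preferred WAV properties.

    An empty list means the file matches the preferences stated above.
    The wave module itself raises wave.Error for files it cannot read
    as uncompressed PCM.
    """
    problems = []
    with wave.open(str(path), "rb") as wav:
        bits = wav.getsampwidth() * 8
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if bits not in (16, 24):
        problems.append(f"sample width is {bits} bit, expected 16 or 24")
    if rate not in (44100, 48000):
        problems.append(f"sample rate is {rate} Hz, expected 44100 or 48000")
    if channels not in (1, 2):
        problems.append(f"{channels} channels, expected 1 or 2")
    return problems
```

For example, normalize_text(Path("in.txt"), Path("out.txt"), src_encoding="latin-1") would convert a Latin-1 file with Windows line endings, and check_wav(Path("recording.wav")) returns an empty list when the file matches the preferred properties. The file names here are placeholders.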
Non-preferred input formats for our webservices:
- Microsoft Word/Excel/etc. - For both NLP and data storage purposes, proprietary word-processing formats such as Microsoft Word (doc/docx) are always discouraged. Alternatives such as Open Document Text (ODT) are preferred. Nevertheless, we offer a webservice (Piereling) which can attempt to convert such documents.
- PDF - The use of PDF is ubiquitous but often very problematic as an *input* format for NLP purposes, as it is hard to extract the textual information; other formats are preferred. It may be a valid source of input for OCR, such as that provided by our webservice PICCL. It is also a valid output format for presentational purposes.
- HTML - Like PDF, HTML is suitable for output, but hardly ever makes a good input format.