🗃️ NLPre-PL Dataset


Test datasets


The NLPre-PL benchmark covers a set of linguistic tasks (segmentation, lemmatization, morphological analysis, part-of-speech tagging, and dependency parsing) and provides a collection of manually annotated test datasets for evaluating NLP models on these tasks.

The following test datasets are selected for the NLPre-PL benchmark:

  • nkjp1m-test – the test split of the 1M-word manually annotated portion of the National Corpus of Polish (the 220307_NKJP1M_NODIG release in TEI format), available in two tagsets: NKJP (Morfeusz) and UD. The NKJP1M sentences are parsed automatically, and the resulting dependency trees are not manually verified. Consequently, the NKJP1M test set is suitable for all tasks except those that rely on accurate dependency parses.
  • pl_pdb-ud-test.conllu – the test part of the manually annotated Polish Dependency Bank 2.0 automatically converted into the UD schema.
  • pl_pdb-ud-narrative-test.conllu – the test part of the narrative-domain subset of the manually annotated Polish Dependency Bank 3.0, automatically converted into the UD schema. This test dataset is not publicly accessible; it is included specifically to check whether a model's training data has been contaminated by test data. It comprises 1831 dependency trees with a total of 24K tokens. These UD trees follow the same annotation scheme as those in pl_pdb-ud-test.conllu.
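
Size figures such as the tree and token counts above can be checked on any of the .conllu files with a few lines of Python. The following is a minimal sketch, not part of the benchmark tooling: the function name count_conllu and the sample sentences are made up for illustration, and it assumes the standard CoNLL-U layout (blank-line-separated sentences, comment lines starting with "#", tab-separated token lines). It counts only lines with a plain integer ID, so multiword-token ranges such as "1-2" are skipped; whether this matches the way the benchmark counts "tokens" is an assumption.

```python
from typing import Iterable, Tuple


def count_conllu(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of trees, number of syntactic words) in CoNLL-U input."""
    trees = 0
    words = 0
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if one was open.
            if in_sentence:
                trees += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment (e.g. "# text = ..."); not a token.
            continue
        token_id = line.split("\t", 1)[0]
        if token_id.isdigit():
            # Plain integer ID: a syntactic word.
            # Ranges ("1-2") and empty nodes ("1.1") are skipped.
            words += 1
            in_sentence = True
    if in_sentence:
        # The last sentence may not be followed by a blank line.
        trees += 1
    return trees, words


def _row(token_id: str, form: str) -> str:
    # Build a 10-column CoNLL-U line with placeholder annotation ("_").
    return "\t".join([token_id, form] + ["_"] * 8)


# Made-up sample, for illustration only (not taken from the benchmark data).
sample = "\n".join([
    "# text = Zrobiłem to.",
    _row("1-2", "Zrobiłem"),
    _row("1", "Zrobił"),
    _row("2", "em"),
    _row("3", "to"),
    _row("4", "."),
    "",
    "# text = Dobrze.",
    _row("1", "Dobrze"),
    _row("2", "."),
])

n_trees, n_words = count_conllu(sample.splitlines())
print(n_trees, n_words)
```

To run the same count on a real file, pass an open file object instead of the sample, for example the result of open("pl_pdb-ud-test.conllu", encoding="utf-8"), assuming the file has been downloaded to the working directory.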



Test textual data


Download the zip file with the textual data to be processed.