SweLL
Collection
Språkbanken Text
Corpus published 2025 via Språkbanken Text
Included resources
SweLL-gold is a second language learner corpus, featuring pseudonymization, normalization and correction-annotation
SweLL-pilot is a second language learner corpus, featuring CEFR labeling
DaLAJ resources are a collection of sentence pairs (original - corrected) containing one error each
MultiGED (Multilingual Grammatical Error Detection) is a dataset for grammatical error detection covering five languages (Czech, German, English, Italian, Swedish). The data is organized by sentence, and each token is annotated as correct or incorrect (c or i). Corrected versions are not provided. MultiGED was used in a shared task (https://spraakbanken.github.io/multiged-2023/)
MuClaGED (Multi-Class Grammatical Error Detection) is a Swedish-only dataset organized by sentence, in which each incorrect token is labeled with the type of correction (Orthography, Syntax, Morphology, etc.) and the type of edit (Addition, Deletion, Replacement)
MultiGEC (Multilingual Grammatical Error Correction) is a dataset for grammatical error correction, featuring twelve languages (Czech, English,