OldBERTur: Named Entity Recognition for Diplomatic Old Icelandic
This model performs Named Entity Recognition (NER) on diplomatic Old Icelandic texts sourced from medieval manuscripts, identifying Person and Location entities.
Model Description
- Model type: Token classification (NER)
- Base model: mideind/IceBERT
- Language: Old Icelandic (diplomatic transcription level)
- Entity types: Person, Location (BIO tagging scheme)
- F1 Score: 0.79
This model is fine-tuned for NER as a token classification task on top of a version of IceBERT that was first domain-adapted through Masked Language Modelling (MLM) on a corpus of 15 Old Icelandic texts. The NER model is designed for diplomatic Old Icelandic texts as defined in the Menota diplomatic transcription level description.
For normalised transcriptions of Old Icelandic texts, please use oldbertur-normalised-old-icelandic-ner instead.
Intended Uses
- Named entity recognition in diplomatic Old Icelandic texts
- Digital humanities research on Medieval Icelandic literature
- Semi-automatic annotation of historical Icelandic documents
- Information extraction from saga literature and historical texts
How to Use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Riksarkivet/oldbertur-diplomatic-old-icelandic-ner")
model = AutoModelForTokenClassification.from_pretrained("Riksarkivet/oldbertur-diplomatic-old-icelandic-ner")

# Use aggregation_strategy="first" to properly combine subword tokens
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = "Jſland byɢþiſc fyrſt vr Norvegi a dogom Harallz Halfdanar ſonar"
results = ner_pipeline(text)

for entity in results:
    # entity['word'] can corrupt multi-byte UTF-8 characters; slice the
    # original text with the start/end offsets instead
    word = text[entity['start']:entity['end']]
    print(f"{word}: {entity['entity_group']} ({entity['score']:.3f})")
```
Expected output:

```
Jſland: Location (1.000)
Norvegi: Location (1.000)
Harallz Halfdanar ſonar: Person (1.000)
```
Training Data
This model uses the MR + I + MIM training configuration, combining:
| Source | Description |
|---|---|
| Menota (M) | Diplomatic Old Icelandic texts from the Medieval Nordic Text Archive |
| IcePaHC (I) | Icelandic Parsed Historical Corpus (normalised Old Icelandic texts) |
| MIM-GOLD-NER (MIM) | Modern Icelandic NER data for data augmentation |
The superscript R indicates that sentence-level class resampling (sCR) was applied to the diplomatic Menota data. This increases the amount of diplomatic data, while the normalised IcePaHC and modern MIM-GOLD-NER data augment the diplomatic target data.
Training set statistics:
| Configuration | Person | Location | Total |
|---|---|---|---|
| MR + I + MIM | 40,191 | 12,326 | 52,517 |
Entity breakdown by source:
| Source | Person | Location | Total |
|---|---|---|---|
| Menota (M) | 3,969 | 472 | 4,441 |
| IcePaHC (I) | 2,797 | 362 | 3,159 |
| MR after resampling | 20,485 | 2,622 | 23,107 |
| MIM-GOLD-NER | 15,587 | 9,002 | 24,589 |
Evaluation sets:
- Dev: 1,122 entities (912 Person; 210 Location)
- Test: 1,153 entities (922 Person; 231 Location)
The dev and test sets consist exclusively of diplomatic Old Icelandic texts in order to reflect our target domain.
Evaluation Results
| Metric | Score |
|---|---|
| F1 | 0.79 |
| Precision | 0.77 |
| Recall | 0.81 |
Labels
The model uses BIO tagging with the following labels:
| Label | Description |
|---|---|
| O | Outside any entity |
| B-Person | Beginning of a person name |
| I-Person | Inside/continuation of a person name |
| B-Location | Beginning of a location name |
| I-Location | Inside/continuation of a location name |
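As a toy illustration, the example sentence from the usage section would receive the following word-level tags (a hypothetical alignment assuming whitespace tokenisation; the model itself operates on subwords and aggregates them back to words):

```python
# Hypothetical word-level BIO tagging of the usage example, assuming
# simple whitespace tokenisation.
tokens = ["Jſland", "byɢþiſc", "fyrſt", "vr", "Norvegi",
          "a", "dogom", "Harallz", "Halfdanar", "ſonar"]
tags = ["B-Location", "O", "O", "O", "B-Location",
        "O", "O", "B-Person", "I-Person", "I-Person"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Multi-word names such as "Harallz Halfdanar ſonar" get one B- tag followed by I- tags, which is how the pipeline's aggregation groups them into a single Person entity.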
Limitations
- Orthography: This model is trained on diplomatic texts. For normalised transcriptions, use oldbertur-normalised-old-icelandic-ner.
- Entity types: Only Person and Location entities are supported. Other entity types (organisations, dates, etc.) are not recognised due to scarcity in the training data.
- Time period: Primarily trained on texts from 1250-1400 CE. Performance may vary on texts from other periods.
- Domain: Optimised for saga literature and historical texts. May perform differently on other text types.
Domain-Adaptation
To adapt the underlying model, IceBERT, which is pre-trained on modern texts, we use task-adaptive pre-training (TAPT) to familiarise the model with Old Icelandic diplomatic texts prior to fine-tuning for NER. We use a corpus of 15 unannotated diplomatic Old Icelandic texts from Menota for the domain-adaptation.
Hyperparameters:
- Base model: mideind/IceBERT
- Epochs: 8
- Learning rate: 3e-5
- Batch size: 32
- Max sequence length: 256 tokens
- Warm-up ratio: 6%
- MLM masking probability: 15%
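The masking step at the core of the MLM objective can be sketched in plain PyTorch (a toy illustration of the standard BERT recipe with a 15% masking probability; in practice this step is handled by a data collator such as transformers' `DataCollatorForLanguageModeling`):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size,
                mlm_prob=0.15, seed=0):
    """Select ~15% of positions as MLM targets.

    Of the selected positions, 80% are replaced with [MASK], 10% with
    a random token, and 10% are left unchanged (the standard BERT
    recipe). Returns (masked_inputs, labels) where labels are -100 at
    unselected positions so the loss ignores them.
    """
    g = torch.Generator().manual_seed(seed)
    labels = input_ids.clone()
    selected = torch.bernoulli(
        torch.full(input_ids.shape, mlm_prob), generator=g).bool()
    labels[~selected] = -100  # ignored by cross-entropy

    masked = input_ids.clone()
    replace = torch.bernoulli(
        torch.full(input_ids.shape, 0.8), generator=g).bool() & selected
    masked[replace] = mask_token_id
    randomize = torch.bernoulli(
        torch.full(input_ids.shape, 0.5), generator=g).bool() & selected & ~replace
    masked[randomize] = torch.randint(
        vocab_size, input_ids.shape, generator=g)[randomize]
    return masked, labels

ids = torch.arange(100, 120).unsqueeze(0)  # one toy sequence of 20 token ids
masked, labels = mask_tokens(ids, mask_token_id=4, vocab_size=50000)
print((labels != -100).float().mean())  # fraction of positions in the loss
```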
NER Training Procedure
NER is framed as a token classification task, with a classification head added on top of the domain-adapted IceBERT.
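A minimal sketch of that setup (IceBERT is a RoBERTa-style model; the head below is randomly initialised for illustration, whereas the released model starts from the domain-adapted weights):

```python
from transformers import RobertaConfig, RobertaForTokenClassification

labels = ["O", "B-Person", "I-Person", "B-Location", "I-Location"]
config = RobertaConfig(
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# A linear classification head over the five BIO labels is added on
# top of the encoder's per-token hidden states.
model = RobertaForTokenClassification(config)
print(model.num_labels)  # 5
```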
Hyperparameters:
- Epochs: 5
- Learning rate: 2e-5
- Batch size: 16
- Max sequence length: 256 tokens
- Warm-up ratio: 10%
- Weight decay: 0.01
Class imbalance handling:
- Weighted cross-entropy loss with class weights: 0.1 for non-entities (O), 30.0 for entity classes to address class imbalance in the training data.
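The weighted loss can be sketched as follows (a minimal illustration; the label order [O, B-Person, I-Person, B-Location, I-Location] is an assumption made for this example):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed label order: O first, then the four entity tags.
# O is down-weighted to 0.1; entity classes are up-weighted to 30.0.
class_weights = torch.tensor([0.1, 30.0, 30.0, 30.0, 30.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

# Toy batch: logits of shape (num_tokens, num_labels) and one gold
# label per token; -100 marks positions ignored by the loss
# (e.g. subword continuations or padding).
logits = torch.randn(8, 5)
labels = torch.tensor([0, 0, 1, 2, 0, 3, 0, -100])
loss = loss_fn(logits, labels)
print(loss.item())
```

With these weights, errors on the rare entity tokens dominate the loss, counteracting the overwhelming share of O tokens in the training data.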
Citation
If you use this model, please cite:
@misc{coming_soon,
}
Resources
Our code and data are available in our GitHub repository; for a full description of the work, see our paper (coming soon).
- Code & Data: GitHub Repository
- Normalised NER Model: oldbertur-normalised-old-icelandic-ner
- Paper: Coming soon.
- Base Model: mideind/IceBERT
Acknowledgments
We are grateful for the work carried out by the projects below and for the opportunity to use their data in our academic research and in developing NER models for Medieval Icelandic. We thank the developers, annotators, scholars, project managers, and everyone else who has contributed to these projects. We also express our sincere gratitude to the students from Uppsala University who assisted in marking and annotating entities in the two Menota works Codex Wormianus (AM 242 fol) and Vǫluspá in Hauksbók (AM 544 4to).
- Menota: The Menota project
- Icelandic Parsed Historical Corpus (IcePaHC): Wallenberg et al., 2024
- IceBERT base model: Snæbjarnarson et al. 2022
- MIM-GOLD-NER: Ingólfsdóttir et al. 2020
License
This model is released under the GNU General Public License v3.0 (GPL-3.0).
Contact
For questions or issues, please open an issue on the GitHub repository or contact: phenningsson@me.com