OldBERTur: Named Entity Recognition for Diplomatic Old Icelandic
This model performs Named Entity Recognition (NER) on diplomatic Old Icelandic texts sourced from medieval manuscripts, identifying Person and Location entities.
Model Description
- Model type: Token classification (NER)
- Base model: mideind/IceBERT
- Language: Old Icelandic (diplomatic transcription level)
- Entity types: Person, Location (BIO tagging scheme)
- F1 Score: 0.79
This model is fine-tuned for NER as a token classification task on top of a version of IceBERT that was first domain-adapted through Masked Language Modelling (MLM) on a corpus of 15 Old Icelandic texts. The NER model is designed for diplomatic Old Icelandic texts as defined in the Menota diplomatic transcription level description.
For normalised transcriptions of Old Icelandic texts, please use oldbertur-normalised-old-icelandic-ner instead.
Intended Uses
- Named entity recognition in diplomatic Old Icelandic texts
- Digital humanities research on Medieval Icelandic literature
- Semi-automatic annotation of historical Icelandic documents
- Information extraction from saga literature and historical texts
How to Use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Riksarkivet/oldbertur-diplomatic-old-icelandic-ner")
model = AutoModelForTokenClassification.from_pretrained("Riksarkivet/oldbertur-diplomatic-old-icelandic-ner")

# Use aggregation_strategy="first" to properly combine subword tokens
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = "Jſland byɢþiſc fyrſt vr Norvegi a dogom Harallz Halfdanar ſonar"
results = ner_pipeline(text)

for entity in results:
    # entity['word'] can corrupt multi-byte UTF-8 characters; slice the
    # original text with the start/end offsets instead
    word = text[entity['start']:entity['end']]
    print(f"{word}: {entity['entity_group']} ({entity['score']:.3f})")
```
Expected output:

```
Jſland: Location (1.000)
Norvegi: Location (1.000)
Harallz Halfdanar ſonar: Person (1.000)
```
Training Data
This model uses the MR + I + MIM training configuration, combining:
| Source | Description |
|---|---|
| Menota (M) | Diplomatic Old Icelandic texts from the Medieval Nordic Text Archive |
| IcePaHC (I) | Icelandic Parsed Historical Corpus (normalised Old Icelandic texts) |
| MIM-GOLD-NER (MIM) | Modern Icelandic NER data for data augmentation |
The superscript R indicates that sentence-level class resampling (sCR) was applied to the diplomatic Menota data. This increases the amount of diplomatic data, while the normalised IcePaHC and modern MIM-GOLD-NER data augment the diplomatic target data.
Training set statistics:
| Configuration | Person | Location | Total |
|---|---|---|---|
| MR + I + MIM | 40,191 | 12,326 | 52,517 |
Entity breakdown by source:
| Source | Person | Location | Total |
|---|---|---|---|
| Menota (M) | 3,969 | 472 | 4,441 |
| IcePaHC (I) | 2,797 | 362 | 3,159 |
| MR after resampling | 20,485 | 2,622 | 23,107 |
| MIM-GOLD-NER | 15,587 | 9,002 | 24,589 |
Evaluation sets:
- Dev: 1,122 entities (912 Person; 210 Location)
- Test: 1,153 entities (922 Person; 231 Location)
The dev and test sets consist exclusively of diplomatic Old Icelandic texts in order to reflect our target domain.
Evaluation Results
| Metric | Score |
|---|---|
| F1 | 0.79 |
| Precision | 0.77 |
| Recall | 0.81 |
Labels
The model uses BIO tagging with the following labels:
| Label | Description |
|---|---|
| O | Outside any entity |
| B-Person | Beginning of a person name |
| I-Person | Inside/continuation of a person name |
| B-Location | Beginning of a location name |
| I-Location | Inside/continuation of a location name |
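As a toy illustration, the example sentence from the usage section would receive the following word-level tags (a hypothetical alignment assuming whitespace tokenisation; the model itself operates on subwords and aggregates them back to words):

```python
# Hypothetical word-level BIO tagging of the usage example, assuming
# simple whitespace tokenisation.
tokens = ["Jſland", "byɢþiſc", "fyrſt", "vr", "Norvegi",
          "a", "dogom", "Harallz", "Halfdanar", "ſonar"]
tags = ["B-Location", "O", "O", "O", "B-Location",
        "O", "O", "B-Person", "I-Person", "I-Person"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Multi-word names such as "Harallz Halfdanar ſonar" get one B- tag followed by I- tags, which is how the pipeline's aggregation groups them into a single Person entity.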
Limitations
- Orthography: This model is trained on diplomatic texts. For normalised transcriptions, use oldbertur-normalised-old-icelandic-ner.
- Entity types: Only Person and Location entities are supported. Other entity types (organisations, dates, etc.) are not recognised due to scarcity in the training data.
- Time period: Primarily trained on texts from 1250-1400 CE. Performance may vary on texts from other periods.
- Domain: Optimised for saga literature and historical texts. May perform differently on other text types.
Domain-Adaptation
To adapt the underlying model, IceBERT, which is pre-trained on modern texts, we use task-adaptive pre-training (TAPT) to familiarise the model with Old Icelandic diplomatic texts prior to fine-tuning for NER. We use a corpus of 15 unannotated diplomatic Old Icelandic texts from Menota for the domain-adaptation.
Hyperparameters:
- Base model: mideind/IceBERT
- Epochs: 8
- Learning rate: 3e-5
- Batch size: 32
- Max sequence length: 256 tokens
- Warm-up ratio: 6%
- MLM masking probability: 15%
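The masking step at the core of the MLM objective can be sketched in plain PyTorch (a toy illustration of the standard BERT recipe with a 15% masking probability; in practice this step is handled by a data collator such as transformers' `DataCollatorForLanguageModeling`):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size,
                mlm_prob=0.15, seed=0):
    """Select ~15% of positions as MLM targets.

    Of the selected positions, 80% are replaced with [MASK], 10% with
    a random token, and 10% are left unchanged (the standard BERT
    recipe). Returns (masked_inputs, labels) where labels are -100 at
    unselected positions so the loss ignores them.
    """
    g = torch.Generator().manual_seed(seed)
    labels = input_ids.clone()
    selected = torch.bernoulli(
        torch.full(input_ids.shape, mlm_prob), generator=g).bool()
    labels[~selected] = -100  # ignored by cross-entropy

    masked = input_ids.clone()
    replace = torch.bernoulli(
        torch.full(input_ids.shape, 0.8), generator=g).bool() & selected
    masked[replace] = mask_token_id
    randomize = torch.bernoulli(
        torch.full(input_ids.shape, 0.5), generator=g).bool() & selected & ~replace
    masked[randomize] = torch.randint(
        vocab_size, input_ids.shape, generator=g)[randomize]
    return masked, labels

ids = torch.arange(100, 120).unsqueeze(0)  # one toy sequence of 20 token ids
masked, labels = mask_tokens(ids, mask_token_id=4, vocab_size=50000)
print((labels != -100).float().mean())  # fraction of positions in the loss
```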
NER Training Procedure
NER is framed as a token classification task, with a classification head added on top of the domain-adapted IceBERT.
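A minimal sketch of that setup (IceBERT is a RoBERTa-style model; the head below is randomly initialised for illustration, whereas the released model starts from the domain-adapted weights):

```python
from transformers import RobertaConfig, RobertaForTokenClassification

labels = ["O", "B-Person", "I-Person", "B-Location", "I-Location"]
config = RobertaConfig(
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# A linear classification head over the five BIO labels is added on
# top of the encoder's per-token hidden states.
model = RobertaForTokenClassification(config)
print(model.num_labels)  # 5
```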
Hyperparameters:
- Epochs: 5
- Learning rate: 2e-5
- Batch size: 16
- Max sequence length: 256 tokens
- Warm-up ratio: 10%
- Weight decay: 0.01
Class imbalance handling:
- Weighted cross-entropy loss with class weights: 0.1 for non-entities (O), 30.0 for entity classes to address class imbalance in the training data.
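The weighted loss can be sketched as follows (a minimal illustration; the label order [O, B-Person, I-Person, B-Location, I-Location] is an assumption made for this example):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed label order: O first, then the four entity tags.
# O is down-weighted to 0.1; entity classes are up-weighted to 30.0.
class_weights = torch.tensor([0.1, 30.0, 30.0, 30.0, 30.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

# Toy batch: logits of shape (num_tokens, num_labels) and one gold
# label per token; -100 marks positions ignored by the loss
# (e.g. subword continuations or padding).
logits = torch.randn(8, 5)
labels = torch.tensor([0, 0, 1, 2, 0, 3, 0, -100])
loss = loss_fn(logits, labels)
print(loss.item())
```

With these weights, errors on the rare entity tokens dominate the loss, counteracting the overwhelming share of O tokens in the training data.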
Citation
If you use this model, please cite:
@misc{coming_soon,
}
Resources
Our code and data are available in our GitHub repository; for a full description of the work, see our paper (coming soon).
- Code & Data: GitHub Repository
- Normalised NER Model: oldbertur-normalised-old-icelandic-ner
- Paper: Coming soon.
- Base Model: mideind/IceBERT
Acknowledgments
We are grateful for the work carried out by the projects below and for the opportunity to use their data in our academic research and in developing NER models for Medieval Icelandic. We thank the developers, annotators, scholars, project managers, and everyone else who has contributed to these projects. We also express our sincere gratitude to the students from Uppsala University who assisted in marking and annotating entities in the two Menota works Codex Wormianus (AM 242 fol) and Vǫluspá in Hauksbók (AM 544 4to).
- Menota: The Menota project
- Icelandic Parsed Historical Corpus (IcePaHC): Wallenberg et al., 2024
- IceBERT base model: Snæbjarnarson et al. 2022
- MIM-GOLD-NER: Ingólfsdóttir et al. 2020
License
This model is released under the GNU General Public License v3.0 (GPL-3.0).
Contact
For questions or issues, please open an issue on the GitHub repository or contact: phenningsson@me.com