OldBERTur: Named Entity Recognition for Diplomatic Old Icelandic

This model performs Named Entity Recognition (NER) on diplomatic Old Icelandic texts sourced from medieval manuscripts, identifying Person and Location entities.

Model Description

  • Model type: Token classification (NER)
  • Base model: mideind/IceBERT
  • Language: Old Icelandic (diplomatic transcription level)
  • Entity types: Person, Location (BIO tagging scheme)
  • F1 Score: 0.79

The model is fine-tuned for NER as a token-classification task on top of a domain-adapted version of IceBERT, which was adapted through Masked Language Modelling (MLM) on a corpus of 15 Old Icelandic texts. It is designed for diplomatic Old Icelandic texts as defined in the Menota diplomatic transcription level description.

For normalised transcriptions of Old Icelandic texts, please use oldbertur-normalised-old-icelandic-ner instead.

Intended Uses

  • Named entity recognition in diplomatic Old Icelandic texts
  • Digital humanities research on Medieval Icelandic literature
  • Semi-automatic annotation of historical Icelandic documents
  • Information extraction from saga literature and historical texts

How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Riksarkivet/oldbertur-diplomatic-old-icelandic-ner")
model = AutoModelForTokenClassification.from_pretrained("Riksarkivet/oldbertur-diplomatic-old-icelandic-ner")

# Use aggregation_strategy="first" to properly combine subword tokens
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = "Jſland byɢþiſc fyrſt vr Norvegi a dogom Harallz Halfdanar ſonar"
results = ner_pipeline(text)

for entity in results:
    # entity['word'] corrupts multi-byte UTF-8 chars; use start/end offsets instead
    word = text[entity['start']:entity['end']]
    print(f"{word}: {entity['entity_group']} ({entity['score']:.3f})")

Expected output:

Jſland: Location (1.000)
Norvegi: Location (1.000)
Harallz Halfdanar ſonar: Person (1.000)

Training Data

This model uses the M^R + I + MIM training configuration, combining:

Source              Description
Menota (M)          Diplomatic Old Icelandic texts from the Medieval Nordic Text Archive
IcePaHC (I)         Icelandic Parsed Historical Corpus (normalised Old Icelandic texts)
MIM-GOLD-NER (MIM)  Modern Icelandic NER data for data augmentation

The superscript R indicates that sentence-level class resampling (sCR) was applied to the diplomatic Menota data. This increases the amount of diplomatic data, while the normalised IcePaHC and Modern Icelandic MIM-GOLD-NER data augment the diplomatic target data.
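As an illustration of the idea behind sentence-level resampling (not the exact sCR procedure, which may weight classes differently), oversampling sentences that contain entity annotations can be sketched as:

```python
from collections import Counter

def resample_entity_sentences(sentences, factor=5):
    """Oversample sentences containing at least one entity tag.

    `sentences` is a list of (tokens, bio_tags) pairs; sentences with
    at least one non-O tag are repeated `factor` times, inflating the
    share of entity-bearing training material.
    """
    resampled = []
    for tokens, tags in sentences:
        copies = factor if any(t != "O" for t in tags) else 1
        resampled.extend([(tokens, tags)] * copies)
    return resampled

data = [
    (["Jſland", "er", "eyland"], ["B-Location", "O", "O"]),
    (["þat", "var", "siþan"], ["O", "O", "O"]),
]
augmented = resample_entity_sentences(data, factor=3)
tag_counts = Counter(t for _, tags in augmented for t in tags)
```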

Training set statistics:

Configuration    Person    Location    Total
M^R + I + MIM    40,191    12,326      52,517

Entity breakdown by source:

Source                  Person    Location    Total
Menota (M)              3,969     472         4,441
IcePaHC (I)             2,797     362         3,159
M^R (after resampling)  20,485    2,622       23,107
MIM-GOLD-NER (MIM)      15,587    9,002       24,589

Evaluation sets:

  • Dev: 1,122 entities (912 Person; 210 Location)
  • Test: 1,153 entities (922 Person; 231 Location)

The dev and test sets consist exclusively of diplomatic Old Icelandic texts in order to reflect our target domain.

Evaluation Results

Metric       Score
F1           0.79
Precision    0.77
Recall       0.81
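As a quick sanity check, the reported F1 is the harmonic mean of the precision and recall above:

```python
# F1 = 2PR / (P + R)
precision, recall = 0.77, 0.81
f1 = 2 * precision * recall / (precision + recall)
# f1 ≈ 0.7895, which rounds to the reported 0.79
```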

Labels

The model uses BIO tagging with the following labels:

Label        Description
O            Outside any entity
B-Person     Beginning of a person name
I-Person     Inside/continuation of a person name
B-Location   Beginning of a location name
I-Location   Inside/continuation of a location name
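If you work with the raw per-token labels rather than the pipeline (whose aggregation_strategy does this for you), BIO tags can be grouped into entity spans with a minimal decoder such as:

```python
def bio_to_spans(tokens, tags):
    """Group a BIO tag sequence into (entity_type, text) spans.

    A B- tag opens a span; matching I- tags extend it; anything else
    closes it. Stray I- tags of a different type are treated as O here,
    which is one common convention, not necessarily the one the
    evaluation used.
    """
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(toks)) for etype, toks in spans]

tokens = ["Harallz", "Halfdanar", "ſonar", "for", "til", "Norvegi"]
tags = ["B-Person", "I-Person", "I-Person", "O", "O", "B-Location"]
spans = bio_to_spans(tokens, tags)
```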

Limitations

  • Orthography: This model is trained on diplomatic texts. For normalised transcriptions, use oldbertur-normalised-old-icelandic-ner.
  • Entity types: Only Person and Location entities are supported. Other entity types (organisations, dates, etc.) are not recognised due to scarcity in the training data.
  • Time period: Primarily trained on texts from 1250-1400 CE. Performance may vary on texts from other periods.
  • Domain: Optimised for saga literature and historical texts. May perform differently on other text types.

Domain Adaptation

To adapt the underlying model, IceBERT, which is trained on modern texts, we use task-adaptive pre-training (TAPT) to familiarise the model with Old Icelandic diplomatic texts before fine-tuning for NER. The adaptation uses a corpus of 15 unannotated diplomatic Old Icelandic texts from Menota.

Hyperparameters:

  • Base model: mideind/IceBERT
  • Epochs: 8
  • Learning rate: 3e-5
  • Batch size: 32
  • Max sequence length: 256 tokens
  • Warm-up ratio: 6%
  • MLM masking probability: 15%
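The 15% masking probability refers to the standard BERT-style MLM objective. A minimal pure-Python sketch of that objective (the 80/10/10 replacement split is the usual default; the card does not confirm those exact proportions):

```python
import random

def mask_for_mlm(token_ids, mask_id, vocab_size, mlm_prob=0.15, seed=0):
    """BERT-style masking: each position is selected with probability
    `mlm_prob` (15% here). Of the selected positions, 80% are replaced
    by [MASK], 10% by a random token, and 10% kept unchanged. Labels
    hold the original ids at selected positions and -100 (ignored by
    the loss) everywhere else.
    """
    rng = random.Random(seed)
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tid in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels[i] = tid
            r = rng.random()
            if r < 0.8:
                input_ids[i] = mask_id
            elif r < 0.9:
                input_ids[i] = rng.randrange(vocab_size)
            # else: keep the original token unchanged
    return input_ids, labels

ids = list(range(100))
masked_ids, mlm_labels = mask_for_mlm(ids, mask_id=4, vocab_size=50000)
```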

NER Training Procedure

NER is framed as a token classification task, with a classification head added on top of the domain-adapted IceBERT.

Hyperparameters:

  • Epochs: 5
  • Learning rate: 2e-5
  • Batch size: 16
  • Max sequence length: 256 tokens
  • Warm-up ratio: 10%
  • Weight decay: 0.01

Class imbalance handling:

  • Weighted cross-entropy loss with class weights of 0.1 for the non-entity class (O) and 30.0 for entity classes, countering the heavy skew towards O tokens in the training data.
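A pure-Python sketch of how these weights affect the loss at a single token position (the label order is an assumption for illustration):

```python
import math

# Assumed label order: [O, B-Person, I-Person, B-Location, I-Location]
CLASS_WEIGHTS = [0.1, 30.0, 30.0, 30.0, 30.0]

def weighted_ce(logits, target, weights=CLASS_WEIGHTS):
    """Weighted cross-entropy for one token: -w[y] * log softmax(logits)[y].

    With w = 0.1 for O and w = 30.0 for entity labels, errors on entity
    tokens are penalised 300x more heavily than errors on O tokens.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -weights[target] * (logits[target] - log_z)

# The same logits incur a tiny loss when the gold label is O,
# but a large one when the gold label is an entity class:
logits = [2.0, 0.0, 0.0, 0.0, 0.0]   # model strongly favours O
loss_o = weighted_ce(logits, 0)       # gold label is O
loss_entity = weighted_ce(logits, 1)  # gold label is B-Person
```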

Citation

If you use this model, please cite:

@misc{coming_soon,
}

Resources

For our code and data, see our GitHub repository. For more about the work in general, see our paper (coming soon).

Acknowledgments

We are grateful for the great work carried out by the Menota, IcePaHC, and MIM-GOLD-NER projects, and for making it possible for us to use their data to conduct our academic research and develop NER models for Medieval Icelandic. We thank the developers, annotators, scholars, project managers, and everyone else who has contributed to these projects. We also express our sincerest gratitude to the students from Uppsala University who assisted in marking and annotating entities in the two Menota works Codex Wormianus (AM 242 fol) and Vǫluspá in Hauksbók (AM 544 4to).

License

This model is released under the GNU General Public License v3.0 (GPL-3.0).

Contact

For questions or issues, please open an issue on the GitHub repository or contact: phenningsson@me.com
