🏔️ Kashmiri Tokenizer Suite
First systematic tokenization infrastructure for Kashmiri (ISO 639-3: kas)
Five tokenizer architectures trained on KS-LIT-3M, a 3.1M-word Kashmiri corpus.
| Folder | Architecture | Vocab | Best For |
|---|---|---|---|
| kashmiri_char_tokenizer/ | Character-Level | ~120 | ASR, OCR, char-level LM |
| kashmiri_word_tokenizer/ | Word-Level | ~120K | Lookup, bag-of-words |
| kashmiri_wordpiece_tokenizer/ | WordPiece (BERT) | 32K | NER, POS, classification |
| kashmiri_bpe_tokenizer/ | BPE (GPT) | 32K | Text generation, MT |
| kashmiri_unigram_tokenizer/ | Unigram (SentencePiece) | 32K | Multilingual, NMT |
Quick Start
from tokenizers import Tokenizer
from huggingface_hub import hf_hub_download
# Load WordPiece (recommended for most tasks)
path = hf_hub_download(
    repo_id="Omarrran/Kashmiri_Tokenizers",
    filename="kashmiri_wordpiece_tokenizer/tokenizer.json",
)
tok = Tokenizer.from_file(path)
encoded = tok.encode("کٲشِر زَبان چھِیہٕ بُہُت خٲص")
print(encoded.tokens)
print(tok.decode(encoded.ids))
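The same pattern should extend to the other four folders in the table. The sketch below is a hedged illustration, not part of the published suite: the helper names `tokenizer_filename` and `load_tokenizer` are hypothetical, and it assumes every folder ships a `tokenizer.json` the way the WordPiece folder does, which this card only confirms for WordPiece.

```python
# Hypothetical helpers for loading any tokenizer in the suite by architecture name.
# ASSUMPTION: each folder contains a tokenizer.json file; the card only shows
# this for kashmiri_wordpiece_tokenizer/.

REPO_ID = "Omarrran/Kashmiri_Tokenizers"
ARCHITECTURES = ("char", "word", "wordpiece", "bpe", "unigram")


def tokenizer_filename(arch: str) -> str:
    """Build the in-repo path for an architecture, following the folder table."""
    if arch not in ARCHITECTURES:
        raise ValueError(f"unknown architecture {arch!r}; expected one of {ARCHITECTURES}")
    return f"kashmiri_{arch}_tokenizer/tokenizer.json"


def load_tokenizer(arch: str):
    """Download (or reuse the cached copy of) one tokenizer and load it."""
    # Imported lazily so the path helper above has no third-party dependencies.
    from huggingface_hub import hf_hub_download
    from tokenizers import Tokenizer

    path = hf_hub_download(repo_id=REPO_ID, filename=tokenizer_filename(arch))
    return Tokenizer.from_file(path)
```

Calling `load_tokenizer("bpe").encode(text).tokens` would then mirror the WordPiece example above, provided the assumed file layout holds.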
Novel Metrics Introduced
- Diacritic Preservation Score (DPS) — OOV-penalized; measures diacritic attachment
- Morphological Boundary Alignment (MBA) — IoU vs gold morpheme boundaries
- Composite Quality Score (CQS) — weighted ranking across all metrics
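This card does not give formulas for these metrics, so the following is only a rough sketch of the ideas behind DPS and MBA under stated assumptions. `diacritic_attachment_rate` treats a diacritic as "attached" when it shares a token with a preceding base character, and it omits the OOV penalty entirely. `boundary_iou` computes intersection-over-union between the cut positions of a predicted segmentation and a gold one. Both function names are hypothetical, and both expect raw surface pieces (any continuation prefix such as `##` stripped beforehand).

```python
import unicodedata


def diacritic_attachment_rate(tokens):
    """Fraction of combining marks that follow a base character inside the same token.

    ASSUMPTION: a simplified stand-in for the diacritic-attachment idea in DPS;
    the OOV penalty mentioned in the card is not modelled here.
    """
    total = attached = 0
    for token in tokens:
        seen_base = False
        for ch in token:
            if unicodedata.category(ch).startswith("M"):  # Mn, Mc, Me
                total += 1
                if seen_base:
                    attached += 1
            else:
                seen_base = True
    return 1.0 if total == 0 else attached / total


def boundary_offsets(pieces):
    """Character offsets of the internal cuts in a segmentation of one word."""
    cuts, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        cuts.add(pos)
    return cuts


def boundary_iou(predicted, gold):
    """IoU between predicted and gold cut positions for the same word.

    ASSUMPTION: one plausible reading of "IoU vs gold morpheme boundaries".
    """
    if "".join(predicted) != "".join(gold):
        raise ValueError("segmentations must cover the same string")
    p, g = boundary_offsets(predicted), boundary_offsets(gold)
    if not p and not g:
        return 1.0  # neither side splits the word: perfect agreement
    return len(p & g) / len(p | g)
```

As a placeholder illustration in English rather than Kashmiri, `boundary_iou(["un", "breakable"], ["un", "break", "able"])` yields 0.5: the predicted cut after "un" matches one of the two gold cuts, and the union contains two cuts in total.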
Citation
@misc{malik2025kashmiritokenizer,
title = {A Comprehensive Tokenization Study for Kashmiri},
author = {Malik, Haq Nawaz},
year = {2025},
url = {https://huggingface.co/Omarrran/Kashmiri_Tokenizers}
}
Part of the Kashmiri NLP Infrastructure Project
KS-LIT-3M corpus · KashScript transliteration · Kashmiri OCR dataset ·
Kashmiri TTS dataset · This tokenizer suite ← you are here