🏔️ Kashmiri Tokenizer Suite

First systematic tokenization infrastructure for Kashmiri (ISO 639-3: kas)
Five tokenizer architectures trained on the KS-LIT-3M 3.1M-word corpus.

Folder	Architecture	Vocab	Best For
`kashmiri_char_tokenizer/`	Character-Level	~120	ASR, OCR, char-level LM
`kashmiri_word_tokenizer/`	Word-Level	~120K	Lookup, bag-of-words
`kashmiri_wordpiece_tokenizer/`	WordPiece (BERT)	32K	NER, POS, classification
`kashmiri_bpe_tokenizer/`	BPE (GPT)	32K	Text generation, MT
`kashmiri_unigram_tokenizer/`	Unigram (SentencePiece)	32K	Multilingual, NMT

Quick Start

from tokenizers import Tokenizer
from huggingface_hub import hf_hub_download

# Load WordPiece (recommended for most tasks)
path = hf_hub_download(
    repo_id="Omarrran/Kashmiri_Tokenizers",
    filename="kashmiri_wordpiece_tokenizer/tokenizer.json")
tok = Tokenizer.from_file(path)

encoded = tok.encode("کٲشِر زَبان چھِیہٕ بُہُت خٲص")
print(encoded.tokens)
print(tok.decode(encoded.ids))

Novel Metrics Introduced

Diacritic Preservation Score (DPS) — OOV-penalized; measures diacritic attachment
Morphological Boundary Alignment (MBA) — IoU vs gold morpheme boundaries
Composite Quality Score (CQS) — weighted ranking across all metrics

Citation

@misc{malik2025kashmiritokenizer,
  title  = {A Comprehensive Tokenization Study for Kashmiri},
  author = {Malik, Haq Nawaz},
  year   = {2025},
  url    = {https://huggingface.co/Omarrran/Kashmiri_Tokenizers}
}

Part of the Kashmiri NLP Infrastructure Project

KS-LIT-3M corpus · KashScript transliteration · Kashmiri OCR dataset ·
Kashmiri TTS dataset · This tokenizer suite ← you are here

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support