🏔️ Kashmiri Tokenizer Suite

The first systematic tokenization infrastructure for Kashmiri (ISO 639-3: kas):
five tokenizer architectures trained on the 3.1M-word KS-LIT-3M corpus.

Folder                          Architecture              Vocab   Best For
kashmiri_char_tokenizer/        Character-Level           ~120    ASR, OCR, char-level LM
kashmiri_word_tokenizer/        Word-Level                ~120K   Lookup, bag-of-words
kashmiri_wordpiece_tokenizer/   WordPiece (BERT)          32K     NER, POS, classification
kashmiri_bpe_tokenizer/         BPE (GPT)                 32K     Text generation, MT
kashmiri_unigram_tokenizer/     Unigram (SentencePiece)   32K     Multilingual, NMT

Quick Start

from tokenizers import Tokenizer
from huggingface_hub import hf_hub_download

# Load WordPiece (recommended for most tasks)
path = hf_hub_download(
    repo_id="Omarrran/Kashmiri_Tokenizers",
    filename="kashmiri_wordpiece_tokenizer/tokenizer.json")
tok = Tokenizer.from_file(path)

encoded = tok.encode("کٲشِر زَبان چھِیہٕ بُہُت خٲص")
print(encoded.tokens)
print(tok.decode(encoded.ids))
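
The other four tokenizers can presumably be loaded the same way. The sketch below is not part of the suite: `TOKENIZER_FOLDERS`, `tokenizer_filename`, and `load_kashmiri_tokenizer` are hypothetical helper names, and it assumes every folder ships a `tokenizer.json` like the WordPiece folder does, which the repository may or may not guarantee.

```python
# Folder names are taken from the table above. The "tokenizer.json" file name
# inside each folder is an ASSUMPTION, copied from the WordPiece example.
TOKENIZER_FOLDERS = {
    "char": "kashmiri_char_tokenizer",
    "word": "kashmiri_word_tokenizer",
    "wordpiece": "kashmiri_wordpiece_tokenizer",
    "bpe": "kashmiri_bpe_tokenizer",
    "unigram": "kashmiri_unigram_tokenizer",
}


def tokenizer_filename(kind: str) -> str:
    """Return the assumed in-repo path of the tokenizer file for `kind`."""
    try:
        folder = TOKENIZER_FOLDERS[kind]
    except KeyError:
        raise ValueError(f"unknown tokenizer kind: {kind!r}") from None
    return f"{folder}/tokenizer.json"


def load_kashmiri_tokenizer(kind: str = "wordpiece"):
    """Download (or reuse the cached copy of) one tokenizer and load it."""
    # Imported lazily so the path helper above works without these packages.
    from huggingface_hub import hf_hub_download
    from tokenizers import Tokenizer

    path = hf_hub_download(
        repo_id="Omarrran/Kashmiri_Tokenizers",
        filename=tokenizer_filename(kind),
    )
    return Tokenizer.from_file(path)
```

With these helpers, `load_kashmiri_tokenizer("bpe").encode(text).tokens` would give the BPE segmentation of `text`, which makes it easy to compare how the five architectures split the same sentence.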

Novel Metrics Introduced

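  • Diacritic Preservation Score (DPS): measures whether diacritics stay attached to their base characters, with a penalty for out-of-vocabulary (OOV) tokens
  • Morphological Boundary Alignment (MBA): intersection-over-union (IoU) of token boundaries against gold morpheme boundaries
  • Composite Quality Score (CQS): weighted ranking across all metrics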
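
To make the first two metrics concrete, here is a rough sketch of how they could be computed. It is an illustration only: this card does not give the exact formulas, so the definitions below (boundary sets as character offsets, diacritics as Unicode category `Mn`, a `##` continuation prefix, a `[UNK]` token) and the function names are all assumptions, and the English example word is just a placeholder for a gold-segmented Kashmiri word.

```python
import unicodedata


def boundaries(segments):
    """Internal split offsets of a segmented word, e.g. ["ab", "c"] -> {2}."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_iou(predicted, gold):
    """MBA-style score (assumed form): IoU of predicted vs gold boundaries."""
    p, g = boundaries(predicted), boundaries(gold)
    if not p and not g:
        return 1.0  # both leave the word unsplit: perfect agreement
    return len(p & g) / len(p | g)


def diacritic_preservation(text, tokens, unk_token="[UNK]", prefix="##"):
    """DPS-style score (assumed form): share of combining marks in `text`
    that end up in a token after a base character. Marks swallowed by an
    unknown token are never counted as kept, which penalizes OOV output."""
    def is_mark(ch):
        return unicodedata.category(ch) == "Mn"

    total = sum(1 for ch in text if is_mark(ch))
    if total == 0:
        return 1.0  # nothing to preserve
    kept = 0
    for tok in tokens:
        if tok == unk_token:
            continue
        piece = tok[len(prefix):] if tok.startswith(prefix) else tok
        seen_base = False
        for ch in piece:
            if is_mark(ch):
                if seen_base:
                    kept += 1
            else:
                seen_base = True
    return kept / total


# Placeholder word with gold morphemes un + break + able.
print(boundary_iou(["unbreak", "able"], ["un", "break", "able"]))
# "e" + acute accent, "a" + grave accent; the first mark is split off its base.
print(diacritic_preservation("e\u0301a\u0300", ["e", "##\u0301a\u0300"]))
```

Under these assumed definitions a tokenizer scores 1.0 on a word only when its splits coincide exactly with the gold morpheme boundaries, and loses diacritic credit whenever a mark starts a token or disappears into an unknown token.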

Citation

@misc{malik2025kashmiritokenizer,
  title  = {A Comprehensive Tokenization Study for Kashmiri},
  author = {Malik, Haq Nawaz},
  year   = {2025},
  url    = {https://huggingface.co/Omarrran/Kashmiri_Tokenizers}
}

Part of the Kashmiri NLP Infrastructure Project

KS-LIT-3M corpus · KashScript transliteration · Kashmiri OCR dataset ·
Kashmiri TTS dataset · This tokenizer suite ← you are here
