SP tokenizer missing mode tokens

by keremturgutlu - opened Apr 7, 2023

Simply load with sp_model = spm.SentencePieceProcessor(model_file="spiece.model") and run:

sp_model.piece_to_id("[NLG]")
sp_model.piece_to_id("[S2S]")
sp_model.piece_to_id("[NLU]")

All three map to <unk>.


It turns out that these are not special tokens in the vocab but plain text, used like a prefix prompt, so each one gets split into several ordinary pieces. A bit wasteful I guess :)
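
For anyone who wants to confirm this on their own copy of the model, here is a minimal sketch. The helper name report_mode_tokens is made up for illustration, and it assumes sp_model is a loaded sentencepiece.SentencePieceProcessor. It relies only on piece_to_id, unk_id and encode(..., out_type=str). For each mode tag that is not a single vocab entry, it returns the pieces the plain text is segmented into.

```python
def report_mode_tokens(sp_model, pieces=("[NLG]", "[S2S]", "[NLU]")):
    """Map each tag that is NOT a single vocab entry to its actual segmentation.

    Hypothetical helper: sp_model is assumed to behave like a loaded
    sentencepiece.SentencePieceProcessor.
    """
    unk_id = sp_model.unk_id()
    report = {}
    for piece in pieces:
        # piece_to_id falls back to the <unk> id when the piece is not in the vocab.
        if sp_model.piece_to_id(piece) == unk_id:
            # Show how the tag is split when treated as plain text.
            report[piece] = sp_model.encode(piece, out_type=str)
    return report


# Usage, assuming spiece.model sits in the working directory:
#   import sentencepiece as spm
#   sp_model = spm.SentencePieceProcessor(model_file="spiece.model")
#   print(report_mode_tokens(sp_model))
```

A tag that appears in the returned dict with more than one piece costs that many tokens every time it is used as a prefix.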
