Tokenization issue

by sh0416 - opened Aug 7, 2023

Aug 7, 2023

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
assert "from ." == tokenizer.decode(tokenizer("from .")["input_ids"], skip_special_tokens=True, cleanup_tokenization_spaces=False)

Raise an assertion error. I suspect that the encoding process remove space.. How to handle it?

sh0416

Aug 7, 2023

Oh, there is a typo in option.. clean_up_tokenization_spaces=False works great. Sorry for the confusion.

sh0416 changed discussion status to closed Aug 7, 2023

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment