The SimpleTokenizer implementation splits text on a set of delimiters, removes the whitespace tokens, builds the vocabulary, and then decodes by joining tokens with a single space between every two elements.
This inserts spurious spaces around delimiters that were not originally surrounded by whitespace.
For instance, "hello--world" is encoded and then decoded as "hello -- world", with two extra spaces that were never in the input.
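The spurious spaces can be reproduced with a minimal sketch of that decoding step (a hypothetical reconstruction of the earlier behavior, not the original code):

```python
import re

# Split, strip whitespace tokens, then decode by joining with spaces,
# as the earlier tokenizer did.
tokens = re.split(r'([,.:;?_!"()\'\s]|--)', "hello--world")
tokens = [t.strip() for t in tokens if t.strip()]
decoded = " ".join(tokens)
print(decoded)  # hello -- world  (spaces that were never in the input)
```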
Instead, this implementation keeps the whitespace tokens in the vocabulary, filters out only the empty strings produced by re.split, and joins on the empty string when decoding, recovering the original text exactly.
import re

class SimpleTokenizerV3:
    # The capturing group keeps the delimiters (including whitespace
    # and "--") in the token stream instead of discarding them.
    split_regex = r'([,.:;?_!"()\'\s]|--)'

    def __init__(self, raw_text):
        preprocessed = re.split(self.split_regex, raw_text)
        # Drop only the empty strings re.split produces between
        # adjacent delimiters; whitespace tokens are kept.
        preprocessed = [item for item in preprocessed if item]
        all_tokens = sorted(set(preprocessed))
        all_tokens.extend(["<|endoftext|>", "<|unk|>"])
        vocab = {token: integer for integer, token in enumerate(all_tokens)}
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(self.split_regex, text)
        preprocessed = [item for item in preprocessed if item]
        # Map out-of-vocabulary tokens to <|unk|>.
        preprocessed = [
            item if item in self.str_to_int else "<|unk|>"
            for item in preprocessed
        ]
        return [self.str_to_int[s] for s in preprocessed]

    def decode(self, ids):
        # Joining on the empty string restores the original spacing,
        # since whitespace tokens were never removed.
        return "".join(self.int_to_str[i] for i in ids)
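The round-trip property rests on the split-and-filter step alone, so it can be checked in isolation; a minimal sketch (the sample string is made up):

```python
import re

split_regex = r'([,.:;?_!"()\'\s]|--)'
text = "hello--world, again"
tokens = re.split(split_regex, text)
# re.split emits empty strings between adjacent delimiters; drop only those.
tokens = [t for t in tokens if t]
print(tokens)  # ['hello', '--', 'world', ',', ' ', 'again']
# Joining on "" recovers the input exactly, whitespace and all.
assert "".join(tokens) == text
```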