Improvement to SimpleTokenizer #1017

@6801318d8d

Description

The implementation of SimpleTokenizer splits on a set of delimiters, removes white space, builds the vocabulary, and then, to decode, joins the tokens by inserting a white space between every two elements.

This leads to the improper insertion of white spaces around delimiters that were not white space in the original text.

For instance, "hello--world" would be encoded and decoded as "hello -- world", with two additional white spaces which were not there.

Instead, this implementation preserves the white space in the vocabulary, removes only the empty strings produced by the split, and joins on the empty string, recovering the original string exactly.
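This works because the split pattern is wrapped in a capturing group, so re.split keeps the matched delimiters (including white space) in the output list; joining the pieces on the empty string therefore reproduces the input. A quick sketch (the sample string is illustrative):

```python
import re

raw = re.split(r'([,.:;?_!"()\'\s]|--)', "hello--world, friend")
# raw == ['hello', '--', 'world', ',', '', ' ', 'friend']
# Empty strings appear between adjacent delimiters; dropping them
# does not change the reconstruction.
parts = [p for p in raw if p]
print("".join(parts))  # 'hello--world, friend'
```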

import re

class SimpleTokenizerV3:

    # Capturing group: re.split keeps the matched delimiters in the output.
    split_regex = r'([,.:;?_!"()\'\s]|--)'

    def __init__(self, raw_text):
        preprocessed = re.split(self.split_regex, raw_text)
        # Drop only empty strings; whitespace tokens stay in the vocabulary.
        preprocessed = [item for item in preprocessed if item]
        all_tokens = sorted(set(preprocessed))
        all_tokens.extend(["<|endoftext|>", "<|unk|>"])
        self.str_to_int = {token: integer for integer, token in enumerate(all_tokens)}
        self.int_to_str = {i: s for s, i in self.str_to_int.items()}

    def encode(self, text):
        preprocessed = re.split(self.split_regex, text)
        preprocessed = [item for item in preprocessed if item]
        # Map out-of-vocabulary tokens to <|unk|>.
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]
        return [self.str_to_int[s] for s in preprocessed]

    def decode(self, ids):
        # Join on the empty string: whitespace is already in the tokens.
        return "".join(self.int_to_str[i] for i in ids)
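To see the difference the change makes, compare joining on a single space after discarding whitespace tokens (the old strategy, sketched here rather than the exact original code) against keeping whitespace tokens and joining on the empty string:

```python
import re

SPLIT_REGEX = r'([,.:;?_!"()\'\s]|--)'

text = "hello--world, friend"
parts = [p for p in re.split(SPLIT_REGEX, text) if p]

# Old strategy: drop whitespace tokens, rejoin with single spaces.
old_tokens = [p for p in parts if not p.isspace()]
old_decoded = " ".join(old_tokens)

# New strategy: keep whitespace tokens, rejoin on the empty string.
new_decoded = "".join(parts)

print(old_decoded)  # 'hello -- world , friend'
print(new_decoded)  # 'hello--world, friend'
```

Only the second strategy is lossless: the round trip returns the input byte for byte.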

Labels: question (Further information is requested)