Improvement to SimpleTokenizer #1017

@6801318d8d

Description

The implementation of SimpleTokenizer splits on a set of delimiters, removes white space, builds the vocabulary, and then, to decode, joins the tokens by inserting a white space between every two elements.

This leads to the improper insertion of white spaces around delimiters that were not white space in the original text.

For instance, "hello--world" would be encoded and decoded as "hello -- world", with two additional white spaces which were not there.

Instead, this implementation preserves the white space in the vocabulary, removes only the empty strings produced by the split, and joins on the empty string, recovering the original string exactly.
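This works because the split pattern is wrapped in a capturing group, so re.split keeps the matched delimiters (including white space) in the output list; joining the pieces on the empty string therefore reproduces the input. A quick sketch (the sample string is illustrative):

```python
import re

raw = re.split(r'([,.:;?_!"()\'\s]|--)', "hello--world, friend")
# raw == ['hello', '--', 'world', ',', '', ' ', 'friend']
# Empty strings appear between adjacent delimiters; dropping them
# does not change the reconstruction.
parts = [p for p in raw if p]
print("".join(parts))  # 'hello--world, friend'
```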

import re

class SimpleTokenizerV3:

    # Capturing group: re.split keeps the matched delimiters in the output.
    split_regex = r'([,.:;?_!"()\'\s]|--)'

    def __init__(self, raw_text):
        preprocessed = re.split(self.split_regex, raw_text)
        # Drop only empty strings; whitespace tokens stay in the vocabulary.
        preprocessed = [item for item in preprocessed if item]
        all_tokens = sorted(set(preprocessed))
        all_tokens.extend(["<|endoftext|>", "<|unk|>"])
        self.str_to_int = {token: integer for integer, token in enumerate(all_tokens)}
        self.int_to_str = {i: s for s, i in self.str_to_int.items()}

    def encode(self, text):
        preprocessed = re.split(self.split_regex, text)
        preprocessed = [item for item in preprocessed if item]
        # Map out-of-vocabulary tokens to <|unk|>.
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]
        return [self.str_to_int[s] for s in preprocessed]

    def decode(self, ids):
        # Join on the empty string: whitespace is already in the tokens.
        return "".join(self.int_to_str[i] for i in ids)
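To see the difference the change makes, compare joining on a single space after discarding whitespace tokens (the old strategy, sketched here rather than the exact original code) against keeping whitespace tokens and joining on the empty string:

```python
import re

SPLIT_REGEX = r'([,.:;?_!"()\'\s]|--)'

text = "hello--world, friend"
parts = [p for p in re.split(SPLIT_REGEX, text) if p]

# Old strategy: drop whitespace tokens, rejoin with single spaces.
old_tokens = [p for p in parts if not p.isspace()]
old_decoded = " ".join(old_tokens)

# New strategy: keep whitespace tokens, rejoin on the empty string.
new_decoded = "".join(parts)

print(old_decoded)  # 'hello -- world , friend'
print(new_decoded)  # 'hello--world, friend'
```

Only the second strategy is lossless: the round trip returns the input byte for byte.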

Labels: question (Further information is requested)