Clarification on the regex of the tokenizer configuration

#15
by antoine-agthe-unity - opened

Your JSON tokenization config uses the following regex to split the input.

" ?[^(\s|[.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c])]+"
or \s?[^(\s|[.,!?…。,、।۔،])]+

I don't understand why the brackets are nested for [.,!?…。,、।۔،].
Why do you separate \s from the rest of the characters?
Also, why do you try to capture (with parentheses)?

Wouldn't it be the same as " ?[^\s.,!?…。,、।۔،]+"?
Maybe there is a fancy regex feature I don't know about.

For context, I am trying to load this configuration with .NET C#, and the standard regex engine does not understand this regex.
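
To make the question concrete, here is a minimal sketch in Python, whose `re` module also treats a `[` inside a character class as a plain literal (as far as I can tell, the same as .NET). It only checks whether each pattern compiles and what the flattened one matches. Python is my choice for illustration, not the engine the config targets, and the names `ORIGINAL`, `FLATTENED`, `compiles`, and `split_words` are hypothetical.

```python
import re

# Pattern as written in the tokenizer config (copied from the post above).
ORIGINAL = r" ?[^(\s|[.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c])]+"

# Flattened form proposed above: one negated class, no nesting, no group.
FLATTENED = r" ?[^\s.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c]+"


def compiles(pattern: str) -> bool:
    """Return True if Python's `re` module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Without nested classes, the class ends at the first `]`,
        # which leaves the following `)` unbalanced.
        return False


def split_words(text: str) -> list[str]:
    """Return the pieces matched by the flattened pattern."""
    return re.findall(FLATTENED, text)


if __name__ == "__main__":
    print("original compiles: ", compiles(ORIGINAL))
    print("flattened compiles:", compiles(FLATTENED))
    print(split_words("Hello, world! How are you?"))
```

One caveat on equivalence: in engines that read nested brackets as a union (the Rust regex crate, for example), the original class would presumably also exclude the literal characters `(`, `|` and `)`, so the flattened form only matches it exactly if those three are added to the class.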
