Clarification on the regex of the tokenizer configuration
#15, opened by antoine-agthe-unity
Your JSON tokenization config uses the following regex to split the input.
" ?[^(\s|[.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c])]+"
or, with the escapes expanded, \s?[^(\s|[.,!?…。,、।۔،])]+
I don't understand why you have nested brackets for [.,!?…。,、।۔،].
Why do you separate \s from the rest of the characters?
Also, why do you try to capture (with parentheses)?
Wouldn't it be the same as " ?[^\s.,!?…。,、।۔،]+" ?
Maybe there is a fancy regex pattern I don't know.
For context, I am trying to load this configuration in C# on .NET, and the standard regex engine does not understand this regex.
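
To make the question concrete, here is a minimal sketch in Python rather than C#, on the assumption that Python's `re` module is a fair stand-in because it also has no nested character classes. The names `ORIGINAL`, `FLATTENED`, `compiles` and `split_words` are hypothetical and only serve the illustration. The sketch checks whether each pattern compiles and shows what the flattened one matches on a sample sentence.

```python
import re

# Pattern as quoted from the tokenizer config (nested brackets and parentheses).
ORIGINAL = r" ?[^(\s|[.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c])]+"

# Flattened variant proposed in this post: a single negated character class.
FLATTENED = r" ?[^\s.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c]+"


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def split_words(text):
    """Collect every match of the flattened pattern, in order."""
    return re.findall(FLATTENED, text)


if __name__ == "__main__":
    print(compiles(ORIGINAL))   # False: the class ends at the first "]", leaving a stray ")"
    print(compiles(FLATTENED))  # True
    print(split_words("Hello, world! How are you?"))
    # ['Hello', ' world', ' How', ' are', ' you']
```

In an engine that treats nested brackets as a set union, the original class would presumably also exclude the literal `(`, `|` and `)` characters, so the flattened pattern may not be strictly identical there.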