Clarification on the regex of the tokenizer configuration

#15

by antoine-agthe-unity - opened 9 days ago

9 days ago

Your JSON tokenization config uses the following regex to split the input.

" ?[^(\s|[.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c])]+"
or \s?[^(\s|[.,!?…。，、।۔،])]+

I don't understand why you have nested brackets for [.,!?…。，、।۔،]
Why do you separate \s from the rest of the characters.
Also also why do you try to capture (with parenthesis)?

Wouldn't it be the same as ?[^\s.,!?…。，、।۔،]+ ?
Maybe there is a fancy regex pattern I don't know.

For the context, I try to load this configuration with .NET C# and the standard regex engine doesn't not understand this regex.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment