response-toxicity-classifier-base
A BERT classifier from Skoltech, fine-tuned on contextual data with four labels.
Training
Skoltech/russian-inappropriate-messages was fine-tuned on multiclass data with four classes (check the exact mapping between index and label in model.config; see the snippet after this list).
- OK label — the message is acceptable in the given context and is not intended to offend or otherwise harm the speaker's reputation.
- Toxic label — the message might be seen as offensive in the given context.
- Severe toxic label — the message is offensive, full of anger, and was written to provoke a fight or other discomfort.
- Risks label — the message touches on sensitive topics (e.g. religion, politics) and can harm the speaker's reputation.
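A minimal sketch for inspecting that mapping without downloading the full model weights; the exact label strings printed depend on the published checkpoint:

from transformers import AutoConfig

# Download only the config and print the index-to-label mapping
config = AutoConfig.from_pretrained('tinkoff-ai/response-toxicity-classifier-base')
print(config.id2label)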
The model was fine-tuned on soon-to-be-posted dialog datasets.
Evaluation results
The model achieves the following results on the validation datasets (to be posted soon):
|  | OK - F1-score | TOXIC - F1-score | SEVERE TOXIC - F1-score | RISKS - F1-score |
| --- | --- | --- | --- | --- |
| internet dialogs | 0.896 | 0.348 | 0.490 | 0.591 |
| chatbot dialogs | 0.940 | 0.295 | 0.729 | 0.46 |
Use in transformers
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('tinkoff-ai/response-toxicity-classifier-base')
model = AutoModelForSequenceClassification.from_pretrained('tinkoff-ai/response-toxicity-classifier-base')

# The input packs the dialog context and the response to classify into one string:
# [CLS] context turns separated by [SEP] ... [RESPONSE_TOKEN] response to classify
# (the Russian example reads: 'hi' / 'hi!' / 'how are you?' -> response 'fine, and you?')
inputs = tokenizer('[CLS]привет[SEP]привет![SEP]как дела?[RESPONSE_TOKEN]норм, у тя как?', max_length=128, add_special_tokens=False, return_tensors='pt')

with torch.inference_mode():
    logits = model(**inputs).logits
    probas = torch.softmax(logits, dim=-1)[0].cpu().detach().numpy()
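As a small follow-up sketch, the predicted class can be read off from probas; the label names come from model.config.id2label, so they match whatever mapping the checkpoint ships with:

# Pick the most likely class and map its index back to a label name
predicted_idx = int(probas.argmax())
print(model.config.id2label[predicted_idx], float(probas[predicted_idx]))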
The work was done during an internship at Tinkoff by Nikita Stepanov, mentored by Alexander Markov.