Baseline_BERTimbau-large_Vote_Identification-Council-PT

Model Description

This model is a Named Entity Recognition (NER) system specifically designed for extracting voting information from Portuguese municipal council meeting minutes (atas de câmara municipal). It identifies and classifies entities related to voting processes, including subjects being voted on, counting results, and voter participation patterns.

The model is built on top of BERTimbau Large, a BERT-based language model pre-trained on Portuguese text, with a custom linear classification head optimized for voting information extraction.

Key Features

Extracts structured voting information from Portuguese text
Identifies 8 distinct entity types with BIO tagging
Optimized for municipal council meeting minutes
High recall (90.44%) on the test set, so most voting entities are captured, at the cost of lower precision (57.73%)
Built on a state-of-the-art Portuguese language model

Model Details

Architecture: BertimbauLinearVotIE (Custom NER architecture)
Base Model: neuralmind/bert-large-portuguese-cased
Model Type: Token Classification (NER)
Parameters: ~335M
- 24 transformer layers
- 1024 hidden dimensions
- 16 attention heads
- 4096 intermediate size
Max Sequence Length: 512 tokens
Learning Rate: 5e-5
Warmup: 0.1
Batch Size: 16
Optimizer: AdamW
Weight Decay: 0.01
Number of Labels: 17 (B- and I- tags for each of the 8 entity types, plus O)
Framework: PyTorch + Hugging Face Transformers
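
The parameter figure above can be sanity-checked from the listed dimensions. The following is a rough sketch, not an official count: the vocabulary size and segment-embedding size are assumptions not stated in this card, and the pooler layer is left out on the assumption that a token-classification model does not use it.

```python
# Rough parameter count for a BERT-large encoder with a linear tagging head.
VOCAB_SIZE = 29_794     # assumption: BERTimbau vocabulary size, not stated in this card
MAX_POSITIONS = 512     # max sequence length from the list above
TYPE_VOCAB = 2          # assumption: standard BERT segment embeddings
HIDDEN = 1024
INTERMEDIATE = 4096
LAYERS = 24
NUM_LABELS = 17


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

# Token, position and segment embeddings, followed by one LayerNorm
embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm
# Query, key, value and output projections; the number of heads does not change the count
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm
feed_forward = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + layer_norm
encoder = LAYERS * (attention + feed_forward)
head = linear(HIDDEN, NUM_LABELS)  # the linear classification layer

total = embeddings + encoder + head
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the total lands within a few million of the ~335M quoted above, and the classification head contributes only a negligible share.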

Intended Uses

Primary Use Cases

Extracting structured voting data from Portuguese municipal council minutes
Automating analysis of voting patterns in Portuguese governmental documents
Creating datasets of voting records from unstructured text
Supporting transparency and civic engagement initiatives

Out-of-Scope Uses

General-purpose NER for Portuguese (use domain-general models instead)
Voting information extraction in other languages
Real-time classification (model is large and may be slow)
Legal decision-making without human review

Labels

The model predicts 17 labels using the BIO (Begin-Inside-Outside) tagging scheme: a B- and an I- tag for each of the 8 entity types, plus O.

Entity Types

Label	Description	Example
VOTING	The voting action itself	"votação", "aprovação por votação"
SUBJECT	What is being voted on	"proposta de orçamento", "regulamento municipal"
COUNTING-MAJORITY	Majority-based vote counting	"maioria", "por maioria"
COUNTING-UNANIMITY	Unanimous vote counting	"unanimidade", "por unanimidade"
VOTER-FAVOR	Voters in favor	"a favor", "votos favoráveis"
VOTER-AGAINST	Voters against	"contra", "votos contra"
VOTER-ABSTENTION	Voters abstaining	"abstenção", "absteve-se"
VOTER-ABSENT	Absent voters	"ausente", "falta"
O	Outside any entity	-

BIO Tagging

B-{ENTITY}: Beginning of an entity
I-{ENTITY}: Inside/continuation of an entity
O: Outside any entity (not part of a voting-related span)
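
As a minimal sketch of how the 17 labels arise from the 8 entity types, the snippet below builds the label list and splits a tag back into its prefix and type. The ordering of the labels here is illustrative only; the mapping actually used by the model is whatever `model.config.id2label` contains.

```python
ENTITY_TYPES = [
    "VOTING", "SUBJECT", "COUNTING-MAJORITY", "COUNTING-UNANIMITY",
    "VOTER-FAVOR", "VOTER-AGAINST", "VOTER-ABSTENTION", "VOTER-ABSENT",
]

# One B- and one I- tag per entity type, plus the single O tag.
# Illustrative ordering; the real ids come from model.config.id2label.
labels = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}


def split_tag(tag):
    """Split 'B-COUNTING-MAJORITY' into ('B', 'COUNTING-MAJORITY')."""
    if tag == "O":
        return "O", None
    # Split only once: several entity names contain hyphens themselves.
    prefix, etype = tag.split("-", 1)
    return prefix, etype


print(len(labels))
print(split_tag("I-VOTER-FAVOR"))
```

Splitting on the first hyphen only matters here because types such as COUNTING-MAJORITY and VOTER-FAVOR would otherwise be cut in the wrong place.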

Performance

Evaluated on a test set of 529 examples from Portuguese municipal council minutes:

Entity-Level Metrics (Strict Matching)

Metric	Value
Precision	57.73%
Recall	90.44%
F1 Score	70.48%
Accuracy	98.31%
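
To illustrate what strict matching means, here is a minimal sketch, assuming a predicted entity counts as correct only when its type and both boundaries equal a gold entity. The card does not say which evaluation tool produced the numbers above, so the helper names `bio_to_spans` and `strict_prf` are hypothetical.

```python
def bio_to_spans(tags):
    """Return the set of (entity_type, start, end) spans in a BIO sequence (end exclusive)."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" flushes the last span
        prefix, _, etype = tag.partition("-")     # split once: types contain hyphens
        continues = prefix == "I" and etype == current
        if current is not None and not continues:
            spans.add((current, start, i))
            current = None
        if prefix == "B" or (prefix == "I" and current is None):
            # A stray I- tag with no open span starts a new one (a lenient convention).
            start, current = i, etype
    return spans


def strict_prf(gold_tags, pred_tags):
    """Precision, recall and F1 where only exact span-and-type matches count."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = ["O", "B-SUBJECT", "I-SUBJECT", "O", "B-COUNTING-MAJORITY"]
pred = ["B-SUBJECT", "I-SUBJECT", "I-SUBJECT", "O", "B-COUNTING-MAJORITY"]
print(strict_prf(gold, pred))
```

In this toy case the predicted SUBJECT span starts one token too early, so it counts as both a false positive and a missed entity even though it overlaps the gold span almost entirely. This is one reason token-level accuracy can be high while entity-level precision stays low.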

Per-Entity Type Performance

Entity Type	Precision	Recall	F1 Score	Support
VOTING	93.54%	99.14%	96.26%	467
VOTER-ABSTENTION	93.66%	100.00%	96.73%	133
VOTER-FAVOR	89.30%	97.33%	93.14%	300
COUNTING-MAJORITY	86.96%	96.77%	91.60%	62
VOTER-AGAINST	81.82%	100.00%	90.00%	36
VOTER-ABSENT	76.19%	88.89%	82.05%	18
SUBJECT	55.17%	80.35%	65.42%	458
COUNTING-UNANIMITY	23.58%	79.02%	36.32%	305
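
The per-type rows can be related to the aggregate metrics with a short calculation. This sketch assumes the aggregate figures are micro-averaged (the card does not state this) and reconstructs approximate span counts by rounding, so the counts are estimates rather than reported values.

```python
# (entity type, precision, recall, support) copied from the table above
ROWS = [
    ("VOTING", 0.9354, 0.9914, 467),
    ("VOTER-ABSTENTION", 0.9366, 1.0000, 133),
    ("VOTER-FAVOR", 0.8930, 0.9733, 300),
    ("COUNTING-MAJORITY", 0.8696, 0.9677, 62),
    ("VOTER-AGAINST", 0.8182, 1.0000, 36),
    ("VOTER-ABSENT", 0.7619, 0.8889, 18),
    ("SUBJECT", 0.5517, 0.8035, 458),
    ("COUNTING-UNANIMITY", 0.2358, 0.7902, 305),
]

tp_total = pred_total = gold_total = 0
false_positives = {}
for name, precision, recall, support in ROWS:
    tp = round(recall * support)       # estimated correctly predicted spans
    predicted = round(tp / precision)  # estimated spans predicted for this type
    false_positives[name] = predicted - tp
    tp_total += tp
    pred_total += predicted
    gold_total += support

# Micro-averaging pools the counts of all types before dividing.
micro_precision = tp_total / pred_total
micro_recall = tp_total / gold_total
micro_f1 = 2 * tp_total / (pred_total + gold_total)

print(f"P={micro_precision:.2%} R={micro_recall:.2%} F1={micro_f1:.2%}")
print(max(false_positives, key=false_positives.get))
```

Under this reconstruction the pooled values agree with the entity-level metrics reported earlier, and COUNTING-UNANIMITY alone accounts for roughly two thirds of the estimated false positives, which is why it weighs so heavily on overall precision.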

Usage

Installation

pip install transformers torch

Basic Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer (the name below is a placeholder; replace it with this model's Hub repository ID)
model_name = "your-username/bertimbau-large-vote-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example text
text = "A proposta foi aprovada por maioria, com 5 votos a favor e 2 contra."

# Tokenize and predict (gradients are not needed for inference)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[pred.item()] for pred in predictions[0]]

# Print results
for token, label in zip(tokens, labels):
    if label != "O":
        print(f"{token}: {label}")

Advanced Usage with Pipeline

from transformers import pipeline

# Create NER pipeline (reuses the model and tokenizer loaded in Basic Usage)
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"  # Groups B- and I- tags into whole entities
)

# Extract entities
text = "A câmara deliberou por unanimidade aprovar o regulamento municipal."
entities = ner_pipeline(text)

for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} (score: {entity['score']:.4f})")

Output Example (illustrative, for the sentence used in Basic Usage)

proposta: SUBJECT (score: 0.9234)
aprovada: VOTING (score: 0.9678)
maioria: COUNTING-MAJORITY (score: 0.9123)
votos a favor: VOTER-FAVOR (score: 0.8956)
contra: VOTER-AGAINST (score: 0.9345)
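
For building structured voting records, pipeline output like the example above can be grouped by entity type. The following is a minimal sketch: `group_entities` and the `min_score` threshold are hypothetical helpers, not part of the model or of Transformers, and the sample values are copied from the illustrative output above.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Group aggregated pipeline entities by type, dropping low-confidence ones."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Same keys that the token-classification pipeline returns with an aggregation strategy
sample = [
    {"word": "proposta", "entity_group": "SUBJECT", "score": 0.9234},
    {"word": "aprovada", "entity_group": "VOTING", "score": 0.9678},
    {"word": "maioria", "entity_group": "COUNTING-MAJORITY", "score": 0.9123},
    {"word": "votos a favor", "entity_group": "VOTER-FAVOR", "score": 0.8956},
    {"word": "contra", "entity_group": "VOTER-AGAINST", "score": 0.9345},
]

record = group_entities(sample, min_score=0.9)
print(record)
```

A score threshold is one simple way to trade recall for precision; a suitable value would have to be tuned per entity type on held-out data.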

Training Procedure

Hyperparameters

Base Model: neuralmind/bert-large-portuguese-cased
Architecture: Linear classification layer on top of BERT embeddings
Best Checkpoint: Step 2100
Evaluation Examples: 529

Computational Resources

Model Size: ~647 MB (safetensors format)
Precision: float32

Limitations and Biases

Limitations

Domain-Specific: Optimized for municipal council minutes; may not generalize well to other document types or voting contexts
Language Variant: Trained on European Portuguese; performance on Brazilian Portuguese may vary
Entity Complexity: Performance is uneven across entity types; COUNTING-UNANIMITY (36.32% F1) and SUBJECT (65.42% F1) are markedly harder than the other types
Model Size: 647 MB model may be too large for resource-constrained environments
Context Length: Limited to 512 tokens (BERT constraint)
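
Since council minutes often exceed 512 tokens, one common workaround is to split a long document into overlapping windows and run the model on each. The sketch below only computes the window boundaries; it is an assumption about how the limit could be handled, not a procedure described by the authors. The defaults (510 content tokens, leaving room for [CLS] and [SEP], and an overlap of 128) are illustrative.

```python
def sliding_windows(n_tokens, max_len=510, stride=128):
    """Yield (start, end) token windows; consecutive windows overlap by `stride` tokens."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must be non-negative and smaller than max_len")
    if n_tokens <= 0:
        return
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        yield start, end
        if end >= n_tokens:
            break
        start = end - stride  # step back so entities at the boundary are seen whole


for window in sliding_windows(1200):
    print(window)
```

Predictions in the overlapping regions still need to be merged, for example by keeping the label from the window where the token is farthest from an edge. Fast tokenizers in Transformers can produce similar windows directly via `return_overflowing_tokens=True` together with `stride`.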

Known Issues

COUNTING-UNANIMITY: Significantly lower precision (23.58%) indicates many false positives for unanimity-related phrases
SUBJECT entities: Moderate precision (55.17%) suggests subject boundaries can be imprecise
Subject extraction may include extraneous context or miss complete subject descriptions

Potential Biases

Training data reflects Portuguese municipal governance language and may not capture regional variations
May reflect biases present in governmental documents
Performance may vary across different municipalities or time periods

License

This model is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Acknowledgments

Built on BERTimbau by NeuralMind
Developed for improving transparency in Portuguese municipal governance
