Baseline_BERTimbau-large_Vote_Identification-Council-PT
Model Description
This model is a Named Entity Recognition (NER) system specifically designed for extracting voting information from Portuguese municipal council meeting minutes (atas de câmara municipal). It identifies and classifies entities related to voting processes, including subjects being voted on, counting results, and voter participation patterns.
The model is built on top of BERTimbau Large, a BERT-based language model pre-trained on Portuguese text, with a custom linear classification head optimized for voting information extraction.
Key Features
- Extracts structured voting information from Portuguese text
- Identifies 8 distinct entity types with BIO tagging
- Optimized for municipal council meeting minutes
- High recall (90.44%), so most voting entities are captured, at the cost of lower precision (57.73%)
- Built on state-of-the-art Portuguese language model
Model Details
- Architecture: BertimbauLinearVotIE (Custom NER architecture)
- Base Model: neuralmind/bert-large-portuguese-cased
- Model Type: Token Classification (NER)
- Parameters: ~335M
- 24 transformer layers
- 1024 hidden dimensions
- 16 attention heads
- 4096 intermediate size
- Max Sequence Length: 512 tokens
- Learning Rate: 5e-5
- Warmup: 0.1
- Batch Size: 16
- Optimizer: AdamW
- Weight Decay: 0.01
- Number of Labels: 17 (B- and I- tags for each of the 8 entity types, plus O)
- Framework: PyTorch + Hugging Face Transformers
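As a rough sanity check on the ~335M figure, the parameter count can be estimated from the dimensions listed above. This is a minimal sketch: the vocabulary size of 29,794 is an assumption (it is not stated in this card), and the count covers only the embeddings, the encoder layers, and a 17-way linear head, without a pooler.

```python
# Rough parameter count for a BERT-large encoder plus a linear tagging head.
# ASSUMPTION: vocabulary size of 29,794 (not stated in this card).
VOCAB_SIZE = 29_794
HIDDEN = 1024          # hidden dimensions
LAYERS = 24            # transformer layers
INTERMEDIATE = 4096    # feed-forward intermediate size
MAX_POSITIONS = 512    # max sequence length
TYPE_VOCAB = 2         # segment (token type) embeddings
NUM_LABELS = 17


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(dim: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


embeddings = (
    VOCAB_SIZE * HIDDEN
    + MAX_POSITIONS * HIDDEN
    + TYPE_VOCAB * HIDDEN
    + layer_norm(HIDDEN)
)
per_layer = (
    4 * linear(HIDDEN, HIDDEN)        # query, key, value, attention output
    + layer_norm(HIDDEN)
    + linear(HIDDEN, INTERMEDIATE)    # feed-forward up-projection
    + linear(INTERMEDIATE, HIDDEN)    # feed-forward down-projection
    + layer_norm(HIDDEN)
)
head = linear(HIDDEN, NUM_LABELS)     # linear classification head
total = embeddings + LAYERS * per_layer + head

print(f"{total / 1e6:.1f}M parameters")
```

The number of attention heads does not change the count, because the heads split the same 1024-dimensional projections. Under these assumptions the estimate lands within about 1% of the stated ~335M.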
Intended Uses
Primary Use Cases
- Extracting structured voting data from Portuguese municipal council minutes
- Automating analysis of voting patterns in Portuguese governmental documents
- Creating datasets of voting records from unstructured text
- Supporting transparency and civic engagement initiatives
Out-of-Scope Uses
- General-purpose NER for Portuguese (use domain-general models instead)
- Voting information extraction in other languages
- Real-time classification (model is large and may be slow)
- Legal decision-making without human review
Labels
The model predicts 17 labels using the BIO (Begin-Inside-Outside) tagging scheme:
Entity Types
| Label | Description | Example |
|---|---|---|
| VOTING | The voting action itself | "votação", "aprovação por votação" |
| SUBJECT | What is being voted on | "proposta de orçamento", "regulamento municipal" |
| COUNTING-MAJORITY | Majority-based vote counting | "maioria", "por maioria" |
| COUNTING-UNANIMITY | Unanimous vote counting | "unanimidade", "por unanimidade" |
| VOTER-FAVOR | Voters in favor | "a favor", "votos favoráveis" |
| VOTER-AGAINST | Voters against | "contra", "votos contra" |
| VOTER-ABSTENTION | Voters abstaining | "abstenção", "absteve-se" |
| VOTER-ABSENT | Absent voters | "ausente", "falta" |
| O | Outside any entity | - |
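The 17 labels follow directly from the table: a B- and an I- tag for each of the 8 entity types, plus O. The sketch below builds such a label set; the ordering and the integer ids are assumptions for illustration only, and the authoritative mapping is the one stored in the model's `config.id2label`.

```python
# Entity types taken from the table above.
ENTITY_TYPES = [
    "VOTING",
    "SUBJECT",
    "COUNTING-MAJORITY",
    "COUNTING-UNANIMITY",
    "VOTER-FAVOR",
    "VOTER-AGAINST",
    "VOTER-ABSTENTION",
    "VOTER-ABSENT",
]

# ASSUMPTION: "O" first, then B-/I- pairs in table order. The real id
# assignment may differ; read it from model.config.id2label.
labels = ["O"] + [
    f"{prefix}-{entity}" for entity in ENTITY_TYPES for prefix in ("B", "I")
]
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(len(labels))
```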
BIO Tagging
- B-{ENTITY}: Beginning of an entity
- I-{ENTITY}: Inside/continuation of an entity
- O: Outside any entity (not part of a voting-related span)
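To show how the B-, I-, and O tags combine into entities, here is a minimal decoder that turns a tag sequence into `(entity_type, start, end)` spans with an exclusive end index. The tokens and tags in the example are hand-written for illustration, not model output, and treating a stray I- tag as the start of a new span is a design choice of this sketch.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (entity_type, start, end) spans.

    `end` is exclusive. An I- tag that does not continue the current
    entity is treated leniently as the start of a new span.
    """
    spans = []
    current, start = None, None
    for i, tag in enumerate(tags):
        # Split on the first hyphen only, so "B-COUNTING-MAJORITY"
        # yields the entity type "COUNTING-MAJORITY".
        prefix, _, entity = tag.partition("-")
        if prefix == "I" and entity == current:
            continue  # still inside the current span
        if current is not None:
            spans.append((current, start, i))  # close the open span
            current, start = None, None
        if prefix in ("B", "I"):
            current, start = entity, i  # open a new span
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


tokens = ["aprovada", "por", "maioria", ",", "com", "votos", "a", "favor"]
tags = [
    "B-VOTING", "O", "B-COUNTING-MAJORITY", "O", "O",
    "B-VOTER-FAVOR", "I-VOTER-FAVOR", "I-VOTER-FAVOR",
]
for entity, start, end in bio_to_spans(tags):
    print(entity, " ".join(tokens[start:end]))
```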
Performance
Evaluated on a test set of 529 examples from Portuguese municipal council minutes:
Entity-Level Metrics (Strict Matching)
| Metric | Value |
|---|---|
| Precision | 57.73% |
| Recall | 90.44% |
| F1 Score | 70.48% |
| Accuracy | 98.31% |
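The F1 score is the harmonic mean of precision and recall, so it can be checked against the other two rows. A small sketch using the rounded values from the table:

```python
# Rounded values from the table above.
precision = 0.5773
recall = 0.9044

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2%}")
```

The result agrees with the reported 70.48% to within one hundredth of a percentage point; the small gap is what one would expect from starting with already rounded inputs.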
Per-Entity Type Performance
| Entity Type | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| VOTING | 93.54% | 99.14% | 96.26% | 467 |
| VOTER-ABSTENTION | 93.66% | 100.00% | 96.73% | 133 |
| VOTER-FAVOR | 89.30% | 97.33% | 93.14% | 300 |
| COUNTING-MAJORITY | 86.96% | 96.77% | 91.60% | 62 |
| VOTER-AGAINST | 81.82% | 100.00% | 90.00% | 36 |
| VOTER-ABSENT | 76.19% | 88.89% | 82.05% | 18 |
| SUBJECT | 55.17% | 80.35% | 65.42% | 458 |
| COUNTING-UNANIMITY | 23.58% | 79.02% | 36.32% | 305 |
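The overall entity-level figures appear to be micro-averages of this table. The sketch below reconstructs approximate counts from the rounded percentages (true positives from recall × support, predicted entities from true positives ÷ precision) and pools them; the reconstructed counts are an assumption of this sketch, not numbers published in the card.

```python
# (entity type, precision, recall, support) copied from the table above.
ROWS = [
    ("VOTING", 0.9354, 0.9914, 467),
    ("VOTER-ABSTENTION", 0.9366, 1.0000, 133),
    ("VOTER-FAVOR", 0.8930, 0.9733, 300),
    ("COUNTING-MAJORITY", 0.8696, 0.9677, 62),
    ("VOTER-AGAINST", 0.8182, 1.0000, 36),
    ("VOTER-ABSENT", 0.7619, 0.8889, 18),
    ("SUBJECT", 0.5517, 0.8035, 458),
    ("COUNTING-UNANIMITY", 0.2358, 0.7902, 305),
]

total_tp = total_pred = total_support = 0
for name, precision, recall, support in ROWS:
    tp = round(recall * support)   # reconstructed true positives
    pred = round(tp / precision)   # reconstructed predicted entities
    total_tp += tp
    total_pred += pred
    total_support += support

# Micro-averaging pools the counts over all entity types.
micro_precision = total_tp / total_pred
micro_recall = total_tp / total_support
micro_f1 = 2 * total_tp / (total_pred + total_support)

print(f"P={micro_precision:.2%} R={micro_recall:.2%} F1={micro_f1:.2%}")
```

Under this reconstruction the pooled values match the overall precision, recall, and F1 reported earlier, and most of the false positives come from COUNTING-UNANIMITY and SUBJECT.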
Usage
Installation
pip install transformers torch
Basic Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
# Load model and tokenizer
model_name = "your-username/bertimbau-large-vote-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
# Example text
text = "A proposta foi aprovada por maioria, com 5 votos a favor e 2 contra."
# Tokenize and predict (gradients are not needed at inference time)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)
# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
# Print results
for token, label in zip(tokens, labels):
    if label != "O":
        print(f"{token}: {label}")
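The loop above prints one line per WordPiece token, so a word split into pieces such as `unanim` and `##idade` appears as two entries, alongside special tokens like `[CLS]`. A minimal sketch of merging pieces back into words, keeping the label of the first piece, follows. The tokenization in the example is hypothetical and only illustrates the `##` convention.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(tokens, labels):
    """Merge '##' continuation pieces into words.

    Each word keeps the label predicted for its first piece;
    special tokens are dropped.
    """
    words = []
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            word, first_label = words[-1]
            words[-1] = (word + token[2:], first_label)
        else:
            words.append((token, label))
    return words


# Hypothetical tokenization and labels, for illustration only.
tokens = ["[CLS]", "por", "unanim", "##idade", "[SEP]"]
labels = [
    "O",
    "B-COUNTING-UNANIMITY",
    "I-COUNTING-UNANIMITY",
    "I-COUNTING-UNANIMITY",
    "O",
]
for word, label in merge_wordpieces(tokens, labels):
    print(f"{word}: {label}")
```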
Advanced Usage with Pipeline
from transformers import pipeline
# Create NER pipeline (reuses `model` and `tokenizer` loaded in Basic Usage)
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # Groups B- and I- tags
)
# Extract entities
text = "A câmara deliberou por unanimidade aprovar o regulamento municipal."
entities = ner_pipeline(text)
for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} (score: {entity['score']:.4f})")
Output Example
Entities for the Basic Usage sentence ("A proposta foi aprovada por maioria, com 5 votos a favor e 2 contra."), shown in the pipeline output format:
proposta: SUBJECT (score: 0.9234)
aprovada: VOTING (score: 0.9678)
maioria: COUNTING-MAJORITY (score: 0.9123)
votos a favor: VOTER-FAVOR (score: 0.8956)
contra: VOTER-AGAINST (score: 0.9345)
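For building datasets of voting records, the flat entity list is usually easier to work with once grouped by entity type. A minimal sketch, using dictionaries with the same `word`, `entity_group`, and `score` keys that the pipeline loop reads, and the values from the output example; the function name `to_vote_record` is hypothetical.

```python
from collections import defaultdict

# Entities mirroring the output example above.
entities = [
    {"word": "proposta", "entity_group": "SUBJECT", "score": 0.9234},
    {"word": "aprovada", "entity_group": "VOTING", "score": 0.9678},
    {"word": "maioria", "entity_group": "COUNTING-MAJORITY", "score": 0.9123},
    {"word": "votos a favor", "entity_group": "VOTER-FAVOR", "score": 0.8956},
    {"word": "contra", "entity_group": "VOTER-AGAINST", "score": 0.9345},
]


def to_vote_record(entities):
    """Group entity texts by entity type, preserving document order."""
    record = defaultdict(list)
    for entity in entities:
        record[entity["entity_group"]].append(entity["word"])
    return dict(record)


record = to_vote_record(entities)
for entity_type, words in record.items():
    print(f"{entity_type}: {words}")
```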
Training Procedure
Hyperparameters
- Base Model: neuralmind/bert-large-portuguese-cased
- Architecture: Linear classification layer on top of BERT embeddings
- Best Checkpoint: Step 2100
- Evaluation Examples: 529
Computational Resources
- Model Size: ~647 MB (safetensors format)
- Precision: float32
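Checkpoint size is roughly the parameter count times the bytes per parameter. The sketch below applies that arithmetic to the ~335M figure from Model Details for two common dtypes; it ignores file-format overhead and does not settle which dtype the published weights are stored in.

```python
PARAMS = 335_000_000  # approximate count from Model Details

# Bytes used per parameter for two common storage dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

sizes_mb = {
    dtype: PARAMS * nbytes / 1e6 for dtype, nbytes in BYTES_PER_PARAM.items()
}
for dtype, size in sizes_mb.items():
    print(f"{dtype}: ~{size:.0f} MB")
```

By this estimate, ~335M parameters take about 1,340 MB at 4 bytes each and about 670 MB at 2 bytes each, so the ~647 MB figure is closer to two bytes per parameter. It is worth checking the stored dtype of the checkpoint before planning memory budgets.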
Limitations and Biases
Limitations
- Domain-Specific: Optimized for municipal council minutes; may not generalize well to other document types or voting contexts
- Language Variant: Trained on European Portuguese; performance on Brazilian Portuguese may vary
- Entity Complexity: Lower performance on COUNTING-UNANIMITY (36.32% F1) suggests difficulty with certain entity types
- Model Size: 647 MB model may be too large for resource-constrained environments
- Context Length: Limited to 512 tokens (BERT constraint)
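Because minutes are often longer than 512 tokens, one common workaround is to run the model over overlapping windows and merge the predictions afterwards. The sketch below only computes the window boundaries; the window of 510 (leaving room for `[CLS]` and `[SEP]`) and the overlap of 128 tokens are assumptions, not settings used by the authors.

```python
def sliding_windows(n_tokens, window=510, overlap=128):
    """Return (start, end) token ranges that cover n_tokens.

    Consecutive ranges share `overlap` tokens, so an entity cut at one
    window edge is seen whole in the next window when it is shorter
    than the overlap. `end` is exclusive.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    ranges = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        ranges.append((start, end))
        if end >= n_tokens:
            break
        start = end - overlap
    return ranges


print(sliding_windows(1200))
```

Hugging Face fast tokenizers can produce such windows directly through the `return_overflowing_tokens=True` and `stride` arguments; how to reconcile conflicting predictions in the overlapping regions is left open here.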
Known Issues
- COUNTING-UNANIMITY: Significantly lower precision (23.58%) indicates many false positives for unanimity-related phrases
- SUBJECT entities: Moderate precision (55.17%) suggests subject boundaries can be imprecise
- Subject extraction may include extraneous context or miss complete subject descriptions
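One way to reduce false positives for the weaker entity types is to require a higher confidence score before accepting them, trading some recall for precision. This is a minimal sketch: the threshold values and the example entities are hypothetical and would need tuning on held-out data, and nothing in this card shows that score filtering fixes the issues above.

```python
# HYPOTHETICAL thresholds: stricter for the entity types with low precision.
THRESHOLDS = {"COUNTING-UNANIMITY": 0.95, "SUBJECT": 0.80}
DEFAULT_THRESHOLD = 0.50


def filter_entities(entities, thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Keep entities whose score meets the threshold for their type."""
    return [
        entity
        for entity in entities
        if entity["score"] >= thresholds.get(entity["entity_group"], default)
    ]


# Hypothetical pipeline-style entities, for illustration only.
candidates = [
    {"word": "por unanimidade", "entity_group": "COUNTING-UNANIMITY", "score": 0.97},
    {"word": "unanimidade", "entity_group": "COUNTING-UNANIMITY", "score": 0.71},
    {"word": "regulamento municipal", "entity_group": "SUBJECT", "score": 0.85},
    {"word": "contra", "entity_group": "VOTER-AGAINST", "score": 0.60},
]
kept = filter_entities(candidates)
for entity in kept:
    print(f"{entity['word']}: {entity['entity_group']}")
```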
Potential Biases
- Training data reflects Portuguese municipal governance language and may not capture regional variations
- May reflect biases present in governmental documents
- Performance may vary across different municipalities or time periods
License
This model is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
Acknowledgments
- Built on BERTimbau by NeuralMind
- Developed for improving transparency in Portuguese municipal governance