Baseline_BERTimbau-large_Vote_Identification-Council-PT

Model Description

This model is a Named Entity Recognition (NER) system specifically designed for extracting voting information from Portuguese municipal council meeting minutes (atas de câmara municipal). It identifies and classifies entities related to voting processes, including subjects being voted on, counting results, and voter participation patterns.

The model is built on top of BERTimbau Large, a BERT-based language model pre-trained on Portuguese text, with a custom linear classification head optimized for voting information extraction.

Key Features

  • Extracts structured voting information from Portuguese text
  • Identifies 8 distinct entity types with BIO tagging
  • Optimized for municipal council meeting minutes
  • High recall (90.44%), so most voting entities are captured, at the cost of lower precision (57.73%)
  • Built on BERTimbau Large, a BERT model pre-trained on Portuguese text

Model Details

  • Architecture: BertimbauLinearVotIE (Custom NER architecture)
  • Base Model: neuralmind/bert-large-portuguese-cased
  • Model Type: Token Classification (NER)
  • Parameters: ~335M
    • 24 transformer layers
    • 1024 hidden dimensions
    • 16 attention heads
    • 4096 intermediate size
  • Max Sequence Length: 512 tokens
  • Learning Rate: 5e-5
  • Warmup: 0.1
  • Batch Size: 16
  • Optimizer: AdamW
  • Weight Decay: 0.01
  • Number of Labels: 17 (B- and I- tags for each of the 8 entity types, plus O)
  • Framework: PyTorch + Hugging Face Transformers

Intended Uses

Primary Use Cases

  • Extracting structured voting data from Portuguese municipal council minutes
  • Automating analysis of voting patterns in Portuguese governmental documents
  • Creating datasets of voting records from unstructured text
  • Supporting transparency and civic engagement initiatives

Out-of-Scope Uses

  • General-purpose NER for Portuguese (use domain-general models instead)
  • Voting information extraction in other languages
  • Real-time classification (model is large and may be slow)
  • Legal decision-making without human review

Labels

The model predicts 17 labels using the BIO (Begin-Inside-Outside) tagging scheme:

Entity Types

Label | Description | Example
------|-------------|--------
VOTING | The voting action itself | "votação", "aprovação por votação"
SUBJECT | What is being voted on | "proposta de orçamento", "regulamento municipal"
COUNTING-MAJORITY | Majority-based vote counting | "maioria", "por maioria"
COUNTING-UNANIMITY | Unanimous vote counting | "unanimidade", "por unanimidade"
VOTER-FAVOR | Voters in favor | "a favor", "votos favoráveis"
VOTER-AGAINST | Voters against | "contra", "votos contra"
VOTER-ABSTENTION | Voters abstaining | "abstenção", "absteve-se"
VOTER-ABSENT | Absent voters | "ausente", "falta"
O | Outside any entity | -

BIO Tagging

  • B-{ENTITY}: Beginning of an entity
  • I-{ENTITY}: Inside/continuation of an entity
  • O: Outside any entity (not part of a voting-related span)
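The count of 17 labels follows directly from the scheme: one B- and one I- tag per entity type, plus a single O tag. A minimal sketch of that arithmetic is below. The `ID2LABEL` ordering is a hypothetical illustration; the authoritative mapping is whatever the released checkpoint stores in `model.config.id2label`.

```python
# Entity types as listed in the Labels section.
ENTITY_TYPES = [
    "VOTING", "SUBJECT", "COUNTING-MAJORITY", "COUNTING-UNANIMITY",
    "VOTER-FAVOR", "VOTER-AGAINST", "VOTER-ABSTENTION", "VOTER-ABSENT",
]


def build_bio_labels(entity_types):
    """Return 'O' followed by a B- and an I- tag for every entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)

# Hypothetical id assignment, for illustration only; the real order is
# defined by the checkpoint's config, not by this sketch.
ID2LABEL = dict(enumerate(LABELS))

print(len(LABELS))  # 8 entity types * 2 prefixes + 1 = 17
```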

Performance

Evaluated on a test set of 529 examples from Portuguese municipal council minutes:

Entity-Level Metrics (Strict Matching)

Metric | Value
-------|------
Precision | 57.73%
Recall | 90.44%
F1 Score | 70.48%
Accuracy | 98.31%
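The card does not spell out what "strict matching" means. In NER evaluation it usually means a predicted entity counts as correct only when its boundaries and its type both equal those of a gold entity. The sketch below computes precision, recall and F1 under that assumption. The function name and the toy spans are made up for illustration and are not taken from the evaluation code behind the numbers above.

```python
def strict_match_scores(gold_spans, predicted_spans):
    """Entity-level precision, recall and F1 under strict matching.

    Each span is a (start, end, entity_type) tuple. A prediction is a true
    positive only if all three fields equal those of some gold span.
    """
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy data (invented): one type error and one spurious prediction.
gold = [(3, 4, "VOTING"), (5, 6, "COUNTING-MAJORITY"), (9, 12, "VOTER-FAVOR")]
pred = [
    (3, 4, "VOTING"),
    (5, 6, "COUNTING-UNANIMITY"),  # right boundaries, wrong type: no credit
    (9, 12, "VOTER-FAVOR"),
    (14, 15, "VOTER-AGAINST"),     # not in gold: false positive
]
precision, recall, f1 = strict_match_scores(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Under this definition, extra predictions lower precision without touching recall, which is the pattern the overall figures show.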

Per-Entity Type Performance

Entity Type | Precision | Recall | F1 Score | Support
------------|-----------|--------|----------|--------
VOTING | 93.54% | 99.14% | 96.26% | 467
VOTER-ABSTENTION | 93.66% | 100.00% | 96.73% | 133
VOTER-FAVOR | 89.30% | 97.33% | 93.14% | 300
COUNTING-MAJORITY | 86.96% | 96.77% | 91.60% | 62
VOTER-AGAINST | 81.82% | 100.00% | 90.00% | 36
VOTER-ABSENT | 76.19% | 88.89% | 82.05% | 18
SUBJECT | 55.17% | 80.35% | 65.42% | 458
COUNTING-UNANIMITY | 23.58% | 79.02% | 36.32% | 305
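The per-type rows can be related to the overall metrics with a short calculation. The sketch below assumes the overall figures are micro-averaged (the card does not say so). It recovers approximate true-positive and prediction counts from each row, sums them, and recomputes precision, recall and F1. Under that assumption the totals come out within rounding of the reported 57.73%, 90.44% and 70.48%, and they show that COUNTING-UNANIMITY and SUBJECT account for most of the false positives.

```python
# (precision %, recall %, support) copied from the per-entity table.
PER_TYPE = {
    "VOTING": (93.54, 99.14, 467),
    "VOTER-ABSTENTION": (93.66, 100.00, 133),
    "VOTER-FAVOR": (89.30, 97.33, 300),
    "COUNTING-MAJORITY": (86.96, 96.77, 62),
    "VOTER-AGAINST": (81.82, 100.00, 36),
    "VOTER-ABSENT": (76.19, 88.89, 18),
    "SUBJECT": (55.17, 80.35, 458),
    "COUNTING-UNANIMITY": (23.58, 79.02, 305),
}


def micro_average(per_type):
    """Rebuild counts from rounded percentages and micro-average them."""
    tp_total = predicted_total = gold_total = 0
    for precision, recall, support in per_type.values():
        tp = round(recall / 100 * support)         # correct predictions
        predicted = round(tp / (precision / 100))  # all predictions of this type
        tp_total += tp
        predicted_total += predicted
        gold_total += support
    p = tp_total / predicted_total
    r = tp_total / gold_total
    f1 = 2 * p * r / (p + r)
    return 100 * p, 100 * r, 100 * f1


micro_p, micro_r, micro_f1 = micro_average(PER_TYPE)
print(f"P={micro_p:.2f}% R={micro_r:.2f}% F1={micro_f1:.2f}%")
```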

Usage

Installation

pip install transformers torch

Basic Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer (model_name is a placeholder; replace it with the actual repository ID)
model_name = "your-username/bertimbau-large-vote-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example text
text = "A proposta foi aprovada por maioria, com 5 votos a favor e 2 contra."

# Tokenize and predict (gradients are not needed for inference)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[pred.item()] for pred in predictions[0]]

# Print results
for token, label in zip(tokens, labels):
    if label != "O":
        print(f"{token}: {label}")
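The loop above prints one line per WordPiece token, so a multi-word entity shows up as several fragments. The following sketch, which is not part of the released model, merges `##` continuation pieces and consecutive B-/I- tags into whole spans. The helper name `group_entities` and the sample tokenization are hypothetical; the real tokenizer may split these words differently.

```python
def group_entities(tokens, labels):
    """Merge WordPiece tokens and BIO labels into (text, entity_type) spans."""
    spans = []
    current_words, current_type = [], None

    def flush():
        nonlocal current_words, current_type
        if current_words:
            spans.append((" ".join(current_words), current_type))
        current_words, current_type = [], None

    for token, label in zip(tokens, labels):
        if token in ("[CLS]", "[SEP]", "[PAD]"):
            continue  # special tokens never belong to an entity
        if token.startswith("##"):
            # Continuation piece: glue it onto the previous word when a span is open.
            if current_words:
                current_words[-1] += token[2:]
            continue
        if label == "O":
            flush()
            continue
        prefix, entity_type = label.split("-", 1)
        if prefix == "B" or entity_type != current_type:
            flush()  # a B- tag or a type change starts a new span
            current_type = entity_type
        current_words.append(token)
    flush()
    return spans


# Hypothetical tokenization and labels, for illustration only.
sample_tokens = ["[CLS]", "por", "unanim", "##idade", ",", "votos", "a", "favor", "[SEP]"]
sample_labels = [
    "O", "B-COUNTING-UNANIMITY", "I-COUNTING-UNANIMITY", "I-COUNTING-UNANIMITY",
    "O", "B-VOTER-FAVOR", "I-VOTER-FAVOR", "I-VOTER-FAVOR", "O",
]
print(group_entities(sample_tokens, sample_labels))
```

A continuation piece takes the span of the word it belongs to and its own predicted label is ignored, which is one simple convention among several possible ones.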

Advanced Usage with Pipeline

from transformers import pipeline

# Create NER pipeline (reuses the model and tokenizer loaded in Basic Usage)
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"  # Groups B- and I- tags
)

# Extract entities
text = "A câmara deliberou por unanimidade aprovar o regulamento municipal."
entities = ner_pipeline(text)

for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} (score: {entity['score']:.4f})")

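Output Example

Illustrative output in the pipeline's format for the Basic Usage sentence ("A proposta foi aprovada por maioria, com 5 votos a favor e 2 contra."):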

proposta: SUBJECT (score: 0.9234)
aprovada: VOTING (score: 0.9678)
maioria: COUNTING-MAJORITY (score: 0.9123)
votos a favor: VOTER-FAVOR (score: 0.8956)
contra: VOTER-AGAINST (score: 0.9345)
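For building datasets of voting records, the flat entity list is often easier to work with once it is grouped by entity type. The sketch below assumes the input has the shape returned by a token-classification pipeline with an aggregation strategy (dicts with `entity_group`, `word` and `score` keys). The helper name `to_vote_record` is hypothetical and the sample data simply mirrors the example output above.

```python
from collections import defaultdict


def to_vote_record(entities):
    """Group aggregated pipeline entities as {entity_group: [text, ...]}."""
    record = defaultdict(list)
    for entity in entities:
        record[entity["entity_group"]].append(entity["word"])
    return dict(record)


# Sample entities mirroring the example output above.
example_entities = [
    {"entity_group": "SUBJECT", "word": "proposta", "score": 0.9234},
    {"entity_group": "VOTING", "word": "aprovada", "score": 0.9678},
    {"entity_group": "COUNTING-MAJORITY", "word": "maioria", "score": 0.9123},
    {"entity_group": "VOTER-FAVOR", "word": "votos a favor", "score": 0.8956},
    {"entity_group": "VOTER-AGAINST", "word": "contra", "score": 0.9345},
]

record = to_vote_record(example_entities)
for entity_type, texts in record.items():
    print(entity_type, texts)
```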

Training Procedure

Hyperparameters

  • Base Model: neuralmind/bert-large-portuguese-cased
  • Architecture: Linear classification layer on top of BERT embeddings
  • Best Checkpoint: Step 2100
  • Evaluation Examples: 529

Computational Resources

  • Model Size: ~647 MB (safetensors format)
  • Precision: float32

Limitations and Biases

Limitations

  1. Domain-Specific: Optimized for municipal council minutes; may not generalize well to other document types or voting contexts
  2. Language Variant: Trained on European Portuguese; performance on Brazilian Portuguese may vary
  3. Entity Complexity: Lower performance on COUNTING-UNANIMITY (36.32% F1) suggests difficulty with certain entity types
  4. Model Size: 647 MB model may be too large for resource-constrained environments
  5. Context Length: Limited to 512 tokens (BERT constraint)
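Council minutes are often longer than 512 tokens, so limitation 5 matters in practice. One common workaround, sketched below, is to split the token sequence into overlapping windows and run the model on each window. This is a generic technique, not something the card describes; the function name and the default window and overlap sizes are assumptions. Hugging Face tokenizers also accept `stride` and `return_overflowing_tokens` arguments that serve a similar purpose.

```python
def sliding_windows(token_ids, window=510, stride=128):
    """Split token_ids into overlapping chunks.

    window: maximum tokens per chunk (510 leaves room for [CLS] and [SEP]).
    stride: number of tokens shared by two consecutive chunks.
    Returns a list of (start_offset, chunk) pairs.
    """
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append((start, token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # this chunk already reaches the end of the sequence
        start += step
    return chunks


# Stand-in for the token ids of a long document.
long_document = list(range(1200))
windows = sliding_windows(long_document)
for offset, chunk in windows:
    print(offset, len(chunk))
```

Predictions from overlapping regions then need to be reconciled, for example by keeping the prediction from the window where the token is farthest from an edge.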

Known Issues

  • COUNTING-UNANIMITY: Significantly lower precision (23.58%) indicates many false positives for unanimity-related phrases
  • SUBJECT entities: Moderate precision (55.17%) suggests subject boundaries can be imprecise
  • Subject extraction may include extraneous context or miss complete subject descriptions
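One possible mitigation for the false positives noted above is to require a higher confidence score for the low-precision entity types. The sketch below filters aggregated pipeline output with per-type thresholds. The threshold values are hypothetical placeholders that would need tuning on held-out data, and the card does not report whether this improves results.

```python
# Hypothetical thresholds, not tuned values.
DEFAULT_THRESHOLD = 0.50
TYPE_THRESHOLDS = {"COUNTING-UNANIMITY": 0.95, "SUBJECT": 0.80}


def filter_by_score(entities, thresholds=TYPE_THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Keep entities whose score meets the threshold for their entity type."""
    return [
        entity
        for entity in entities
        if entity["score"] >= thresholds.get(entity["entity_group"], default)
    ]


# Made-up predictions to show the effect of the filter.
candidates = [
    {"entity_group": "COUNTING-UNANIMITY", "word": "unanimidade", "score": 0.97},
    {"entity_group": "COUNTING-UNANIMITY", "word": "todos", "score": 0.71},
    {"entity_group": "SUBJECT", "word": "regulamento municipal", "score": 0.88},
    {"entity_group": "VOTING", "word": "deliberou", "score": 0.66},
]
kept = filter_by_score(candidates)
print([entity["word"] for entity in kept])
```

Raising a threshold trades recall for precision, so the right values depend on whether missed entities or spurious ones are more costly for the application.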

Potential Biases

  • Training data reflects Portuguese municipal governance language and may not capture regional variations
  • May reflect biases present in governmental documents
  • Performance may vary across different municipalities or time periods

License

This model is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Acknowledgments

  • Built on BERTimbau by NeuralMind
  • Developed for improving transparency in Portuguese municipal governance