---
language:
- pt
license: cc-by-nc-nd-4.0
colorTo: blue
sdk: streamlit
app_port: 8501
tags:
- streamlit
- text-classification
- multi-label-classification
- gradient-boosting
- active-learning
- bertimbau
- municipal-documents
- meeting-minutes
library_name: transformers
base_model:
- neuralmind/bert-base-portuguese-cased
---

# Council Topics Classifier: Multi-Label Topic Classification of Discussion Subjects in Portuguese Council Meeting Minutes

## Model Description

**Council Topics Classifier** is an ensemble machine learning system for **multi-label topic classification** of discussion subjects in Portuguese municipal council meeting minutes. The model combines Gradient Boosting with Active Learning and BERTimbau embeddings to identify multiple simultaneous topics within a single discussion subject, which suits governmental content that often spans several administrative areas at once.

🚀 **Try out the model:** [Demo Council Topics Classifier PT](https://huggingface.co/spaces/anonymous12321/Council_Topics_Classifier_PT)

## Key Features

- 🎯 **Specialized for Municipal Topics**: Trained on discussion subjects from Portuguese council meeting minutes, with domain-specific preprocessing
- 🏆 **Advanced Ensemble**: Combines LogisticRegression + 3x GradientBoosting models with adaptive weighting
- 🧠 **Deep + Classical Features**: Merges TF-IDF vectors (10k features) with BERTimbau embeddings (768 dims)
- 📊 **Multi-Label Classification**: Identifies multiple co-occurring topics per subject
- **Optimized Thresholds**: Dynamic per-label thresholds tuned on validation data
- 🔄 **Active Learning Ready**: Adaptive weighting based on label frequency for continuous improvement

## Model Details

- **Architecture**: Ensemble (LogisticRegression + 3x GradientBoosting)
- **Base Models**:
  - 1x LogisticRegression (L2 regularization, C=1.0)
  - GradientBoosting Model #1 (n_estimators=100, max_depth=3, learning_rate=0.1)
  - GradientBoosting Model #2 (n_estimators=150, max_depth=5, learning_rate=0.05)
  - GradientBoosting Model #3 (n_estimators=200, max_depth=4, learning_rate=0.1)
- **Feature Extractor**: TF-IDF (n-grams 1-3, 10k features, Portuguese stopwords)
- **Embedding Model**: neuralmind/bert-base-portuguese-cased (BERTimbau)
- **Total Features**: 10,768 dimensions (10k TF-IDF + 768 BERT)
- **Training Method**: One-vs-Rest with class weighting + Focal Loss
- **Optimization**: Adaptive ensemble weighting by label frequency
- **Framework**: Scikit-learn + PyTorch + Transformers

## How It Works

The model processes Portuguese municipal texts through a four-stage pipeline to identify relevant topics:

1. **Portuguese-Specific Preprocessing**
   - Lowercasing and normalization
   - Municipal entity recognition (e.g., "Câmara Municipal" → "camara_municipal")
   - Legal term preservation (e.g., "Art. 5" → "artigo_5")
   - Number and currency standardization

2. **Dual Feature Extraction**
   - **TF-IDF**: Captures term frequency patterns with n-grams (1-3)
   - **BERTimbau**: Provides contextual semantic embeddings

3. **Ensemble Prediction**
   - Each base model predicts probabilities for all labels
   - Adaptive weighted combination based on label rarity:
     - **Rare labels**: Higher LogisticRegression weight
     - **Common labels**: Higher GradientBoosting weight

4. **Dynamic Thresholding**
   - Per-label optimal thresholds (not fixed 0.5)
   - Optimized for F1-score on validation set
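The adaptive weighting in step 3 can be sketched as follows. This is a minimal illustration, not the released weighting scheme: the function names `lr_weight_per_label` and `combine_probabilities`, the linear mapping from frequency to weight, and the bounds `w_min`/`w_max` are all assumptions made for the example.

```python
import numpy as np

def lr_weight_per_label(label_counts, w_min=0.3, w_max=0.7):
    """Hypothetical scheme: rarer labels get a higher LogisticRegression weight."""
    counts = np.asarray(label_counts, dtype=float)
    span = counts.max() - counts.min()
    # rarity is 1.0 for the rarest label and 0.0 for the most common one
    rarity = (counts.max() - counts) / span if span > 0 else np.zeros_like(counts)
    return w_min + (w_max - w_min) * rarity

def combine_probabilities(lr_proba, gb_proba, lr_weight):
    """Per-label convex combination of the two probability matrices."""
    return lr_weight * lr_proba + (1.0 - lr_weight) * gb_proba

# Toy example: three labels with 10, 100 and 55 training examples
weights = lr_weight_per_label([10, 100, 55])
lr_proba = np.array([[0.9, 0.2, 0.4]])   # one document, three labels
gb_proba = np.array([[0.5, 0.6, 0.8]])   # e.g. the mean over the GB models
ensemble_proba = combine_probabilities(lr_proba, gb_proba, weights)
print(weights, ensemble_proba)
```

Because the weights are per label, broadcasting applies them column by column, so every document in a batch is combined with the same label-specific mix.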


## Usage

```python
import numpy as np
from joblib import load
from transformers import AutoTokenizer, AutoModel
import torch

# Load models
models_dir = 'models'
tfidf = load(f'{models_dir}/tfidf_vectorizer.joblib')
mlb = load(f'{models_dir}/mlb_encoder.joblib')
optimal_thresholds = np.load(f'{models_dir}/optimal_thresholds.npy')
adaptive_weights = np.load(f'{models_dir}/adaptive_weights.npy')
logistic_model = load(f'{models_dir}/logistic_model.joblib')
gb_models = load(f'{models_dir}/gb_models.joblib')

# Load BERTimbau
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
bert_model = AutoModel.from_pretrained("neuralmind/bert-base-portuguese-cased").to(device)

# Preprocess text
text = "A Câmara Municipal aprovou o orçamento de 2024..."
# (apply smart_preprocess function - see demo source code)

# Extract features
tfidf_features = tfidf.transform([text])
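# Mean-pooled BERTimbau embeddings (768 dims), as described under Feature Engineering.
# max_length=512 is an assumption; see the demo source code for the exact settings.
bert_model.eval()
inputs = tokenizer([text], return_tensors="pt", truncation=True, max_length=512).to(device)
with torch.no_grad():
    hidden = bert_model(**inputs).last_hidden_state  # (1, seq_len, 768)
mask = inputs["attention_mask"].unsqueeze(-1).float()
bert_embeddings = ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).cpu().numpy()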

# Combine features and predict
X_combined = np.hstack([tfidf_features.toarray(), bert_embeddings])

# Get ensemble predictions
logistic_proba = logistic_model.predict_proba(X_combined)
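# Apply GB models and adaptive weighting (see demo source code for the exact scheme).
# ASSUMPTION: gb_models is a list of fitted multi-label classifiers, and adaptive_weights
# holds one LogisticRegression weight per label in [0, 1].
gb_proba = np.mean([m.predict_proba(X_combined) for m in gb_models], axis=0)
ensemble_proba = adaptive_weights * logistic_proba + (1 - adaptive_weights) * gb_proba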

# Apply optimal thresholds
predictions = (ensemble_proba >= optimal_thresholds).astype(int)
predicted_labels = mlb.inverse_transform(predictions)

print(f"Predicted Topics: {predicted_labels}")
```


## Dataset

The model was trained on a curated dataset of Portuguese municipal council meeting minutes:

- **Documents**: 2,500+ discussion subjects extracted from meeting minutes
- **Time Period**: 2021-2024
- **Source**: Portuguese municipalities (anonymized)
- **Labels**: 22 topic categories
- **Annotation**: Multi-label (avg. 1.69 labels per document)
- **Split**: 60% train / 20% validation / 20% test

## Categories

The model classifies topics into 22 Portuguese administrative categories:

| Category | Portuguese Name |
|----------|-----------------|
| General Administration | Administração Geral, Finanças e Recursos Humanos |
| Environment | Ambiente |
| Economic Activities | Atividades Económicas |
| Social Action | Ação Social |
| Science | Ciência |
| Communication | Comunicação e Relações Públicas |
| External Cooperation | Cooperação Externa e Relações Internacionais |
| Culture | Cultura |
| Sports | Desporto |
| Education | Educação e Formação Profissional |
| Energy & Telecommunications | Energia e Telecomunicações |
| Housing | Habitação |
| Private Construction | Obras Particulares |
| Public Works | Obras Públicas |
| Territorial Planning | Ordenamento do Território |
| Other | Outros |
| Heritage | Património |
| Municipal Police | Polícia Municipal |
| Animal Protection | Proteção Animal |
| Civil Protection | Proteção Civil |
| Health | Saúde |
| Traffic & Transport | Trânsito, Transportes e Comunicações |


## Evaluation Results

### Comprehensive Performance Metrics

| Metric | Score | Description |
|--------|-------|-------------|
| **F1-macro** | **0.5485** | Macro-averaged F1 score |
| **F1-micro** | **0.7363** | Micro-averaged F1 score |
| **F1-weighted** | **0.742** | Weighted-averaged F1 score |
| **Accuracy** | **0.4518** | Subset accuracy (exact match) |
| **Hamming Loss** | **0.0412** | Label-wise error rate |
| **Average Precision (macro)** | **0.606** | Macro-averaged AP |
| **Average Precision (micro)** | **0.734** | Micro-averaged AP |
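To make the metric definitions concrete, the sketch below computes macro F1, micro F1, subset accuracy, and Hamming loss on a toy label matrix. The helper `multilabel_metrics` and the toy data are illustrative assumptions; they do not reproduce the scores in the table, which come from the held-out test set.

```python
import numpy as np

def multilabel_metrics(y_true, y_pred):
    """Compute common multi-label metrics from binary indicator matrices."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Per-label confusion counts (one entry per column)
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0).astype(float)
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0).astype(float)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0).astype(float)
    denom = 2 * tp + fp + fn
    per_label_f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    micro_denom = 2 * tp.sum() + fp.sum() + fn.sum()
    return {
        # Macro: average the per-label F1 scores, so rare labels count equally
        "f1_macro": float(per_label_f1.mean()),
        # Micro: pool all counts first, so frequent labels dominate
        "f1_micro": float(2 * tp.sum() / micro_denom) if micro_denom > 0 else 0.0,
        # Subset accuracy: a document counts only if every label matches
        "subset_accuracy": float((y_true == y_pred).all(axis=1).mean()),
        # Hamming loss: fraction of individual label decisions that are wrong
        "hamming_loss": float((y_true != y_pred).mean()),
    }

# Toy data: 4 documents, 3 labels
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
metrics = multilabel_metrics(y_true, y_pred)
print(metrics)
```

The gap between subset accuracy and Hamming loss in the table follows from these definitions: a single wrong label makes a whole document count as incorrect for subset accuracy, while Hamming loss penalizes only that one decision.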

## Training Details

### Preprocessing
- Portuguese stopword removal
- Municipal entity recognition
- Legal term preservation
- N-gram extraction (1-3)
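A rough sketch of the entity and legal-term normalization is shown below. It is not the `smart_preprocess` function from the demo: the name `preprocess_sketch`, the contents of `ENTITY_MAP`, and the regular expression are assumptions chosen to reproduce the two examples given earlier in this card.

```python
import re

# Hypothetical entity table; only the first entry appears in this card
ENTITY_MAP = {
    "câmara municipal": "camara_municipal",
    "assembleia municipal": "assembleia_municipal",
}

def preprocess_sketch(text):
    """Lowercase, join known entities into single tokens, normalize article references."""
    text = text.lower()
    for phrase, token in ENTITY_MAP.items():
        text = text.replace(phrase, token)
    # "art. 5" or "artigo 5" becomes the single token "artigo_5"
    text = re.sub(r"\bart(?:igo)?\.?\s*(\d+)", r"artigo_\1", text)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

print(preprocess_sketch("A Câmara Municipal aprovou o Art. 5"))
```

Joining multi-word entities into one token keeps them intact through TF-IDF tokenization, so they are counted as a single term rather than as unrelated unigrams.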

### Feature Engineering
- TF-IDF: 10,000 features with sublinear scaling
- BERTimbau: Mean-pooled embeddings (768 dims)
- Feature concatenation: 10,768 total dimensions
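The mean pooling and concatenation steps can be illustrated with plain NumPy. The arrays below are random or zero placeholders standing in for real BERTimbau hidden states and TF-IDF rows, and `masked_mean_pool` is a hypothetical helper name; only the shapes (768 and 10,000) come from this card.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per document, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, :, None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # guard against empty rows
    return summed / counts

# Placeholder inputs: 1 document, 4 token positions (the last is padding), 768 dims
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(1, 4, 768))
attention_mask = np.array([[1, 1, 1, 0]])
bert_vec = masked_mean_pool(token_embeddings, attention_mask)   # shape (1, 768)

# Placeholder for tfidf.transform([text]).toarray()
tfidf_vec = np.zeros((1, 10_000))

# Concatenate along the feature axis: 10,000 + 768 = 10,768 dimensions
X_combined = np.hstack([tfidf_vec, bert_vec])
print(X_combined.shape)
```

Masking matters when documents are batched with padding: without it, the padded positions would pull every pooled vector toward the padding embedding.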

### Model Training
- **Strategy**: One-vs-Rest multi-label classification
- **Class Balancing**: Inverse frequency weighting
- **Validation**: Stratified 5-fold cross-validation
- **Threshold Optimization**: Per-label F1-maximization
- **Active Learning**: Adaptive ensemble weights
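Per-label threshold optimization can be sketched as a grid search that keeps, for each label, the threshold with the highest validation F1. The function names, the 0.05-step grid, and the toy data are assumptions for illustration; the search procedure actually used for the released thresholds is not documented here.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for one label; returns 0.0 when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def tune_thresholds(y_true, proba, grid=None):
    """Pick, per label, the grid threshold that maximizes F1 on validation data."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else np.asarray(grid)
    thresholds = np.full(y_true.shape[1], 0.5)
    for j in range(y_true.shape[1]):
        scores = [f1_binary(y_true[:, j], (proba[:, j] >= t).astype(int)) for t in grid]
        # argmax returns the first (lowest) threshold among ties
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds

# Toy validation set: 4 documents, 2 labels
y_val = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
proba_val = np.array([[0.32, 0.21], [0.41, 0.91], [0.12, 0.62], [0.22, 0.83]])
thresholds = tune_thresholds(y_val, proba_val)
print(thresholds)
```

In this toy case the first label is best separated well below 0.5 and the second well above it, which is the situation a fixed 0.5 cutoff handles poorly.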

### Hyperparameters

**LogisticRegression:**
```python
{
    'penalty': 'l2',
    'C': 1.0,
    'max_iter': 1000,
    'class_weight': 'balanced'
}
```

**GradientBoosting Models:**
```python
# Model #1
{'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1}

# Model #2
{'n_estimators': 150, 'max_depth': 5, 'learning_rate': 0.05}

# Model #3
{'n_estimators': 200, 'max_depth': 4, 'learning_rate': 0.1}
```
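One way these hyperparameters could map onto scikit-learn estimators is shown below. The One-vs-Rest wrapping follows the training strategy stated above, but the variable names and the exact construction are assumptions, not the released training code; class weighting for the Gradient Boosting models and Focal Loss are not reproduced here.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# L2 is scikit-learn's default penalty, so it is not passed explicitly
logistic_model = OneVsRestClassifier(
    LogisticRegression(C=1.0, max_iter=1000, class_weight="balanced")
)

GB_PARAMS = [
    {"n_estimators": 100, "max_depth": 3, "learning_rate": 0.1},   # Model #1
    {"n_estimators": 150, "max_depth": 5, "learning_rate": 0.05},  # Model #2
    {"n_estimators": 200, "max_depth": 4, "learning_rate": 0.1},   # Model #3
]

# One binary GradientBoostingClassifier per label, for each configuration
gb_models = [
    OneVsRestClassifier(GradientBoostingClassifier(**params)) for params in GB_PARAMS
]

# Each model would then be fitted on the combined features and the binarized labels,
# e.g. model.fit(X_train, Y_train) with Y_train of shape (n_samples, 22)
```

`GradientBoostingClassifier` has no `class_weight` parameter, so any inverse-frequency balancing for those models would have to be passed as `sample_weight` at fit time.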

## Limitations

- **Language Specificity**: Optimized for Portuguese
- **Domain Focus**: Best performance on municipal/administrative texts
- **Label Set**: Fixed to 22 predefined categories
- **Rare Topics**: Lower performance on infrequent labels (<20 training examples)
- **Ambiguous Cases**: May over-predict for texts with multiple overlapping themes


## License

This model is released under the **Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International** license (CC BY-NC-ND 4.0).

---