JengaAI Multi-Task NLP (3-Task Attention Fusion)
A multi-task NLP model built with the JengaAI framework that performs fraud detection, sentiment analysis, and call quality scoring simultaneously through a shared encoder with attention-based task fusion. Designed for Kenyan national security and telecommunications applications.
Model Capabilities
This model handles 3 tasks with 8 prediction heads producing 22 total output dimensions in a single forward pass:
| Task | Type | Heads | Outputs | Best F1 |
|---|---|---|---|---|
| Fraud Detection | Binary classification | 1 (fraud) | 2 classes: normal / fraud | 1.000 |
| Sentiment Analysis | 3-class classification | 1 (sentiment) | 3 classes: negative / neutral / positive | 0.167 |
| Call Quality Scoring | Multi-label QA | 6 heads, 17 sub-metrics | Binary per sub-metric | 0.647 - 0.967 |
Call Quality Sub-Metrics (17 Binary Outputs)
The call quality task evaluates customer service transcripts across 6 quality dimensions:
| Head | Sub-Metrics | F1 |
|---|---|---|
| Opening | greeting | 0.967 |
| Listening | acknowledgment, empathy, clarification, active_listening, patience | 0.922 |
| Proactiveness | initiative, follow_up, suggestions | 0.802 |
| Resolution | identified_issue, provided_solution, confirmed_resolution, set_expectations, offered_alternatives | 0.908 |
| Hold | asked_permission, explained_reason | 0.647 |
| Closing | proper_farewell | 0.881 |
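The head layout in the table above can be written as a plain mapping, which is convenient when decoding a flat vector of 17 sigmoid outputs. The following is a minimal sketch, not part of the JengaAI API: the `decode_quality` helper, the 0.5 threshold, and the ordering of sub-metrics within each head are assumptions made for illustration.

```python
# Sub-metric layout copied from the table above; the order within each head is assumed.
QUALITY_HEADS = {
    "opening": ["greeting"],
    "listening": ["acknowledgment", "empathy", "clarification",
                  "active_listening", "patience"],
    "proactiveness": ["initiative", "follow_up", "suggestions"],
    "resolution": ["identified_issue", "provided_solution",
                   "confirmed_resolution", "set_expectations",
                   "offered_alternatives"],
    "hold": ["asked_permission", "explained_reason"],
    "closing": ["proper_farewell"],
}


def decode_quality(probs, threshold=0.5):
    """Split a flat list of 17 probabilities into per-head boolean dicts."""
    expected = sum(len(names) for names in QUALITY_HEADS.values())
    if len(probs) != expected:
        raise ValueError(f"expected {expected} probabilities, got {len(probs)}")
    decoded, offset = {}, 0
    for head, names in QUALITY_HEADS.items():
        chunk = probs[offset:offset + len(names)]
        # Hypothetical decision rule: a sub-metric is "met" at or above the threshold.
        decoded[head] = {name: p >= threshold for name, p in zip(names, chunk)}
        offset += len(names)
    return decoded
```

The per-head sizes (1, 5, 3, 5, 2, 1) sum to the 17 binary outputs listed above.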
Architecture
Input Text
|
v
[DistilBERT Encoder] ---- 6 layers, 768 hidden, 12 attention heads
|
v
[Attention Fusion] ------- task-conditioned attention with residual connections
|
+-- [Task 0: Fraud Head] ----------- Linear(768, 2) --> softmax
+-- [Task 1: Sentiment Head] ------- Linear(768, 3) --> softmax
+-- [Task 2: QA Scoring 6 Heads] --- 6x Linear(768, 1..5) --> sigmoid
Key design choices:
- Shared encoder: All 3 tasks share a single DistilBERT encoder, enabling knowledge transfer between fraud patterns, sentiment signals, and call quality indicators
- Attention fusion: A learned attention mechanism modulates the shared representation per task, allowing each task to attend to different parts of the encoder output while still benefiting from shared features
- Residual connections: Fusion output is added to the original representation (gate_init_value=0.5), ensuring stable training and allowing each task to fall back on the base representation
- Multi-head QA: Call quality uses 6 independent classification heads with different output sizes (1-5 binary outputs each), weighted by importance during training (resolution: 2.0x, listening: 1.5x, hold: 0.5x)
Usage
With JengaAI Framework (Recommended)
pip install torch transformers pydantic pyyaml huggingface_hub
from huggingface_hub import snapshot_download
from jenga_ai.inference import InferencePipeline
# Download model
model_path = snapshot_download(
    "Rogendo/JengaAI-multi-task-nlp",
    ignore_patterns=["checkpoints/*", "logs/*"],
)
# Load pipeline
pipeline = InferencePipeline.from_checkpoint(
    model_dir=model_path,
    config_path=f"{model_path}/experiment_config.yaml",
    device="auto",
)
# Run all 3 tasks at once
result = pipeline.predict("Suspicious M-Pesa transaction from unknown account")
print(result.to_json())
# Or run a single task
fraud_result = pipeline.predict(
    "WARNING: Your Safaricom account has been compromised. Send 5000 KES to unlock.",
    task_name="fraud_detection",
)
fraud = fraud_result.task_results["fraud_detection"].heads["fraud"]
print(f"Fraud: {fraud.prediction} (confidence: {fraud.confidence:.1%})")
# Fraud: 1 (confidence: 96.9%)
Batch Inference
texts = [
    "Suspicious M-Pesa notification asking me to send money.",
    "Normal airtime top-up of 100 KES via M-Pesa.",
    "WARNING: Your account has been compromised.",
]
results = pipeline.predict_batch(texts, task_name="fraud_detection", batch_size=32)
for text, result in zip(texts, results):
    fraud = result.task_results["fraud_detection"].heads["fraud"]
    label = "FRAUD" if fraud.prediction == 1 else "LEGIT"
    print(f"[{label} {fraud.confidence:.1%}] {text}")
CLI
# Single text
python -m jenga_ai predict \
    --config experiment_config.yaml \
    --model-dir ./model \
    --text "Suspicious M-Pesa transaction from unknown account" \
    --format report
# Batch from file
python -m jenga_ai predict \
    --config experiment_config.yaml \
    --model-dir ./model \
    --input-file transcripts.jsonl \
    --output predictions.json \
    --batch-size 16
Call Quality Scoring Example
result = pipeline.predict(
    "Hello, welcome to Safaricom customer care. I understand you're having "
    "a billing issue. Let me look into that for you right away. I've found "
    "the discrepancy and corrected your balance. Is there anything else?",
    task_name="call_quality",
)
for head_name, head in result.task_results["call_quality"].heads.items():
    print(f"{head_name:16s} {head.prediction} (conf: {head.confidence:.2f})")
Output:
opening {'greeting': True} (conf: 0.82)
listening {'acknowledgment': True, 'empathy': True, ...} (conf: 0.75)
proactiveness {'initiative': True, 'follow_up': True, 'suggestions': False} (conf: 0.58)
resolution {'identified_issue': True, 'provided_solution': True, ...} (conf: 0.69)
hold {'asked_permission': False, 'explained_reason': False} (conf: 0.02)
closing {'proper_farewell': True} (conf: 0.52)
Low-Level Usage (Without JengaAI Framework)
If you only need the raw model weights and want to integrate into your own pipeline:
import torch
import json
from transformers import AutoTokenizer, AutoModel, AutoConfig
# Load components
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder_config = AutoConfig.from_pretrained("./model/encoder_config")
with open("./model/metadata.json") as f:
    metadata = json.load(f)
# Load full state dict
state_dict = torch.load("./model/model.pt", map_location="cpu", weights_only=True)
# Extract encoder weights (keys starting with "encoder."); strip only the leading prefix
encoder_state = {k.removeprefix("encoder."): v for k, v in state_dict.items() if k.startswith("encoder.")}
encoder = AutoModel.from_config(encoder_config)
encoder.load_state_dict(encoder_state)
encoder.eval()
# Run encoder
inputs = tokenizer("Suspicious transaction", return_tensors="pt", padding="max_length",
                   truncation=True, max_length=256)
with torch.no_grad():
    outputs = encoder(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0] # [1, 768]
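# Extract fraud head weights (task 0, head "fraud").
# Note: applying the head directly to the [CLS] embedding skips the attention
# fusion layer, so the result may differ from the JengaAI pipeline output.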
fraud_weight = state_dict["tasks.0.heads.fraud.1.weight"] # [2, 768]
fraud_bias = state_dict["tasks.0.heads.fraud.1.bias"] # [2]
logits = cls_embedding @ fraud_weight.T + fraud_bias
probs = torch.softmax(logits, dim=-1)
print(f"Fraud probability: {probs[0, 1].item():.4f}")
Intended Use
Primary Use Cases
- M-Pesa Fraud Detection: Classify M-Pesa transaction descriptions as fraudulent or legitimate. Designed for Safaricom and Kenyan mobile money contexts.
- Customer Sentiment Monitoring: Analyze customer feedback and communications for sentiment polarity (negative / neutral / positive).
- Call Center Quality Assurance: Score customer service call transcripts across 17 quality sub-metrics in 6 categories, to support and speed up manual QA audits.
- Multi-Signal Analysis: Run all 3 tasks simultaneously on the same text to get a comprehensive analysis (is this a fraud attempt? what's the sentiment? how good was the agent's response?).
Intended Users
- Kenyan telecommunications companies (Safaricom, Airtel Kenya)
- Financial institutions monitoring mobile money transactions
- Call center operations teams performing quality audits
- Security analysts processing incident reports
- NLP researchers working on African language and context models
Downstream Use
The model can be integrated into:
- Real-time fraud alerting systems
- Call center dashboards with automated QA scoring
- Customer feedback analysis pipelines
- Security operations center (SOC) threat triage workflows
- Mobile money transaction monitoring platforms
Out-of-Scope Use
- Not for automated decision-making without human oversight. This model should support human analysts, not replace them. High-stakes fraud decisions require human review.
- Not for non-Kenyan contexts without retraining. Entity names, transaction patterns, and call center norms are Kenyan-specific.
- Not for languages other than English. While some Swahili terms and Kenyan-specific entity names (M-Pesa, Safaricom, KRA) appear in the training data, the model is primarily English.
- Not for legal evidence. Model outputs are analytical signals, not forensic evidence.
- Not for surveillance of individuals. The model analyzes text content, not identity.
Bias, Risks, and Limitations
Known Biases
- Training data imbalance: Fraud detection was trained on only 20 samples (16 train / 4 eval). The model achieves 1.0 F1 on eval but this is likely due to the tiny eval set and potential overfitting. Real-world fraud patterns are far more diverse.
- Sentiment data: Only 15 samples, with accuracy stuck at 33.3% (random baseline for 3 classes). The sentiment head needs significantly more training data to be production-useful.
- Call quality data: 4,996 synthetic transcripts. While metrics are strong (0.65-0.97 F1), the data is synthetic, so the model may perform differently on real-world transcripts that contain noise, code-switching (Swahili-English), and non-standard grammar.
- Geographic bias: All training data reflects Kenyan contexts. The model may not generalize to other East African countries without adaptation.
Risks
- False positives in fraud detection: Legitimate transactions flagged as fraud can block real users. Always use this model with human review for enforcement actions.
- False negatives in fraud detection: Sophisticated fraud patterns not in the training data will be missed. This model is one signal among many, not a standalone detector.
- Over-reliance on QA scores: Call quality scores should augment, not replace, human QA reviewers. Edge cases (cultural nuances, sarcasm, escalation scenarios) may be scored incorrectly.
Recommendations
- Use fraud detection as a triage signal (flag for review), not an automatic block
- Retrain with production-scale data before deploying to production
- Monitor prediction confidence: route low-confidence predictions to human review using the built-in HITL routing (enable_hitl=True)
- Enable PII redaction (enable_pii=True) when processing real customer data
- Enable audit logging (enable_audit=True) for compliance and accountability
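The triage recommendation above can be illustrated with a small entropy-based routing rule. This is a sketch under stated assumptions, not the framework's HITL implementation: the function names and the 0.5 cut-off are hypothetical, and at least two classes are assumed.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a probability vector, scaled to [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))  # assumes len(probs) >= 2


def needs_human_review(probs, max_entropy=0.5):
    """Return True when the prediction is too uncertain to act on alone.

    The max_entropy threshold is a hypothetical value for illustration.
    """
    return normalized_entropy(probs) > max_entropy


print(needs_human_review([0.969, 0.031]))  # confident fraud prediction
print(needs_human_review([0.55, 0.45]))    # near coin-flip prediction
```

A threshold like this would need tuning on production data, since confidence from a model trained on 20 fraud samples is unlikely to be well calibrated.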
Training Details
Training Data
| Dataset | Task | Samples | Source |
|---|---|---|---|
| sample_classification.jsonl | Fraud Detection | 20 | Synthetic M-Pesa transaction descriptions |
| sample_sentiment.jsonl | Sentiment Analysis | 15 | Synthetic customer feedback |
| synthetic_qa_metrics_data_v01x.json | Call Quality | 4,996 | Synthetic call center transcripts with 17 binary QA labels |
Train/eval split: 80/20 random split (seed=42)
All datasets are synthetic, generated to reflect linguistic patterns in Kenyan telecommunications and financial services contexts. They contain English text with occasional Swahili terms and Kenyan-specific entities (M-Pesa, Safaricom, KRA, Kenyan phone numbers).
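A seeded 80/20 split like the one described above can be sketched as follows. The helper is hypothetical and uses Python's `random` module, so it will not reproduce the exact partition produced by the JengaAI data loader; it only shows why 20 samples yield 16 train / 4 eval and 15 samples yield 12 train / 3 eval.

```python
import random


def train_eval_split(samples, eval_fraction=0.2, seed=42):
    """Shuffle with a fixed seed, then hold out eval_fraction of the samples."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)  # rounding rule is an assumption
    return shuffled[n_eval:], shuffled[:n_eval]


train, evaluation = train_eval_split(range(20))
print(len(train), len(evaluation))  # 16 4
```

With only 3 or 4 held-out examples for sentiment and fraud, a single misclassified sample moves accuracy by 25 to 33 percentage points.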
Training Procedure
Preprocessing
- Tokenizer: distilbert-base-uncased WordPiece tokenizer
- Max sequence length: 256 tokens
- Padding: max_length (padded to 256)
- Truncation: enabled
Architecture
- Encoder: DistilBERT (6 layers, 768 hidden, 12 heads) — 66.4M parameters
- Fusion: Attention fusion with residual connections — 1.2M parameters
- Task heads: 8 linear heads across 3 tasks — 17K parameters
- Total: 67.6M parameters (258MB on disk)
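The parameter and size figures above can be checked with a little arithmetic. The sketch assumes each head is a single linear layer on the 768-dimensional representation, as drawn in the architecture diagram, and that weights are stored as FP32; the on-disk figure is computed in MiB.

```python
HIDDEN = 768
# Output sizes of the 8 heads: fraud, sentiment, then the 6 call quality heads.
HEAD_OUTPUTS = [2, 3, 1, 5, 3, 5, 2, 1]

# A Linear(768, n) layer has 768 * n weights plus n biases.
head_params = sum(HIDDEN * n + n for n in HEAD_OUTPUTS)

# Encoder and fusion counts are taken from the list above.
total_params = 66.4e6 + 1.2e6 + head_params
size_mib = total_params * 4 / 2**20  # FP32 stores 4 bytes per parameter

print(head_params)                    # 16918
print(round(total_params / 1e6, 1))   # 67.6
print(round(size_mib))                # 258
```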
Training Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Batch size | 16 |
| Epochs | 12 (best checkpoint at epoch 3) |
| Weight decay | 0.01 |
| Warmup steps | 20 |
| Max gradient norm | 1.0 |
| Optimizer | AdamW |
| Precision | FP32 |
| Task sampling | Proportional (temperature=2.0) |
| Early stopping patience | 5 epochs |
| Best model metric | eval_loss |
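The "Proportional (temperature=2.0)" task sampling setting is commonly implemented by raising dataset sizes to the power 1/T before normalizing. The sketch below shows that common formulation; whether JengaAI uses exactly this formula is an assumption, and the task name "sentiment" is a placeholder.

```python
def task_sampling_probs(dataset_sizes, temperature=2.0):
    """Temperature-scaled sampling: p_i is proportional to n_i ** (1 / T)."""
    scaled = {task: n ** (1.0 / temperature) for task, n in dataset_sizes.items()}
    total = sum(scaled.values())
    return {task: value / total for task, value in scaled.items()}


# Sample counts from the Training Data table; "sentiment" is an assumed task name.
sizes = {"fraud_detection": 20, "sentiment": 15, "call_quality": 4996}
for task, p in task_sampling_probs(sizes).items():
    print(f"{task:16s} {p:.3f}")
```

A temperature above 1 flattens the distribution, so the two tiny datasets are visited more often than their raw share of roughly 0.7% of all samples would suggest.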
Task Loss Weights
| Head | Weight | Rationale |
|---|---|---|
| fraud | 1.0 | Standard |
| sentiment | 1.0 | Standard |
| opening | 1.0 | Standard |
| listening | 1.5 | Important quality dimension |
| proactiveness | 1.0 | Standard |
| resolution | 2.0 | Most critical quality dimension |
| hold | 0.5 | Less frequent in transcripts |
| closing | 1.0 | Standard |
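To make the weighting concrete, here is a minimal pure-Python sketch of how per-head losses could be combined with the weights above. It is illustrative only: the card does not state whether the framework sums or averages weighted losses, and the helper names are hypothetical.

```python
import math

HEAD_WEIGHTS = {
    "fraud": 1.0, "sentiment": 1.0, "opening": 1.0, "listening": 1.5,
    "proactiveness": 1.0, "resolution": 2.0, "hold": 0.5, "closing": 1.0,
}


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example (fraud, sentiment heads)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over one head's sub-metrics (call quality)."""
    losses = [max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
              for x, t in zip(logits, targets)]
    return sum(losses) / len(losses)


def weighted_total(head_losses):
    """Sum per-head losses scaled by their weights (summing is an assumption)."""
    return sum(HEAD_WEIGHTS[head] * loss for head, loss in head_losses.items())
```

For example, equal raw losses on the resolution and hold heads contribute to the total in a 4:1 ratio.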
Training Loss Progression
| Epoch | Train Loss | Eval Loss | Status |
|---|---|---|---|
| 3 | 1.878 | 1.948 | Best checkpoint |
| 7 | 1.471 | 2.057 | Overfitting begins |
| 8 | 1.403 | 2.068 | Continued overfitting |
The best checkpoint was selected at epoch 3 based on eval_loss. Training continued to epoch 12 but eval loss increased after epoch 3, indicating overfitting — expected given the small fraud and sentiment datasets.
Speeds, Sizes, Times
| Metric | Value |
|---|---|
| Model size (disk) | 258 MB |
| Parameters | 67.6M |
| Inference latency (single task, CPU) | ~590 ms |
| Inference latency (all 3 tasks, CPU) | ~1,960 ms |
| Batch throughput (32 texts, single task, CPU) | ~647 ms/sample |
| Training time | ~5 minutes (CPU, 12 epochs) |
Evaluation
Metrics
All metrics are computed on the 20% held-out eval split.
Fraud Detection (binary classification):
| Metric | Value |
|---|---|
| Accuracy | 1.000 |
| Precision | 1.000 |
| Recall | 1.000 |
| F1 | 1.000 |
Sentiment Analysis (3-class classification):
| Metric | Value |
|---|---|
| Accuracy | 0.333 |
| Precision | 0.111 |
| Recall | 0.333 |
| F1 | 0.167 |
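These values are what macro-averaged metrics produce when a classifier predicts the same class for every sample of a balanced 3-sample eval set. The sketch below reproduces the numbers under that assumption; the actual eval predictions and averaging mode are not stated in this card.

```python
def macro_metrics(y_true, y_pred, num_classes):
    """Accuracy plus macro-averaged precision, recall and F1."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return (accuracy, sum(precisions) / num_classes,
            sum(recalls) / num_classes, sum(f1s) / num_classes)


# Hypothetical eval set: one sample per class, model always predicts class 1.
metrics = macro_metrics([0, 1, 2], [1, 1, 1], num_classes=3)
print([round(m, 3) for m in metrics])  # [0.333, 0.111, 0.333, 0.167]
```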
Call Quality (multi-label binary per head):
| Head | Precision | Recall | F1 |
|---|---|---|---|
| Opening | 0.967 | 0.967 | 0.967 |
| Listening | 0.893 | 0.953 | 0.922 |
| Proactiveness | 0.746 | 0.868 | 0.802 |
| Resolution | 0.918 | 0.898 | 0.908 |
| Hold | 0.856 | 0.519 | 0.647 |
| Closing | 0.881 | 0.881 | 0.881 |
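As a consistency check, F1 can be recomputed from the precision and recall columns as their harmonic mean. The recomputed values agree with the reported ones to within about 0.001. Small gaps are expected because the inputs are rounded, and because an F1 averaged over sub-metrics need not equal the harmonic mean of averaged precision and recall (how the per-head figures were aggregated is not stated here).

```python
# head: (precision, recall, reported F1), copied from the table above
REPORTED = {
    "opening": (0.967, 0.967, 0.967),
    "listening": (0.893, 0.953, 0.922),
    "proactiveness": (0.746, 0.868, 0.802),
    "resolution": (0.918, 0.898, 0.908),
    "hold": (0.856, 0.519, 0.647),
    "closing": (0.881, 0.881, 0.881),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for head, (p, r, f1) in REPORTED.items():
    print(f"{head:14s} reported={f1:.3f} recomputed={f1_score(p, r):.3f}")
```

The hold head's low recall (0.519) against fairly high precision (0.856) indicates that it misses about half of the positive hold behaviours rather than raising many false alarms.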
Results Summary
- Fraud detection achieves perfect metrics on the eval set, but this is a very small eval set (4 samples). Production deployment requires evaluation on a larger, more diverse dataset.
- Sentiment analysis performs at random baseline (33.3% accuracy for 3 classes), indicating the 15-sample dataset is insufficient. This head needs retraining with production data.
- Call quality shows strong performance across most heads (0.80-0.97 F1), with the "hold" category being the weakest (0.647 F1) due to fewer hold-related examples in the training data.
Model Examination
Attention Fusion
The attention fusion mechanism learns task-specific attention patterns over the shared encoder output. This allows:
- The fraud head to attend to transaction-related tokens (amounts, account references)
- The sentiment head to attend to opinion-bearing words
- The QA heads to attend to conversational flow patterns
The fusion uses a gated residual connection (initialized at 0.5), meaning each task's representation is a learned blend of the task-specific attended output and the original encoder output.
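The gated blend can be sketched in a few lines of plain Python. This is a conceptual illustration, not the JengaAI fusion layer: the use of a per-task query vector, scaled dot-product scoring, the first-token vector as the "original" representation, and a scalar gate are all assumptions.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(token_vectors, task_query, gate=0.5):
    """Blend a task-attended vector with the first-token vector.

    gate=0.5 mirrors the documented initial value; in training it would be learned.
    """
    dim = len(task_query)
    # Single-head scaled dot-product scores between the task query and each token.
    scores = [sum(q * t for q, t in zip(task_query, tok)) / math.sqrt(dim)
              for tok in token_vectors]
    weights = softmax(scores)
    # Attention-weighted average of the token vectors.
    attended = [sum(w * tok[i] for w, tok in zip(weights, token_vectors))
                for i in range(dim)]
    base = token_vectors[0]  # stands in for the original encoder representation
    return [gate * a + (1.0 - gate) * b for a, b in zip(attended, base)]


tokens = [[1.0, 0.0], [0.0, 1.0]]
print(fuse(tokens, task_query=[0.0, 0.0]))  # [0.75, 0.25]
```

With the gate at 0 the task falls back entirely on the base representation, and with the gate at 1 it uses only the task-attended output.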
Security Features
When used with the JengaAI inference framework, the model supports:
- PII Redaction: Masks Kenyan-specific PII (phone numbers, national IDs, KRA PINs, M-Pesa transaction IDs) before inference
- Explainability: Token-level importance scores via attention analysis or gradient methods
- Human-in-the-Loop: Automatic routing of low-confidence predictions to human reviewers based on entropy-based uncertainty estimation
- Audit Trail: Tamper-evident logging of every inference call with SHA-256 hash chains
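The tamper-evident property of a SHA-256 hash chain can be shown with a short sketch. This illustrates the general technique only; the record fields, the genesis value, and the function names are hypothetical and do not describe the JengaAI audit log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, record):
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain, record):
    """Append a record whose hash covers both its content and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev_hash": prev_hash,
                  "hash": _digest(prev_hash, record)})


def verify_chain(chain):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, {"task": "fraud_detection", "prediction": 1})
append_entry(log, {"task": "fraud_detection", "prediction": 0})
print(verify_chain(log))  # True
```

Because each hash depends on the previous one, changing an earlier record invalidates every later entry unless the whole chain is recomputed.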
Technical Specifications
Model Architecture and Objective
- Architecture: DistilBERT encoder + attention fusion + multi-task heads
- Encoder: 6 transformer layers, 768 hidden size, 12 attention heads, 30,522 vocab
- Fusion: Single-head attention with residual gating
- Objectives: CrossEntropy (fraud, sentiment) + BCEWithLogits (call quality)
Compute Infrastructure
Hardware
- Training: CPU (Intel/AMD, standard workstation)
- Inference: CPU or CUDA GPU
Software
- PyTorch 2.x
- Transformers 5.x
- JengaAI Framework V2
- Python 3.11+
Environmental Impact
- Hardware Type: CPU (standard workstation)
- Training Time: ~5 minutes
- Carbon Emitted: Negligible (short training run on CPU)
Citation
@software{jengaai2026,
title = {JengaAI: Low-Code Multi-Task NLP for African Security Applications},
author = {Rogendo},
year = {2026},
url = {https://huggingface.co/Rogendo/JengaAI-multi-task-nlp},
}
Model Card Authors
Rogendo
Model Card Contact
For questions, issues, or contributions: GitHub Issues