Darija Toxicity Classifier 🇲🇦
A transformer-based NLP model for detecting toxic content in Moroccan Darija and Arabizi.
This model is designed to handle the linguistic complexity of the Moroccan dialect, including Arabizi (Arabic written in Latin characters, with digits standing in for certain Arabic letters), such as:
3 → ع, 7 → ح, 9 → ق
It also supports code-switched text mixing Darija, Arabic, French, English, and Tamazight.
Model Overview
| Property | Value |
|---|---|
| Model ID | 0khacha/darija-toxicity-classifier |
| Architecture | Fine-tuned from SI2M-Lab/DarijaBERT-arabizi |
| Task | Binary Sequence Classification (Safe / Toxic) |
| Framework | Hugging Face Transformers |
| Training Data | 16,000+ labeled Moroccan Darija/Arabizi samples |
Quick Inference (Transformers)
from transformers import pipeline
classifier = pipeline(
    "text-classification",
    model="0khacha/darija-toxicity-classifier",
)
result = classifier("salam khouya")
print(result)
# Output: [{'label': 'SAFE', 'score': 0.9845}]
What Makes This Model Special?
Dialect-Aware
Built specifically for Moroccan linguistic patterns, not generic Arabic.
Arabizi Handling
Understands numeric character substitutions like:
in3alsa7a3likom
Custom Preprocessing
The model was trained with specialized normalization:
- Lowercasing
- Removing dash/underscore splitting (w-a-l-o → walo)
- Fixing spaced characters (n 3 a l → n3al)
- Reducing elongation (heeeey → hey)
- Whitespace normalization
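The normalization steps listed above can be sketched roughly as follows. This is a minimal illustration, not the preprocessing code used in training: the function name `normalize_text` and every regular expression are assumptions chosen to reproduce the three examples in the list.

```python
import re

# Hypothetical helper: the patterns below are illustrative assumptions,
# not the author's actual training-time preprocessing.
_CHAR = r"[^\W_]"  # one letter or digit (Unicode-aware, excludes underscore)

def normalize_text(text: str) -> str:
    # 1. Lowercasing
    text = text.lower()
    # 2. Join single characters split by dashes/underscores: w-a-l-o -> walo
    text = re.sub(
        rf"\b(?:{_CHAR}[-_]){{2,}}{_CHAR}\b",
        lambda m: re.sub(r"[-_]", "", m.group(0)),
        text,
    )
    # 3. Join single characters split by spaces: n 3 a l -> n3al
    text = re.sub(
        rf"\b(?:{_CHAR} ){{2,}}{_CHAR}\b",
        lambda m: m.group(0).replace(" ", ""),
        text,
    )
    # 4. Reduce elongation: a letter repeated 3+ times collapses to one
    text = re.sub(r"([^\W\d_])\1{2,}", r"\1", text)
    # 5. Whitespace normalization
    return re.sub(r"\s+", " ", text).strip()

for raw in ["W-a-l-o", "n 3 a l", "heeeey", "  Salam   khouya "]:
    print(repr(raw), "->", repr(normalize_text(raw)))
```

Collapsing a repeated letter to a single occurrence matches the heeeey → hey example, but it would also shorten words that legitimately contain a tripled letter; the rule used in training may differ.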
Performance
| Metric | Score |
|---|---|
| Accuracy | ~94% |
| F1-Score | ~93% |
| Inference Latency (GPU) | ~50 ms |
Note: Performance may vary depending on hardware and deployment setup.
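For readers unfamiliar with the two metrics, the sketch below shows how accuracy and F1 are commonly computed for a binary task. It assumes toxic is the positive class and uses made-up labels; the card does not state the averaging scheme or evaluation split behind the reported scores, so this does not reproduce them.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# Toy labels for illustration only (1 = toxic, 0 = safe).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```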
Example Predictions
Example 1: Safe Content
Input:
"bghit nakol"
Output:
Safe (98.45%)
Example 2: Toxic Content
Input:
"rak stupid"
Output:
Toxic
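The pipeline returns a list of dictionaries, as in the Quick Inference output, while the examples above use a display form such as "Safe (98.45%)". A small helper can bridge the two. The name `format_prediction` is hypothetical, and the sketch assumes each prediction carries a string `label` and a float `score`.

```python
def format_prediction(prediction: dict) -> str:
    """Turn one pipeline-style result into a short display string."""
    label = str(prediction["label"]).capitalize()  # 'SAFE' -> 'Safe'
    return f"{label} ({prediction['score']:.2%})"

# Sample shaped like the Quick Inference output; no model call is made here.
sample = {"label": "SAFE", "score": 0.9845}
print(format_prediction(sample))
```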
Limitations
- May struggle with extremely rare slang
- Accuracy may drop on context-dependent toxicity such as sarcasm
- Not intended for legal or automated moderation without human review
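One way to respect the last point is to act automatically only on confident safe predictions and queue everything else for a person. The sketch below assumes pipeline-style outputs; the function name `route_prediction` and the 0.90 threshold are illustrative assumptions, not values from the model card.

```python
def route_prediction(prediction: dict, safe_label: str = "SAFE",
                     threshold: float = 0.90) -> str:
    """Return 'allow' only for confident safe predictions, else 'human_review'."""
    is_confident_safe = (
        prediction["label"] == safe_label and prediction["score"] >= threshold
    )
    return "allow" if is_confident_safe else "human_review"

# Hypothetical predictions shaped like the pipeline output.
for pred in [
    {"label": "SAFE", "score": 0.9845},
    {"label": "SAFE", "score": 0.60},
    {"label": "TOXIC", "score": 0.99},
]:
    print(pred, "->", route_prediction(pred))
```

The threshold would need tuning on held-out data; a higher value sends more content to reviewers, a lower one lets more uncertain content through.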
Dataset & Privacy
The training dataset is not publicly available for privacy and ethical reasons.
For research collaboration: [email protected]
License
MIT License
Acknowledgments
- DarijaBERT team at SI2M-Lab
- Hugging Face Transformers ecosystem
- PyTorch
- The Moroccan NLP community
Citation
If you use this model in your research, please cite:
@misc{darija-toxicity-classifier,
  author = {Khacha, Mohamed},
  title = {Darija Toxicity Classifier},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/0khacha/darija-toxicity-classifier}
}
Contributing
Contributions, issues, and feature requests are welcome!
Feel free to check the issues page.
Made with ❤️ for the Moroccan NLP community