Darija Toxicity Classifier πŸ‡²πŸ‡¦

A transformer-based NLP model for detecting toxic content in Moroccan Darija and Arabizi.

This model is specifically designed to handle the linguistic complexity of Moroccan Darija, including Arabizi (Arabic written in Latin script, with digits standing in for Arabic letters), such as:

  • 3 β†’ ΨΉ
  • 7 β†’ Ψ­
  • 9 β†’ Ω‚

It also supports code-switched text mixing Darija, Arabic, French, English, and Tamazight.
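As a rough illustration, the digit substitutions above can be expressed as a simple character mapping. This is a hypothetical helper for readers unfamiliar with Arabizi, not part of the model's actual preprocessing — the model consumes raw Arabizi text directly:

```python
# Hypothetical mapping of common Arabizi digits to the Arabic letters
# they stand for (illustrative only; not used by the model itself).
ARABIZI_DIGITS = {"3": "ع", "7": "ح", "9": "ق"}

def transliterate_digits(text: str) -> str:
    """Replace Arabizi digits with the corresponding Arabic letters."""
    return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in text)
```

For example, `transliterate_digits("sa7a")` yields `"saحa"`.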


πŸ“Œ Model Overview

  • Model ID: 0khacha/darija-toxicity-classifier
  • Architecture: fine-tuned from SI2M-Lab/DarijaBERT-arabizi
  • Task: binary sequence classification (Safe / Toxic)
  • Framework: Hugging Face Transformers
  • Training data: 16,000+ labeled Moroccan Darija/Arabizi samples

πŸš€ Quick Inference (Transformers)

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="0khacha/darija-toxicity-classifier"
)

result = classifier("salam khouya")
print(result)
# Output: [{'label': 'SAFE', 'score': 0.9845}]

🧠 What Makes This Model Special?

🌍 Dialect-Aware

Built specifically for Moroccan linguistic patterns β€” not generic Arabic.

πŸ”’ Arabizi Handling

Understands numeric character substitutions like:

  • in3al
  • sa7a
  • 3likom

🧹 Custom Preprocessing

The model was trained with specialized normalization:

  • Lowercasing
  • Joining dash/underscore-split words (w-a-l-o → walo)
  • Joining spaced-out characters (n 3 a l → n3al)
  • Reducing character elongation (heeeey → hey)
  • Whitespace normalization
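The steps above can be sketched roughly as follows. This is a hypothetical reimplementation for illustration, not the exact training pipeline:

```python
import re

def normalize(text: str) -> str:
    """Sketch of the normalization steps described above (approximate)."""
    text = text.lower()                             # lowercasing
    text = re.sub(r"(?<=\w)[-_](?=\w)", "", text)   # w-a-l-o -> walo
    # join runs of single characters separated by spaces: "n 3 a l" -> "n3al"
    text = re.sub(r"\b(\w(?: \w)+)\b",
                  lambda m: m.group(1).replace(" ", ""), text)
    text = re.sub(r"(.)\1{2,}", r"\1", text)        # heeeey -> hey
    text = re.sub(r"\s+", " ", text).strip()        # whitespace normalization
    return text
```

Note the elongation rule here collapses any run of three or more identical characters to one; the real pipeline may be more conservative.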

πŸ“Š Performance

  • Accuracy: ~94%
  • F1-score: ~93%
  • Inference speed (GPU): ~50 ms

Note: Performance may vary depending on hardware and deployment setup.


πŸ“– Example Predictions

Example 1: Safe Content

Input:

"bghit nakol"

Output:

Safe (98.45%)

Example 2: Toxic Content

Input:

"rak stupid"

Output:

Toxic

⚠️ Limitations

  • May struggle with extremely rare slang
  • Context-dependent toxicity (sarcasm) may reduce accuracy
  • Not intended for legal decisions or fully automated moderation without human review
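Given the last point, a deployment might route low-confidence predictions to a human reviewer instead of acting on them automatically. A minimal sketch — the threshold value is an assumption to be tuned on a held-out set, and the label names follow the pipeline output format shown earlier:

```python
# Hypothetical post-processing: send uncertain predictions to human review.
REVIEW_THRESHOLD = 0.85  # assumed value; tune on a held-out set

def route(prediction: dict) -> str:
    """Map a classifier output like {'label': 'TOXIC', 'score': 0.62}
    to an action: 'allow', 'flag', or 'human_review'."""
    if prediction["score"] < REVIEW_THRESHOLD:
        return "human_review"
    return "flag" if prediction["label"] == "TOXIC" else "allow"
```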

πŸ”’ Dataset & Privacy

The training dataset is not publicly available for privacy and ethical reasons.

For research collaboration: πŸ“© [email protected]


πŸ“œ License

MIT License


πŸ™ Acknowledgments

  • DarijaBERT team at SI2M-Lab
  • Hugging Face Transformers ecosystem
  • PyTorch
  • The Moroccan NLP community

πŸ“š Citation

If you use this model in your research, please cite:

@misc{darija-toxicity-classifier,
  author = {Khacha, Mohamed},
  title = {Darija Toxicity Classifier},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/0khacha/darija-toxicity-classifier}
}

🀝 Contributing

Contributions, issues, and feature requests are welcome!

Feel free to check the issues page.


Made with ❀️ for the Moroccan NLP community

Model size: ~0.2B parameters (F32, Safetensors)
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for 0khacha/darija-toxicity-classifier

Finetuned
(2)
this model

Space using 0khacha/darija-toxicity-classifier 1