Darija Toxicity Classifier πŸ‡²πŸ‡¦

A transformer-based NLP model for detecting toxic content in Moroccan Darija and Arabizi.

This model is specifically designed to handle the linguistic complexity of Moroccan Darija, including Arabizi (Arabic written in Latin script, with digits standing in for Arabic letters), such as:

  • 3 β†’ ΨΉ
  • 7 β†’ Ψ­
  • 9 β†’ Ω‚

It also supports code-switched text mixing Darija, Arabic, French, English, and Tamazight.
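As a rough illustration, the digit substitutions above can be expressed as a simple character mapping. This is a hypothetical helper for readers unfamiliar with Arabizi, not part of the model's actual preprocessing — the model consumes raw Arabizi text directly:

```python
# Hypothetical mapping of common Arabizi digits to the Arabic letters
# they stand for (illustrative only; not used by the model itself).
ARABIZI_DIGITS = {"3": "ع", "7": "ح", "9": "ق"}

def transliterate_digits(text: str) -> str:
    """Replace Arabizi digits with the corresponding Arabic letters."""
    return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in text)
```

For example, `transliterate_digits("sa7a")` yields `"saحa"`.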


πŸ“Œ Model Overview

  • Model ID: 0khacha/darija-toxicity-classifier
  • Architecture: fine-tuned from SI2M-Lab/DarijaBERT-arabizi
  • Task: binary sequence classification (Safe / Toxic)
  • Framework: Hugging Face Transformers
  • Training data: 16,000+ labeled Moroccan Darija/Arabizi samples

πŸš€ Quick Inference (Transformers)

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="0khacha/darija-toxicity-classifier"
)

result = classifier("salam khouya")
print(result)
# Output: [{'label': 'SAFE', 'score': 0.9845}]

🧠 What Makes This Model Special?

🌍 Dialect-Aware

Built specifically for Moroccan linguistic patterns β€” not generic Arabic.

πŸ”’ Arabizi Handling

Understands numeric character substitutions like:

  • in3al
  • sa7a
  • 3likom

🧹 Custom Preprocessing

The model was trained with specialized normalization:

  • Lowercasing
  • Joining dash/underscore-split words (w-a-l-o → walo)
  • Joining spaced-out characters (n 3 a l → n3al)
  • Reducing character elongation (heeeey → hey)
  • Whitespace normalization
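The steps above can be sketched roughly as follows. This is a hypothetical reimplementation for illustration, not the exact training pipeline:

```python
import re

def normalize(text: str) -> str:
    """Sketch of the normalization steps described above (approximate)."""
    text = text.lower()                             # lowercasing
    text = re.sub(r"(?<=\w)[-_](?=\w)", "", text)   # w-a-l-o -> walo
    # join runs of single characters separated by spaces: "n 3 a l" -> "n3al"
    text = re.sub(r"\b(\w(?: \w)+)\b",
                  lambda m: m.group(1).replace(" ", ""), text)
    text = re.sub(r"(.)\1{2,}", r"\1", text)        # heeeey -> hey
    text = re.sub(r"\s+", " ", text).strip()        # whitespace normalization
    return text
```

Note the elongation rule here collapses any run of three or more identical characters to one; the real pipeline may be more conservative.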

πŸ“Š Performance

  • Accuracy: ~94%
  • F1-score: ~93%
  • Inference speed (GPU): ~50 ms

Note: Performance may vary depending on hardware and deployment setup.


πŸ“– Example Predictions

Example 1: Safe Content

Input:

"bghit nakol"

Output:

Safe (98.45%)

Example 2: Toxic Content

Input:

"rak stupid"

Output:

Toxic

⚠️ Limitations

  • May struggle with extremely rare slang
  • Context-dependent toxicity (sarcasm) may reduce accuracy
  • Not intended for legal decisions or fully automated moderation without human review
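Given the last point, a deployment might route low-confidence predictions to a human reviewer instead of acting on them automatically. A minimal sketch — the threshold value is an assumption to be tuned on a held-out set, and the label names follow the pipeline output format shown earlier:

```python
# Hypothetical post-processing: send uncertain predictions to human review.
REVIEW_THRESHOLD = 0.85  # assumed value; tune on a held-out set

def route(prediction: dict) -> str:
    """Map a classifier output like {'label': 'TOXIC', 'score': 0.62}
    to an action: 'allow', 'flag', or 'human_review'."""
    if prediction["score"] < REVIEW_THRESHOLD:
        return "human_review"
    return "flag" if prediction["label"] == "TOXIC" else "allow"
```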

πŸ”’ Dataset & Privacy

The training dataset is not publicly available for privacy and ethical reasons.

For research collaboration: πŸ“© [email protected]


πŸ“œ License

MIT License


πŸ™ Acknowledgments

  • DarijaBERT team at SI2M-Lab
  • Hugging Face Transformers ecosystem
  • PyTorch
  • The Moroccan NLP community

πŸ“š Citation

If you use this model in your research, please cite:

@misc{darija-toxicity-classifier,
  author = {Khacha, Mohamed},
  title = {Darija Toxicity Classifier},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/0khacha/darija-toxicity-classifier}
}

🀝 Contributing

Contributions, issues, and feature requests are welcome!

Feel free to check the issues page.


Made with ❀️ for the Moroccan NLP community

Model size: ~0.2B parameters (F32, Safetensors)
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for 0khacha/darija-toxicity-classifier

Finetuned
(2)
this model

Space using 0khacha/darija-toxicity-classifier 1