# COMP550 Toxicity Project
Clean Python pipeline for:
- Data prep and reproducible splits
- Classical baselines
- BiLSTM training
- Transformer fine-tuning
- Robustness testing (noise)
- Domain-shift evaluation
- Optional prompted LLM evaluation
## Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

All commands below assume:

```bash
source .venv/bin/activate
export PYTHONPATH=src
```
## 1) Prepare Data Splits

```bash
python3 -m toxicity.data.prepare_data \
  --raw-dir dataset \
  --output-dir data/processed \
  --seed 42
```

Outputs:

- `data/processed/train.csv`
- `data/processed/val.csv`
- `data/processed/test.csv`
- `data/processed/metadata.json`
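For reference, a seeded shuffle-then-slice is the standard way such a split stays reproducible across runs. A minimal stdlib sketch (the helper name and split fractions are hypothetical; the real logic lives in `toxicity.data.prepare_data`):

```python
import random

def split_rows(rows, seed=42, val_frac=0.1, test_frac=0.1):
    """Deterministically shuffle rows, then carve out test/val slices."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # same seed -> same order every run
    n = len(rows)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

train, val, test = split_rows(range(100), seed=42)
print(len(train), len(val), len(test))  # -> 80 10 10
```

Because the shuffle uses a private `random.Random(seed)` instance, the split is unaffected by any other randomness in the process.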
## 2) Train Classical Baselines

### TF-IDF + Logistic Regression

```bash
python3 -m toxicity.training.train_baseline \
  --model-type logreg \
  --train-csv data/processed/train.csv \
  --val-csv data/processed/val.csv \
  --test-csv data/processed/test.csv \
  --output-dir artifacts/baseline_tfidf_logreg
```

### TF-IDF + Linear SVM

```bash
python3 -m toxicity.training.train_baseline \
  --model-type linear_svm \
  --train-csv data/processed/train.csv \
  --val-csv data/processed/val.csv \
  --test-csv data/processed/test.csv \
  --output-dir artifacts/baseline_tfidf_linearsvm
```
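Conceptually, each baseline amounts to a scikit-learn TF-IDF-plus-linear-classifier pipeline fit per label. A toy sketch under that assumption (the feature settings and data here are illustrative, not the script's actual configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy binary "toxic" data; the real script trains one classifier per label.
texts = ["you are awful", "have a nice day", "awful terrible person", "nice work, thanks"]
labels = [1, 0, 1, 0]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),  # unigram + bigram features
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["what an awful thing to say"]))
```

Swapping `LogisticRegression` for `LinearSVC` gives the `linear_svm` variant; the vectorizer is unchanged.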
## 3) Train BiLSTM (GPU)

```bash
python3 -m toxicity.training.train_bilstm \
  --train-csv data/processed/train.csv \
  --val-csv data/processed/val.csv \
  --test-csv data/processed/test.csv \
  --epochs 5 \
  --batch-size 128 \
  --num-workers 4 \
  --output-dir artifacts/bilstm
```

Useful knobs:

- `--max-train-samples`, `--max-val-samples`, `--max-test-samples` for quick debug runs
- `--embedding-dim`, `--hidden-dim`, `--lr`, `--dropout`
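The knobs above map onto a fairly standard architecture: an embedding layer feeding a bidirectional LSTM whose two final hidden states are concatenated and projected to per-label logits. A minimal PyTorch sketch under that assumption (layer sizes and structure are hypothetical, not the script's exact model):

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, num_labels,
                 embedding_dim=128, hidden_dim=256, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * hidden_dim, num_labels)  # forward + backward states

    def forward(self, token_ids):
        x = self.embedding(token_ids)            # (B, T, E)
        _, (h_n, _) = self.lstm(x)               # h_n: (2, B, H) for one layer
        h = torch.cat([h_n[0], h_n[1]], dim=-1)  # (B, 2H)
        return self.fc(self.dropout(h))          # raw logits, one per label

model = BiLSTMClassifier(vocab_size=1000, num_labels=6)
logits = model(torch.randint(1, 1000, (4, 32)))  # batch of 4, length 32
print(logits.shape)  # torch.Size([4, 6])
```

For multi-label toxicity the logits would typically be trained with `nn.BCEWithLogitsLoss` and thresholded after a sigmoid.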
## 4) Fine-tune DistilBERT (GPU)

Prefer GPU explicitly and use bfloat16 on A100/Ampere (often faster and more stable than `--fp16`):

```bash
python3 -m toxicity.training.train_transformer \
  --device cuda \
  --model-name distilbert-base-uncased \
  --train-csv data/processed/train.csv \
  --val-csv data/processed/val.csv \
  --test-csv data/processed/test.csv \
  --epochs 2 \
  --train-batch-size 16 \
  --eval-batch-size 32 \
  --gradient-accumulation-steps 1 \
  --bf16 \
  --output-dir artifacts/distilbert
```
If `--device cuda` exits with "PyTorch does not see a usable GPU", your installed PyTorch build does not match the NVIDIA driver: pick a wheel from pytorch.org that fits `nvidia-smi`, or upgrade the driver.

Common trap: a plain `pip install torch` (even with `--index-url …/cu124`) can still resolve to `2.11.0+cu130`, which targets CUDA 13 and fails on drivers that only support 12.x. Fix by pinning a `+cu124` build, e.g. install from this repo's `requirements-gpu.txt`:

```bash
pip install -r requirements-gpu.txt --index-url https://download.pytorch.org/whl/cu124
```

Then `python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"` should show `+cu124` and `True`.

`--device auto` (the default) uses the GPU when `torch.cuda.is_available()` is true; `--fp16` and `--bf16` are mutually exclusive.
Model swap examples:

- `--model-name bert-base-uncased`
- `--model-name roberta-base`
## 5) Build Noisy Test Sets

```bash
python3 -m toxicity.eval.build_noisy_splits \
  --input-csv data/processed/test.csv \
  --output-dir data/processed/noisy \
  --noise-types misspelling leetspeak casing punctuation \
  --severity 0.20
```
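The noise types are simple character-level perturbations, with `--severity` controlling roughly what fraction of characters is touched. A stdlib sketch of two of them (hypothetical implementations; the actual ones live in `toxicity.eval.build_noisy_splits`):

```python
import random

LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}

def casing_noise(text, severity=0.2, seed=0):
    """Flip the case of roughly `severity` of the characters."""
    rng = random.Random(seed)
    return "".join(c.swapcase() if rng.random() < severity else c for c in text)

def leetspeak_noise(text, severity=0.2, seed=0):
    """Replace a fraction of leet-mappable characters with look-alike digits."""
    rng = random.Random(seed)
    return "".join(
        LEET[c.lower()] if c.lower() in LEET and rng.random() < severity else c
        for c in text
    )

print(leetspeak_noise("this is a test sentence", severity=1.0))
# -> th15 15 4 t35t 53nt3nc3
```

Seeding each perturbation makes the noisy CSVs themselves reproducible, which matters when comparing models on the same corrupted test set.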
## 6) Evaluate Saved Models (clean or noisy sets)

### Baseline checkpoint

```bash
python3 -m toxicity.eval.eval_saved_model \
  --model-type baseline \
  --checkpoint artifacts/baseline_tfidf_logreg/model.joblib \
  --input-csv data/processed/test.csv \
  --output-dir artifacts/eval/baseline_clean
```

### BiLSTM checkpoint

```bash
python3 -m toxicity.eval.eval_saved_model \
  --model-type bilstm \
  --checkpoint artifacts/bilstm/best_model.pt \
  --input-csv data/processed/noisy/test_casing_sev0.20.csv \
  --batch-size 128 \
  --output-dir artifacts/eval/bilstm_noisy_casing
```

### Transformer checkpoint

```bash
python3 -m toxicity.eval.eval_saved_model \
  --model-type transformer \
  --checkpoint artifacts/distilbert/best_model \
  --input-csv data/processed/noisy/test_leetspeak_sev0.20.csv \
  --batch-size 32 \
  --max-length 256 \
  --output-dir artifacts/eval/distilbert_noisy_leetspeak
```
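Since each eval run writes a metrics JSON, robustness reduces to diffing the clean and noisy scores for the same checkpoint. A stdlib sketch (the `micro_f1` key is an assumption; substitute whatever metric names your JSON actually contains):

```python
import json
from pathlib import Path

def robustness_drop(clean_dir, noisy_dir, key="micro_f1"):
    """Absolute drop in a metric between a clean and a noisy eval run."""
    clean = json.loads((Path(clean_dir) / "metrics.json").read_text())
    noisy = json.loads((Path(noisy_dir) / "metrics.json").read_text())
    return clean[key] - noisy[key]

# Demo against synthetic metrics files:
import tempfile
tmp = Path(tempfile.mkdtemp())
(tmp / "clean").mkdir(); (tmp / "noisy").mkdir()
(tmp / "clean" / "metrics.json").write_text(json.dumps({"micro_f1": 0.91}))
(tmp / "noisy" / "metrics.json").write_text(json.dumps({"micro_f1": 0.84}))
print(round(robustness_drop(tmp / "clean", tmp / "noisy"), 2))  # -> 0.07
```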
## 7) Domain-Shift Evaluation

Supported external datasets:

- `civil_comments`
- `tweet_eval_offensive`

```bash
python3 -m toxicity.eval.domain_shift \
  --model-type transformer \
  --checkpoint artifacts/distilbert/best_model \
  --dataset-name civil_comments \
  --split test \
  --sample-size 20000 \
  --output-dir artifacts/domain_shift/distilbert_civil_comments
```
Notes:

- Domain-shift scoring is computed on the mapped `toxic` label only.
- Output includes `domain_shift_metrics.json` and `domain_shift_predictions.csv`.
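The mapping step matters because civil_comments ships a continuous `toxicity` score rather than a binary label, so the script presumably binarizes it before scoring. A sketch of that mapping (the 0.5 cutoff is the conventional choice, not necessarily the one this repo uses):

```python
def map_civil_comments_label(example, threshold=0.5):
    """Binarize civil_comments' continuous `toxicity` score into a `toxic` label."""
    return {"text": example["text"], "toxic": int(example["toxicity"] >= threshold)}

print(map_civil_comments_label({"text": "be nice", "toxicity": 0.02}))
# -> {'text': 'be nice', 'toxic': 0}
```

The other fine-grained columns of civil_comments (insult, threat, etc.) have no exact counterpart in this project's label set, which is why only the mapped `toxic` label is scored.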
## 8) Optional Prompted LLM Evaluation

`llm_prompt_eval` loads any causal LM from Hugging Face (`--model-name`). It uses the tokenizer's chat template when the model defines one, which matches Qwen 3.5 and Gemma 4 IT-style checkpoints.

- Auth: Gemma weights are gated — run `huggingface-cli login` and accept the license on the model page first.
- VRAM: large models need a big GPU; use `--torch-dtype bfloat16` (or `float16`) on CUDA to save memory.
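Prompted classification stands or falls on turning free-form completions into labels, since chat models rarely answer with a bare token. A sketch of the kind of lenient parser such a script needs (hypothetical helper, not the module's actual code):

```python
import re

def parse_verdict(completion):
    """Map a model completion to 1 (toxic), 0 (non-toxic), or None (unparseable)."""
    text = completion.strip().lower()
    # Match the first yes/no-style token so a trailing explanation can't flip the label.
    match = re.search(r"\b(yes|no|toxic|non-toxic|not toxic)\b", text)
    if match is None:
        return None
    return 0 if match.group(1) in ("no", "non-toxic", "not toxic") else 1

print(parse_verdict("Yes, this comment is clearly toxic."))  # -> 1
print(parse_verdict("No. The text is a polite question."))   # -> 0
```

Returning `None` for unparseable completions lets the evaluation report a refusal/format-failure rate separately instead of silently miscounting those rows.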
Qwen 2.5 (small baseline):

```bash
python3 -m toxicity.eval.llm_prompt_eval \
  --input-csv data/processed/test.csv \
  --model-name Qwen/Qwen2.5-3B-Instruct \
  --max-samples 2000 \
  --torch-dtype bfloat16 \
  --output-dir artifacts/llm_prompt_eval/qwen2.5-3b
```

Qwen 3.5 (official `Qwen/*` chat models — pick a size):

```bash
python3 -m toxicity.eval.llm_prompt_eval \
  --input-csv data/processed/test.csv \
  --model-name Qwen/Qwen3.5-4B \
  --max-samples 2000 \
  --torch-dtype bfloat16 \
  --output-dir artifacts/llm_prompt_eval/qwen3.5-4b
```

Other common IDs: `Qwen/Qwen3.5-9B`, `Qwen/Qwen3.5-27B`, `Qwen/Qwen3.5-35B-A3B` (MoE; heavier).

Gemma 4 (instruct-tuned `-it` checkpoints):

```bash
python3 -m toxicity.eval.llm_prompt_eval \
  --input-csv data/processed/test.csv \
  --model-name google/gemma-4-26B-A4B-it \
  --max-samples 2000 \
  --torch-dtype bfloat16 \
  --output-dir artifacts/llm_prompt_eval/gemma4-26b-a4b-it
```

Other examples: `google/gemma-4-31B-it` (dense, needs a large GPU), `google/gemma-4-E4B-it` / `google/gemma-4-E2B-it` (smaller MoE variants — search `google/gemma-4` on Hugging Face for IDs and license terms).
## 9) Run Main Pipeline in One Command

```bash
bash scripts/run_all_trainings.sh
```
## Output Structure

Each training/eval folder writes:

- Metrics JSON (`metrics_val.json`, `metrics_test.json`, or `metrics.json`)
- Prediction CSV with per-label probabilities and binary predictions
- Config/args JSON used for that run
## Compare Artifact Metrics (Plots)

Aggregate every `metrics_test.json`, `metrics_val.json`, and `metrics.json` under `artifacts/` and write comparison charts plus a CSV:

```bash
python3 -m toxicity.eval.plot_artifact_comparison \
  --artifacts-dir artifacts \
  --output-dir artifacts/plots
```

Outputs in `--output-dir`:

- `metrics_summary.csv` — one row per run (micro/macro F1, ROC-AUC, paths)
- `comparison_f1_bars.png` — horizontal bars for micro and macro F1
- `comparison_roc_auc_bars.png` — micro vs macro ROC-AUC side by side
- `comparison_per_label_f1_heatmap.png` — labels × runs F1 heatmap
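The aggregation step amounts to globbing for metrics files under `artifacts/` and flattening each into a summary row. A stdlib-only sketch of the idea (the real module additionally draws the charts):

```python
import json
from pathlib import Path

def collect_metrics(artifacts_dir):
    """One summary row per metrics JSON found anywhere under artifacts_dir."""
    rows = []
    for path in sorted(Path(artifacts_dir).rglob("metrics*.json")):
        metrics = json.loads(path.read_text())
        # Run name taken from the parent folder, e.g. artifacts/bilstm/metrics_test.json
        rows.append({"run": path.parent.name, "file": path.name, **metrics})
    return rows

# Demo against a synthetic artifacts tree:
import tempfile
root = Path(tempfile.mkdtemp())
(root / "bilstm").mkdir()
(root / "bilstm" / "metrics_test.json").write_text(json.dumps({"micro_f1": 0.88}))
print(collect_metrics(root))
# -> [{'run': 'bilstm', 'file': 'metrics_test.json', 'micro_f1': 0.88}]
```

The `metrics*.json` pattern matches all three filenames listed above, and sorting the paths keeps the CSV row order stable between runs.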