# Modality Router LoRA - Smart Output Modality Selection

Part of the MoM (Mixture of Models) family for vLLM Semantic Router.

A LoRA adapter fine-tuned on mmbert-32k-yarn (a 307M-parameter ModernBERT with a 32K context window and support for 1800+ languages) that classifies user prompt intent into the appropriate response modality:
| Label | Description | Routed To | Example |
|---|---|---|---|
| AR | Text-only response | Autoregressive LLM (e.g., Llama, Qwen) | "What is the capital of France?" |
| DIFFUSION | Image generation | Diffusion model (e.g., Flux, SDXL) | "A cyberpunk city at night, neon lights" |
| BOTH | Text + image response | Both AR + Diffusion pipeline | "Explain photosynthesis and show a diagram" |
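The routing decision implied by this table can be sketched as a simple dispatch. This is an illustrative sketch, not part of this repository; the backend names are hypothetical placeholders:

```python
# Minimal routing sketch: map the classifier's predicted label to the
# backend(s) that should handle the request. Backend names ("llm",
# "diffusion") are illustrative placeholders.
def route(label: str) -> list[str]:
    backends = {
        "AR": ["llm"],                 # text-only: autoregressive LLM
        "DIFFUSION": ["diffusion"],    # image-only: diffusion model
        "BOTH": ["llm", "diffusion"],  # text + image: run both pipelines
    }
    return backends[label]

print(route("AR"))    # ['llm']
print(route("BOTH"))  # ['llm', 'diffusion']
```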
## Usage

### With PEFT (LoRA adapter)

```python
import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load base model + LoRA adapter
base_model = AutoModelForSequenceClassification.from_pretrained(
    "llm-semantic-router/mmbert-32k-yarn", num_labels=3
)
model = PeftModel.from_pretrained(base_model, "llm-semantic-router/mmbert32k-modality-router-lora")
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert32k-modality-router-lora")

# Label mapping
labels = {0: "AR", 1: "DIFFUSION", 2: "BOTH"}

# Classify prompts
prompts = [
    "What are the benefits of exercise?",
    "A serene Japanese garden with cherry blossoms, watercolor style",
    "Explain how neural networks work and generate a diagram showing the architecture",
]

model.eval()
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred = torch.argmax(probs).item()
    print(f"Prompt: {prompt[:60]}...")
    print(f"  -> {labels[pred]} (confidence: {probs[pred]:.3f})")
    print()
```
### Use the merged model (recommended for production)

For easier deployment without the PEFT dependency, use the merged version: llm-semantic-router/mmbert32k-modality-router-merged

```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="llm-semantic-router/mmbert32k-modality-router-merged",
)

result = pipe("Draw a picture of a sunset over mountains")
print(result)  # [{'label': 'DIFFUSION', 'score': 0.97}]
```
## Model Details

### Architecture

- Base model: llm-semantic-router/mmbert-32k-yarn (307M parameters, ModernBERT + YaRN RoPE)
- Context length: 32,768 tokens
- Languages: 1800+ (via the Gemma 2 tokenizer with a 256K vocabulary)
- Adaptation: LoRA (Low-Rank Adaptation) via PEFT
  - LoRA rank: 16
  - LoRA alpha: 32
  - LoRA dropout: 0.1
  - Target modules: attn.Wqkv, attn.Wo, mlp.Wi, mlp.Wo
  - Modules saved: classifier, score
- Task type: sequence classification (3 classes)
- Trainable parameters: ~3.4M (1.09% of the total 310M)
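As a back-of-the-envelope illustration of why the adapter is so small, LoRA's rank-r factorization trains only r*(d_in + d_out) parameters per target matrix instead of the full d_out*d_in. The hidden size below is a hypothetical example, not mmBERT's actual dimension:

```python
# LoRA replaces the update to a frozen (d_out x d_in) weight with B @ A,
# where A is (r x d_in) and B is (d_out x r), so only r * (d_in + d_out)
# parameters are trained per target matrix.
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    full = d_out * d_in        # parameters in the full weight matrix
    lora = r * (d_in + d_out)  # parameters in the low-rank factors
    return full, lora

# Hypothetical 768-dim square projection with the rank listed above (r=16)
full, lora = lora_param_counts(768, 768, 16)
print(f"full: {full}, lora: {lora}, fraction: {lora / full:.4f}")  # fraction: 0.0417
```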
### Training
- Epochs: 10
- Batch size: 32
- Learning rate: 2e-5
- Weight decay: 0.15 (adaptive)
- Loss function: Focal Loss (gamma=2.0) with inverse-frequency class weights
- Class imbalance handling: Focal Loss + sqrt-dampened class weights + minority oversampling
- Hardware: AMD Instinct MI300X GPU (192GB VRAM)
- Training time: ~2 minutes
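As a hedged sketch of the loss described above (not the repository's actual training code), focal loss down-weights easy, confidently-correct examples via a (1 - p_t)^gamma factor, keeping gradient signal focused on hard and minority-class examples:

```python
import math

# Focal loss for a single example: FL = -w_c * (1 - p_t)^gamma * log(p_t),
# where p_t is the predicted probability of the true class and w_c is an
# (inverse-frequency) class weight. gamma=2.0 as listed above.
def focal_loss(logits, target, class_weights, gamma=2.0):
    # numerically stable softmax over raw logits
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    p_t = exps[target] / sum(exps)
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction contributes far less than an uncertain one.
easy = focal_loss([4.0, 0.0, 0.0], target=0, class_weights=[1.0, 1.0, 1.0])
hard = focal_loss([0.5, 0.4, 0.3], target=0, class_weights=[1.0, 1.0, 1.0])
print(easy < hard)  # True
```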
### Training Data

The model was trained on a curated combination of 10 public datasets plus seed examples:
#### DIFFUSION class (image generation intent)
| Dataset | Size | Description |
|---|---|---|
| Gustavosta/Stable-Diffusion-Prompts | 80K | Curated Stable Diffusion prompts |
| FredZhang7/stable-diffusion-prompts-2.47M | 2.47M | Large-scale SD prompt collection |
| nateraw/parti-prompts | 1.6K | Google Parti benchmark prompts |
| fal/image-generation-prompts | 1K+ | Diverse image generation prompts |
| allenai/WildChat (mined) | - | Real user prompts with image-generation intent |
#### AR class (text-only intent)
| Dataset | Size | Description |
|---|---|---|
| OpenAssistant/oasst2 | 135K | Multilingual instruction conversations |
| tatsu-lab/alpaca | 52K | Stanford instruction-following |
| databricks/databricks-dolly-15k | 15K | Categorized instructions |
| stingning/ultrachat | 1.5M | Multi-turn conversations |
| allenai/WildChat (mined) | - | Real user text-only prompts |
#### BOTH class (mixed modality intent)
| Dataset | Size | Description |
|---|---|---|
| mqliu/InterleavedBench | 7K+ | Gold-standard interleaved text+image prompts (EMNLP 2024) |
| allenai/WildChat (mined) | - | Real user multimodal prompts |
| Seed examples | 40+ | Curated diverse domain examples |
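The minority-oversampling step mentioned under Training can be sketched as follows; this is an illustrative reconstruction, not the repository's actual preprocessing code:

```python
import random

# Hypothetical sketch of minority oversampling: repeat examples of
# under-represented classes until every class matches the largest one.
def oversample(examples: list[tuple[str, str]], seed: int = 0) -> list[tuple[str, str]]:
    rng = random.Random(seed)
    by_label: dict[str, list] = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # sample with replacement to fill the gap to the largest class
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced

data = [("a", "AR")] * 5 + [("b", "BOTH")] * 2
print(len(oversample(data)))  # 10: BOTH upsampled from 2 to 5
```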
## Evaluation Results
| Metric | Value |
|---|---|
| Accuracy | 0.9686 |
| F1 (weighted) | 0.9686 |
| Eval Loss | 0.0435 |
### Per-class Performance
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| AR | 0.956 | 0.967 | 0.962 |
| DIFFUSION | 0.974 | 0.979 | 0.977 |
| BOTH | 0.983 | 0.951 | 0.967 |
## Intended Use
This model is designed for routing LLM requests in multi-model serving systems like vLLM Semantic Router. It enables:
- Smart Output Modality Selection: Automatically determine whether a user query needs text, image, or both
- Automatic Paradigm Routing: Route requests to the right model backend (AR LLM vs Diffusion model)
- Cost Optimization: Avoid sending simple text queries to expensive image generation pipelines
## Out-of-Scope Use
- Not suitable for content moderation or safety classification
- Not designed for multi-turn conversation context (single-turn prompt classification only)
- May have reduced accuracy for very short or ambiguous prompts
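Given the reduced accuracy on short or ambiguous prompts noted above, a deployment might apply a confidence threshold and fall back to the text-only (AR) path when the classifier is unsure. A minimal sketch with a hypothetical threshold value:

```python
# Hypothetical fallback policy: accept the predicted modality only when the
# classifier is confident; otherwise default to the cheaper text-only (AR)
# path. The 0.7 threshold is an illustrative choice, not a tuned value.
def select_modality(probs: dict[str, float], threshold: float = 0.7) -> str:
    label = max(probs, key=probs.get)
    if probs[label] < threshold:
        return "AR"  # low confidence: default to the text-only backend
    return label

print(select_modality({"AR": 0.40, "DIFFUSION": 0.35, "BOTH": 0.25}))  # AR
print(select_modality({"AR": 0.02, "DIFFUSION": 0.95, "BOTH": 0.03}))  # DIFFUSION
```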
## Citation

```bibtex
@misc{modality-router-2025,
  title={Modality Router: Smart Output Modality Selection for Multi-Model Serving},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert32k-modality-router-lora}
}
```
## Framework Versions
- PEFT: 0.18.1
- Transformers: 4.57.6
- PyTorch: 2.9.1
- Python: 3.12