# Modality Router LoRA - Smart Output Modality Selection

Part of the MoM (Mixture of Models) family for vLLM Semantic Router.

A LoRA adapter fine-tuned on mmbert-32k-yarn (a 307M-parameter ModernBERT with a 32K context window and support for 1800+ languages) that classifies user prompt intent into the appropriate response modality:
| Label | Description | Routed To | Example |
|---|---|---|---|
| AR | Text-only response | Autoregressive LLM (e.g., Llama, Qwen) | "What is the capital of France?" |
| DIFFUSION | Image generation | Diffusion model (e.g., Flux, SDXL) | "A cyberpunk city at night, neon lights" |
| BOTH | Text + image response | Both AR + Diffusion pipeline | "Explain photosynthesis and show a diagram" |
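The routing decision implied by this table can be sketched as a simple dispatch. This is an illustrative sketch, not part of this repository; the backend names are hypothetical placeholders:

```python
# Minimal routing sketch: map the classifier's predicted label to the
# backend(s) that should handle the request. Backend names ("llm",
# "diffusion") are illustrative placeholders.
def route(label: str) -> list[str]:
    backends = {
        "AR": ["llm"],                 # text-only: autoregressive LLM
        "DIFFUSION": ["diffusion"],    # image-only: diffusion model
        "BOTH": ["llm", "diffusion"],  # text + image: run both pipelines
    }
    return backends[label]

print(route("AR"))    # ['llm']
print(route("BOTH"))  # ['llm', 'diffusion']
```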
## Usage

### With PEFT (LoRA adapter)

```python
import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load base model + LoRA adapter
base_model = AutoModelForSequenceClassification.from_pretrained(
    "llm-semantic-router/mmbert-32k-yarn", num_labels=3
)
model = PeftModel.from_pretrained(base_model, "llm-semantic-router/mmbert32k-modality-router-lora")
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert32k-modality-router-lora")

# Label mapping
labels = {0: "AR", 1: "DIFFUSION", 2: "BOTH"}

# Classify prompts
prompts = [
    "What are the benefits of exercise?",
    "A serene Japanese garden with cherry blossoms, watercolor style",
    "Explain how neural networks work and generate a diagram showing the architecture",
]

model.eval()
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred = torch.argmax(probs).item()
    print(f"Prompt: {prompt[:60]}...")
    print(f"  -> {labels[pred]} (confidence: {probs[pred]:.3f})")
    print()
```
### Use the merged model (recommended for production)

For easier deployment without the PEFT dependency, use the merged version: llm-semantic-router/mmbert32k-modality-router-merged

```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="llm-semantic-router/mmbert32k-modality-router-merged",
)

result = pipe("Draw a picture of a sunset over mountains")
print(result)  # [{'label': 'DIFFUSION', 'score': 0.97}]
```
## Model Details

### Architecture

- Base model: llm-semantic-router/mmbert-32k-yarn (307M parameters, ModernBERT + YaRN RoPE)
- Context length: 32,768 tokens
- Languages: 1800+ (via the Gemma 2 tokenizer with a 256K vocabulary)
- Adaptation: LoRA (Low-Rank Adaptation) via PEFT
  - LoRA rank: 16
  - LoRA alpha: 32
  - LoRA dropout: 0.1
  - Target modules: attn.Wqkv, attn.Wo, mlp.Wi, mlp.Wo
  - Modules saved: classifier, score
- Task type: sequence classification (3 classes)
- Trainable parameters: ~3.4M (1.09% of the total 310M)
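As a back-of-the-envelope illustration of why the adapter is so small, LoRA's rank-r factorization trains only r*(d_in + d_out) parameters per target matrix instead of the full d_out*d_in. The hidden size below is a hypothetical example, not mmBERT's actual dimension:

```python
# LoRA replaces the update to a frozen (d_out x d_in) weight with B @ A,
# where A is (r x d_in) and B is (d_out x r), so only r * (d_in + d_out)
# parameters are trained per target matrix.
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    full = d_out * d_in        # parameters in the full weight matrix
    lora = r * (d_in + d_out)  # parameters in the low-rank factors
    return full, lora

# Hypothetical 768-dim square projection with the rank listed above (r=16)
full, lora = lora_param_counts(768, 768, 16)
print(f"full: {full}, lora: {lora}, fraction: {lora / full:.4f}")  # fraction: 0.0417
```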
### Training
- Epochs: 10
- Batch size: 32
- Learning rate: 2e-5
- Weight decay: 0.15 (adaptive)
- Loss function: Focal Loss (gamma=2.0) with inverse-frequency class weights
- Class imbalance handling: Focal Loss + sqrt-dampened class weights + minority oversampling
- Hardware: AMD Instinct MI300X GPU (192GB VRAM)
- Training time: ~2 minutes
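As a hedged sketch of the loss described above (not the repository's actual training code), focal loss down-weights easy, confidently-correct examples via a (1 - p_t)^gamma factor, keeping gradient signal focused on hard and minority-class examples:

```python
import math

# Focal loss for a single example: FL = -w_c * (1 - p_t)^gamma * log(p_t),
# where p_t is the predicted probability of the true class and w_c is an
# (inverse-frequency) class weight. gamma=2.0 as listed above.
def focal_loss(logits, target, class_weights, gamma=2.0):
    # numerically stable softmax over raw logits
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    p_t = exps[target] / sum(exps)
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction contributes far less than an uncertain one.
easy = focal_loss([4.0, 0.0, 0.0], target=0, class_weights=[1.0, 1.0, 1.0])
hard = focal_loss([0.5, 0.4, 0.3], target=0, class_weights=[1.0, 1.0, 1.0])
print(easy < hard)  # True
```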
### Training Data

The model was trained on a curated combination of 10 public datasets plus seed examples:
#### DIFFUSION class (image generation intent)
| Dataset | Size | Description |
|---|---|---|
| Gustavosta/Stable-Diffusion-Prompts | 80K | Curated Stable Diffusion prompts |
| FredZhang7/stable-diffusion-prompts-2.47M | 2.47M | Large-scale SD prompt collection |
| nateraw/parti-prompts | 1.6K | Google Parti benchmark prompts |
| fal/image-generation-prompts | 1K+ | Diverse image generation prompts |
| allenai/WildChat (mined) | - | Real user prompts with image-generation intent |
#### AR class (text-only intent)
| Dataset | Size | Description |
|---|---|---|
| OpenAssistant/oasst2 | 135K | Multilingual instruction conversations |
| tatsu-lab/alpaca | 52K | Stanford instruction-following |
| databricks/databricks-dolly-15k | 15K | Categorized instructions |
| stingning/ultrachat | 1.5M | Multi-turn conversations |
| allenai/WildChat (mined) | - | Real user text-only prompts |
#### BOTH class (mixed modality intent)
| Dataset | Size | Description |
|---|---|---|
| mqliu/InterleavedBench | 7K+ | Gold-standard interleaved text+image prompts (EMNLP 2024) |
| allenai/WildChat (mined) | - | Real user multimodal prompts |
| Seed examples | 40+ | Curated diverse domain examples |
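The minority-oversampling step mentioned under Training can be sketched as follows; this is an illustrative reconstruction, not the repository's actual preprocessing code:

```python
import random

# Hypothetical sketch of minority oversampling: repeat examples of
# under-represented classes until every class matches the largest one.
def oversample(examples: list[tuple[str, str]], seed: int = 0) -> list[tuple[str, str]]:
    rng = random.Random(seed)
    by_label: dict[str, list] = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # sample with replacement to fill the gap to the largest class
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced

data = [("a", "AR")] * 5 + [("b", "BOTH")] * 2
print(len(oversample(data)))  # 10: BOTH upsampled from 2 to 5
```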
## Evaluation Results
| Metric | Value |
|---|---|
| Accuracy | 0.9686 |
| F1 (weighted) | 0.9686 |
| Eval Loss | 0.0435 |
### Per-class Performance
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| AR | 0.956 | 0.967 | 0.962 |
| DIFFUSION | 0.974 | 0.979 | 0.977 |
| BOTH | 0.983 | 0.951 | 0.967 |
## Intended Use
This model is designed for routing LLM requests in multi-model serving systems like vLLM Semantic Router. It enables:
- Smart Output Modality Selection: Automatically determine whether a user query needs text, image, or both
- Automatic Paradigm Routing: Route requests to the right model backend (AR LLM vs Diffusion model)
- Cost Optimization: Avoid sending simple text queries to expensive image generation pipelines
## Out-of-Scope Use
- Not suitable for content moderation or safety classification
- Not designed for multi-turn conversation context (single-turn prompt classification only)
- May have reduced accuracy for very short or ambiguous prompts
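Given the reduced accuracy on short or ambiguous prompts noted above, a deployment might apply a confidence threshold and fall back to the text-only (AR) path when the classifier is unsure. A minimal sketch with a hypothetical threshold value:

```python
# Hypothetical fallback policy: accept the predicted modality only when the
# classifier is confident; otherwise default to the cheaper text-only (AR)
# path. The 0.7 threshold is an illustrative choice, not a tuned value.
def select_modality(probs: dict[str, float], threshold: float = 0.7) -> str:
    label = max(probs, key=probs.get)
    if probs[label] < threshold:
        return "AR"  # low confidence: default to the text-only backend
    return label

print(select_modality({"AR": 0.40, "DIFFUSION": 0.35, "BOTH": 0.25}))  # AR
print(select_modality({"AR": 0.02, "DIFFUSION": 0.95, "BOTH": 0.03}))  # DIFFUSION
```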
## Citation

```bibtex
@misc{modality-router-2025,
  title={Modality Router: Smart Output Modality Selection for Multi-Model Serving},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert32k-modality-router-lora}
}
```
## Framework Versions
- PEFT: 0.18.1
- Transformers: 4.57.6
- PyTorch: 2.9.1
- Python: 3.12