DeepGen 1.0 (Diffusers Format): A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
This is the diffusers-compatible version of DeepGen-1.0. The model weights are stored in safetensors format alongside a self-contained pipeline script (deepgen_pipeline.py), so there is no need to clone the DeepGen repository.
DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities (general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering) within a single model. Across the benchmarks reported below, DeepGen 1.0 is competitive with or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger.
Quick Start
Installation
pip install torch diffusers transformers safetensors einops accelerate huggingface_hub
# Flash Attention (recommended)
pip install flash-attn --no-build-isolation
Load Pipeline
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")
# Optional: enable CPU offload for GPUs with limited memory (< 24GB)
# pipe.enable_model_cpu_offload()
Text-to-Image
result = pipe(
    prompt="a racoon holding a shiny red apple over its head",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")
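Because `seed` fixes the random state, sweeping a few seeds is a simple way to compare variations of one prompt. The helper below is a minimal sketch and not part of the DeepGen pipeline: `generate_variations` is a hypothetical name, and it relies only on the call signature and the `result.images` attribute shown in the example above.

```python
def generate_variations(pipe, prompt, seeds, **overrides):
    """Run the same prompt once per seed and return {seed: image}.

    `pipe` is any callable with the call signature used in the example
    above; this helper is illustrative and not part of deepgen_pipeline.py.
    """
    # Defaults mirror the text-to-image example.
    kwargs = dict(height=512, width=512, num_inference_steps=50, guidance_scale=4.0)
    kwargs.update(overrides)

    images = {}
    for seed in seeds:
        result = pipe(prompt=prompt, seed=seed, **kwargs)
        images[seed] = result.images[0]
    return images
```

For example, `generate_variations(pipe, "a red apple on a table", seeds=[0, 1, 2])` would return three images keyed by seed, each of which can be written out with `.save(...)`.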
Image Editing
from PIL import Image
source_image = Image.open("guitar.png").convert("RGB")
result = pipe(
    prompt="Take a photo of this guitar placed on a sandy beach with the sunset in the background.",
    image=source_image,
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")
Parameters
| Parameter | Default | Description |
|---|---|---|
| `prompt` | required | Text prompt for generation or editing |
| `image` | `None` | Input image for editing. If `None`, performs text-to-image generation |
| `height` | 512 | Output image height (pixels) |
| `width` | 512 | Output image width (pixels) |
| `num_inference_steps` | 50 | Number of denoising steps |
| `guidance_scale` | 4.0 | Classifier-free guidance scale |
| `seed` | `None` | Random seed for reproducibility |
| `negative_prompt` | `""` | Negative prompt for CFG |
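`guidance_scale` and `negative_prompt` both relate to classifier-free guidance (CFG). The sketch below shows the textbook CFG combination on plain Python lists, where the unconditional branch would be driven by the negative prompt. Whether deepgen_pipeline.py uses exactly this form is an assumption, and `cfg_combine` is a hypothetical name.

```python
def cfg_combine(uncond, cond, guidance_scale):
    """Textbook classifier-free guidance: uncond + s * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# With scale 1.0 the result equals the conditional prediction;
# larger scales push further away from the unconditional one.
print(cfg_combine([0.0, 1.0], [1.0, 1.0], 1.0))  # [1.0, 1.0]
print(cfg_combine([0.0, 1.0], [1.0, 1.0], 4.0))  # [4.0, 1.0]
```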
Memory Requirements
| Mode | VRAM |
|---|---|
| Full GPU | ~20 GB |
| CPU Offload (`pipe.enable_model_cpu_offload()`) | ~14 GB |
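The figures above can be turned into a simple loading decision. This is a rough sketch: the 24 GB cutoff comes from the comment in the loading example, the 14 GB floor from the table, and both are approximate. `choose_mode` and its return strings are hypothetical names.

```python
FULL_GPU_MIN_GB = 24     # loading example suggests offload below 24 GB
CPU_OFFLOAD_MIN_GB = 14  # approximate CPU-offload figure from the table


def choose_mode(total_vram_gb):
    """Pick a loading mode from total GPU memory in GB (approximate thresholds)."""
    if total_vram_gb >= FULL_GPU_MIN_GB:
        return "full_gpu"
    if total_vram_gb >= CPU_OFFLOAD_MIN_GB:
        return "cpu_offload"
    return "insufficient"
```

With PyTorch installed, `torch.cuda.get_device_properties(0).total_memory / 1e9` gives the total in GB. For `"cpu_offload"`, call `pipe.enable_model_cpu_offload()` in place of `pipe.to("cuda")`.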
Directory Structure
DeepGen-1.0-diffusers/
├── transformer/            # SD3 DiT weights (safetensors)
├── vae/                    # AutoencoderKL weights
├── connector/              # SCB Connector weights + config
├── scheduler/              # FlowMatchEulerDiscreteScheduler config
├── tokenizer/              # Qwen2.5-VL tokenizer
├── prompt_template.json    # Prompt formatting template
├── model_index.json        # Model metadata
└── deepgen_pipeline.py     # Self-contained pipeline script
Note: The VLM (Qwen2.5-VL-3B-Instruct) is loaded separately from Qwen/Qwen2.5-VL-3B-Instruct. You can override the VLM path using the `vlm_model_path` parameter in `from_pretrained()`.
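A minimal sketch of passing the override follows. Only `vlm_model_path` comes from the note above; `build_load_kwargs` and `load_pipeline` are hypothetical helper names, and whether the parameter expects a local directory or a Hub repository ID is not stated here.

```python
def build_load_kwargs(vlm_model_path=None):
    """Keyword arguments for DiffusionPipeline.from_pretrained().

    `vlm_model_path` is only passed when set, so the default VLM
    location is used otherwise.
    """
    kwargs = {"trust_remote_code": True}
    if vlm_model_path is not None:
        kwargs["vlm_model_path"] = vlm_model_path
    return kwargs


def load_pipeline(vlm_model_path=None):
    # Imported lazily so the kwargs helper works without torch installed.
    import torch
    from diffusers import DiffusionPipeline

    return DiffusionPipeline.from_pretrained(
        "deepgenteam/DeepGen-1.0-diffusers",
        torch_dtype=torch.bfloat16,
        **build_load_kwargs(vlm_model_path),
    )
```

For example, `pipe = load_pipeline("/path/to/Qwen2.5-VL-3B-Instruct")`, where the path is a placeholder for wherever the VLM weights live.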
Method
Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.
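To make the "stacked channel" idea concrete, here is a toy sketch in plain Python. It shows one plausible reading only: per-token features from several VLM layers are concatenated along the channel axis before being handed to the connector. The layer selection, the treatment of think tokens as ordinary sequence positions, and the name `stack_channels` are all assumptions, and the actual SCB connector (a 6-layer Transformer, per the table below) is not reproduced.

```python
def stack_channels(hidden_states, layer_ids):
    """Concatenate per-token features from the chosen layers along channels.

    hidden_states[l][t] is the feature vector (list of floats) of token t
    at layer l. Illustrative only; not the DeepGen implementation.
    """
    num_tokens = len(hidden_states[layer_ids[0]])
    stacked = []
    for t in range(num_tokens):
        row = []
        for layer in layer_ids:
            row.extend(hidden_states[layer][t])
        stacked.append(row)
    return stacked


# Toy input: 3 layers, 2 tokens (one prompt token, one "think token"),
# 2 channels per token.
hidden_states = [
    [[0.0, 0.1], [0.2, 0.3]],
    [[1.0, 1.1], [1.2, 1.3]],
    [[2.0, 2.1], [2.2, 2.3]],
]
features = stack_channels(hidden_states, layer_ids=[0, 2])
print(len(features), len(features[0]))  # 2 4
```

Stacking two layers doubles the channel width while leaving the sequence length unchanged, which is why a projection or connector stage is needed before the DiT can consume the result.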
| Component | Parameters | Description |
|---|---|---|
| VLM (Qwen2.5-VL-3B) | 3B | Visual Language Model for understanding prompts and reference images |
| Connector (SCB) | ~0.8B | 6-layer Transformer bridging VLM hidden states to DiT conditioning |
| DiT (SD3.5M Kontext) | 2B | Diffusion Transformer for image generation |
| VAE | ~80M | Image encoder/decoder |
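As a back-of-the-envelope check, the component sizes can be summed. The table suggests the headline 5B figure counts the VLM and DiT only. The sketch assumes every component is held in bfloat16 (2 bytes per parameter), which leaves the rest of the ~20 GB full-GPU figure for activations and runtime overhead.

```python
# Approximate component sizes from the table, in billions of parameters.
components = {"vlm": 3.0, "connector": 0.8, "dit": 2.0, "vae": 0.08}

headline_b = components["vlm"] + components["dit"]  # the "5B" figure
total_b = sum(components.values())                  # all listed components

# Assumption: all weights stored in bfloat16, i.e. 2 GB per billion parameters.
BYTES_PER_PARAM_BF16 = 2
weights_gb = total_b * BYTES_PER_PARAM_BF16

print(f"headline: {headline_b:.1f}B, all components: {total_b:.2f}B")
# headline: 5.0B, all components: 5.88B
print(f"bf16 weights: ~{weights_gb:.1f} GB")
# bf16 weights: ~11.8 GB
```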
Benchmarks
1. General Image Generation
| Model | Params | GenEval ↑ | DPGBench ↑ | UniGenBench ↑ |
|---|---|---|---|---|
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 🥉 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 🥇 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | - |
| Qwen-Image | 7B + 20B | 0.87 🥈 | 88.32 🥇 | 78.81 🥇 |
| LongCat-Image | 7B + 6B | 0.87 🥈 | 86.80 | - |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | - | 84.78 | - |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.86 🥉 | 87.05 | 74.18 🥉 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.87 🥈 | 87.90 🥈 | 75.74 🥈 |
2. General Image Editing
| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
|---|---|---|---|
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 🥈 | 4.35 🥈 |
| LongCat-Image-Edit | 7B + 6B | 7.60 🥇 | 4.50 🥇 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| DeepGen 1.0 (SFT) | 3B + 2B | 7.12 | 4.09 |
| DeepGen 1.0 (RL) | 3B + 2B | 7.17 🥉 | 4.14 🥉 |
3. Reasoning Image Generation
| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
|---|---|---|---|
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 🥉 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 🥉 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 🥇 |
| Z-Image-Turbo | 4B + 6B | - | 43.7 |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.72 🥈 | 45.7 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.73 🥇 | 46.5 🥈 |
4. Reasoning Image Editing
| Model | Params | RISE ↑ | UniREditBench ↑ |
|---|---|---|---|
| OmniGen2 | 3B + 4B | - | 43.4 |
| BAGEL | 14B | 11.9 🥈 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 🥉 |
| DeepGen 1.0 (SFT) | 3B + 2B | 13.3 🥇 | 77.5 🥇 |
| DeepGen 1.0 (RL) | 3B + 2B | 10.8 🥉 | 75.7 🥈 |
Citation
@article{wang2026deepgen,
title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
journal={arXiv preprint arXiv:2602.12205},
year={2026}
}
License
Apache 2.0