---
license: mit
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
tags:
- self-reward
- unsupervised-learning
- code
pipeline_tag: image-text-to-text
library_name: transformers
---

# EvoLMM — LoRA Adapters for Qwen2.5-VL

Lightweight **LoRA** adapters for the **EvoLMM** framework, built on **Qwen/Qwen2.5-VL-7B-Instruct**. Use these adapters together with the base model to run inference or evaluation without needing the full fine-tuned weights.
**Project Page** | **GitHub** | **[arXiv](https://arxiv.org/abs/2511.16672)**
## Requirements

```bash
pip install "transformers>=4.49" peft "accelerate>=0.25" pillow qwen-vl-utils torch
export HF_TOKEN=hf_********************************
```

---

## Quick Start (Transformers + PEFT)

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

BASE = "Qwen/Qwen2.5-VL-7B-Instruct"
LORA_REPO = "omkarthawakar/EvoLMM"
SUBFOLDER = "solver"
DTYPE = torch.bfloat16

# Load the base model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    BASE, device_map="auto", torch_dtype=DTYPE
)

# Attach the LoRA adapter
model = PeftModel.from_pretrained(
    model,
    LORA_REPO,
    subfolder=SUBFOLDER,
    token=None,
    use_safetensors=True,
)

processor = AutoProcessor.from_pretrained(BASE)
model.eval()
```

If you prefer a standalone checkpoint without PEFT in the inference path, see the optional merge sketch at the end of this card.

### Minimal single-image inference

```python
from qwen_vl_utils import process_vision_info
from PIL import Image

msg = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image", "image": Image.open("./assets/demo.png").convert("RGB")},
        {"type": "text", "text": "What is the main object in this image?"}
    ]},
]

text = processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info([msg])

inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
gen_only = out[0, inputs.input_ids.shape[1]:]
print(processor.tokenizer.decode(gen_only, skip_special_tokens=True).strip())
```

---

## License

The adapter weights and code follow the licenses of this repository and of the base model. Check the base model's license at `Qwen/Qwen2.5-VL-7B-Instruct` and ensure your usage complies with all third-party terms.

---

## Citation

If you use these adapters, please cite EvoLMM:

```bibtex
@misc{thawakar2025evolmmselfevolvinglargemultimodal,
      title={EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards},
      author={Omkar Thawakar and Shravan Venkatraman and Ritesh Thawkar and Abdelrahman Shaker and Hisham Cholakkal and Rao Muhammad Anwer and Salman Khan and Fahad Khan},
      year={2025},
      eprint={2511.16672},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.16672},
}
```

---

## Acknowledgements

Built on top of the **Qwen2.5-VL** family, **Transformers**, **PEFT**, and **Accelerate**. Thanks to the open-source community for the tools that make adapter training and sharing straightforward.
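
---

## Optional: Merge the Adapter into the Base Weights

If you would rather serve a single standalone checkpoint instead of attaching the adapter at load time, the LoRA deltas can be folded into the base model with PEFT. The following is a minimal sketch, assuming the `model` and `processor` objects from the Quick Start above are already in memory; the output directory `EvoLMM-solver-merged` is only an illustrative placeholder, not an official artifact.

```python
# Minimal sketch: assumes `model` is the PeftModel and `processor` the AutoProcessor
# created in the Quick Start; "EvoLMM-solver-merged" is a hypothetical output path.
merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.save_pretrained("EvoLMM-solver-merged", safe_serialization=True)
processor.save_pretrained("EvoLMM-solver-merged")

# The merged folder can then be loaded without PEFT, e.g.:
# Qwen2_5_VLForConditionalGeneration.from_pretrained("EvoLMM-solver-merged", device_map="auto")
```

Merging trades the flexibility of swapping adapters for a simpler deployment artifact; keep the un-merged setup if you want to attach different adapters to the same base model.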