# VLM-Cholecystectomie

This repository contains models for Surgical Phase, Step, Target, and Tool Recognition in laparoscopic cholecystectomy videos.
It features two distinct approaches: a lightweight custom ViT (ResNet + Transformer) and a large-scale finetuned Qwen3-VL.
## Models

### 1. ViT-ResNet

A lightweight, specialized architecture designed for efficient video classification; a minimal sketch follows the list below.

- Backbone: ResNet50 (frozen)
- Aggregator: Temporal Transformer encoder
- Heads: MLP heads for Phase, Step, Target, and Tool prediction
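The sketch below shows one plausible way to wire these pieces together in PyTorch: a frozen ResNet50 encodes each frame, a Transformer encoder aggregates across time, and one linear head per task produces logits. The class name, dimensions, and head layout are illustrative assumptions, not the repository's actual `SurgicalTransformer` definition.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class FrozenResNetTemporalTransformer(nn.Module):
    """Illustrative sketch of the described layout, not the repo's exact code.
    Positional encoding is omitted for brevity; a real implementation
    would likely add one before the temporal encoder."""

    def __init__(self, vocab_size_dict, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        # Drop the final FC layer; keep the pooled 2048-d feature extractor
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False  # backbone stays frozen
        self.proj = nn.Linear(2048, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers)
        # One MLP head per task (here a single linear layer each)
        self.heads = nn.ModuleDict(
            {task: nn.Linear(d_model, n) for task, n in vocab_size_dict.items()}
        )

    def forward(self, x):  # x: [Batch, Time, Channels, Height, Width]
        b, t = x.shape[:2]
        feats = self.backbone(x.flatten(0, 1)).flatten(1)  # [B*T, 2048]
        feats = self.proj(feats).view(b, t, -1)            # [B, T, d_model]
        feats = self.temporal(feats)                       # temporal context
        pooled = feats.mean(dim=1)                         # clip-level feature
        return {task: head(pooled) for task, head in self.heads.items()}
```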
### 2. Qwen3-VL

A Vision-Language Model finetuned for surgical understanding; a hedged training sketch follows the list below.

- Base Model: `unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit`
- Method: LoRA finetuning (vision & language layers)
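For reference, a LoRA setup covering both vision and language layers typically looks like the following with Unsloth's `FastVisionModel` API; the hyperparameters shown are illustrative defaults, not the values used to train this checkpoint.

```python
from unsloth import FastVisionModel

# Load the 4-bit quantized base model (sketch, not the repo's training script)
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit",
    load_in_4bit=True,
)

# Attach LoRA adapters to both the vision and language stacks
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,    # adapt the vision tower
    finetune_language_layers=True,  # adapt the language model
    r=16,                           # LoRA rank (illustrative)
    lora_alpha=16,                  # LoRA scaling (illustrative)
)
```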
## Tasks

The models are trained to predict four levels of surgical granularity simultaneously (an illustrative combined prediction follows the list):

- Phase: high-level surgical stages (e.g., `PREPARATION`, `CALOT_TRIANGLE_DISSECTION`)
- Step: fine-grained surgical actions (e.g., `CYSTIC_DUCT_DISSECTION`, `CLIPPING`)
- Target: the anatomical structure or object being operated on (e.g., `CYSTIC_ARTERY`, `GALLBLADDER`)
- Tool(s): the list of tool(s) actively in use during the surgery (e.g., `GRASPER HOOK`, `GRASPER`)
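Put together, a single prediction covering all four levels might look like the dictionary below; the exact output schema is an assumption for illustration, not the repository's documented format.

```python
# Illustrative shape of one combined prediction (assumed schema)
prediction = {
    "phase": "CALOT_TRIANGLE_DISSECTION",
    "step": "CYSTIC_DUCT_DISSECTION",
    "target": "CYSTIC_ARTERY",
    "tools": ["GRASPER", "HOOK"],
}
```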
## Usage

### Inference with ViT-ResNet

The ViT model requires the specific architecture definition (available in the `src` folder of the associated code repository or the Space).
```python
import torch
from model_utils import SurgicalTransformer  # custom class from the repo's src folder

# Load the model and restore the checkpoint weights
model = SurgicalTransformer(vocab_size_dict={"phase": 7, "step": 30, "target": 29})
checkpoint = torch.load("models/vit_v1/vit_resampling_v1.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Inference
# input_tensor shape: [Batch, Time, Channels, Height, Width]
output = model(input_tensor)
```
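Assuming the forward pass returns a dict of per-task logits (as in the architecture sketch above; the actual return type may differ), decoding could continue the snippet like this:

```python
# Hypothetical decoding step: map per-task logits to class indices.
# Look the indices up in the repository's label vocabularies to get names.
with torch.no_grad():
    output = model(input_tensor)
predicted_ids = {task: logits.argmax(dim=-1) for task, logits in output.items()}
print(predicted_ids)  # e.g. {"phase": tensor([1]), "step": tensor([12]), ...}
```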