VLM-Cholecystectomie

This repository contains models for Surgical Phase, Step, Target and Tool Recognition in Laparoscopic Cholecystectomy videos.
It features two distinct approaches: a lightweight custom ViT-ResNet (a ResNet + Transformer architecture) and a large-scale finetuned Qwen3-VL.

Models

1. ViT-ResNet

A lightweight, specialized architecture designed for efficient video classification.

  • Backbone: ResNet50 (Frozen)
  • Aggregator: Temporal Transformer Encoder
  • Heads: MLP heads for Phase, Step, Target, and Tool prediction.
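
A minimal PyTorch sketch of such an architecture, assuming a frozen ImageNet-pretrained ResNet50 feeding a temporal Transformer encoder, with mean pooling over time and one small MLP head per task. The class name, hyperparameters, and pooling strategy are illustrative and not taken from this repository's implementation.

import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class FrameSequenceClassifier(nn.Module):
    """Illustrative ResNet50 + temporal Transformer multi-task classifier (not the repo's exact class)."""
    def __init__(self, vocab_size_dict, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
        backbone.fc = nn.Identity()              # expose 2048-d frame features
        for p in backbone.parameters():          # frozen backbone
            p.requires_grad = False
        self.backbone = backbone
        self.proj = nn.Linear(2048, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.heads = nn.ModuleDict({
            task: nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, n))
            for task, n in vocab_size_dict.items()
        })

    def forward(self, x):                        # x: [Batch, Time, C, H, W]
        b, t = x.shape[:2]
        feats = self.backbone(x.flatten(0, 1))   # [B*T, 2048]
        feats = self.proj(feats).view(b, t, -1)  # [B, T, d_model]
        feats = self.temporal(feats)             # temporal aggregation
        pooled = feats.mean(dim=1)               # clip-level representation
        return {task: head(pooled) for task, head in self.heads.items()}

# Example: one clip of 16 frames, vocab sizes as in the inference snippet below
model = FrameSequenceClassifier({"phase": 7, "step": 30, "target": 29})
logits = model(torch.randn(1, 16, 3, 224, 224))  # dict of per-task logits

Freezing the backbone keeps the trainable parameter count small: only the projection, the temporal encoder, and the heads are optimized.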

2. Qwen3-VL

A Vision-Language Model finetuned for surgical understanding.

  • Base Model: unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit
  • Method: LoRA Finetuning (Vision & Language layers)
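
For orientation, a LoRA setup along these lines can be expressed with Unsloth's FastVisionModel. The hyperparameters (rank, alpha, dropout) and exact keyword arguments below are assumptions for illustration; they may differ from this repository's training script and from your installed Unsloth version.

from unsloth import FastVisionModel

# Load the 4-bit quantized base model named above
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit",
    load_in_4bit=True,
)

# Attach LoRA adapters to both vision and language layers (illustrative settings)
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
)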

Tasks

The models are trained to jointly predict four types of surgical information:

  1. Phase: High-level surgical stages (e.g., PREPARATION, CALOT_TRIANGLE_DISSECTION).
  2. Step: Fine-grained surgical actions (e.g., CYSTIC_DUCT_DISSECTION, CLIPPING).
  3. Target: The anatomical structure or object being operated on (e.g., CYSTIC_ARTERY, GALLBLADDER).
  4. Tool(s): The tool(s) actively in use (e.g., GRASPER, HOOK).
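
Put together, each annotated frame (or short clip) carries one label per task. A hypothetical record, using only labels mentioned above, looks like this:

sample_labels = {
    "phase": "CALOT_TRIANGLE_DISSECTION",
    "step": "CYSTIC_DUCT_DISSECTION",
    "target": "CYSTIC_ARTERY",
    "tools": ["GRASPER", "HOOK"],   # multiple tools can be active at once
}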

Usage

Inference with ViT-ResNet

The ViT-ResNet model requires its architecture definition (available in the src folder of the associated code repository or the Space).

import torch
from model_utils import SurgicalTransformer  # Custom class from the src folder

# Load Model (vocab sizes = number of classes per prediction head)
model = SurgicalTransformer(vocab_size_dict={"phase": 7, "step": 30, "target": 29})
checkpoint = torch.load("models/vit_v1/vit_resampling_v1.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Inference
# input_tensor: [Batch, Time, Channels, Height, Width]
with torch.no_grad():
    output = model(input_tensor)
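
The card does not describe how input_tensor is built, so the recipe below is an assumption: a typical pipeline for a frozen ImageNet-pretrained ResNet50 resizes frames to 224x224 and applies ImageNet normalization. frame_paths is a hypothetical list of sampled video frame files.

import torch
from PIL import Image
from torchvision import transforms

# Assumed ImageNet preprocessing; the repository may use different sizes/statistics
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

frames = [Image.open(p).convert("RGB") for p in frame_paths]  # frame_paths: hypothetical sampled frames
clip = torch.stack([preprocess(f) for f in frames])           # [Time, Channels, Height, Width]
input_tensor = clip.unsqueeze(0)                              # [Batch=1, Time, Channels, Height, Width]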