vocaela/Vocaela-500M


jianwenzh committed on Oct 19

Commit 89516d1 · verified · 1 Parent(s): 489089e

Update README.md

Correction on typo

Files changed (1)

README.md +1 -1

README.md CHANGED

@@ -244,7 +244,7 @@ RFT: ~40K examples
 - **Not suitable for high-resolution images**
   There are two factors in the base model [SmolVLM2-500M](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) which limit the capability of handling high resolution images:
-    1. Aggressive pixel shuffle (r=4), compressing 64 pixels into one token.
     2. Fixed scaling to 2048px on the longest side.
     Together, they severely impact grounding on high-resolution screens. We evaluated Vocaela-500M on ScreenSpotPro. The average score is 15.1. Although still better than several 2B/3B/7B models, the relative superiority is much worse compared to that on ScreenSpot/ScreenSpotV2. The result verifies the limitation.

 - **Not suitable for high-resolution images**
   There are two factors in the base model [SmolVLM2-500M](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) which limit the capability of handling high resolution images:
+    1. Aggressive pixel shuffle (r=4), compressing 64x64 pixels into one token.
     2. Fixed scaling to 2048px on the longest side.
     Together, they severely impact grounding on high-resolution screens. We evaluated Vocaela-500M on ScreenSpotPro. The average score is 15.1. Although still better than several 2B/3B/7B models, the relative superiority is much worse compared to that on ScreenSpot/ScreenSpotV2. The result verifies the limitation.
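
The corrected figure (64x64 pixels per token) can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a vision-encoder patch size of 16 px (not stated in this diff), treats the pixel-shuffle ratio r=4 as merging an r x r block of patches into one token, and ignores any image tiling or extra global-image tokens the real processor may add. The function names and the 4K screenshot with a 32 px icon are hypothetical.

```python
import math

def token_cell_side(patch_size: int = 16, shuffle_ratio: int = 4) -> int:
    """Side length, in pixels, of the square region covered by one visual token.

    Assumes each patch is patch_size x patch_size pixels and pixel shuffle
    merges shuffle_ratio x shuffle_ratio patches into a single token.
    """
    return patch_size * shuffle_ratio

def scale_to_longest_side(width: int, height: int, longest: int = 2048):
    """Rescale (width, height) so the longest side equals `longest` pixels."""
    scale = longest / max(width, height)
    return round(width * scale), round(height * scale), scale

cell = token_cell_side()                      # 16 * 4 = 64 px per side
w, h, scale = scale_to_longest_side(3840, 2160)  # hypothetical 4K screenshot

# Approximate token grid, ignoring tiling details of the real processor.
tokens = math.ceil(w / cell) * math.ceil(h / cell)

# A hypothetical 32 px icon after rescaling, compared with one token cell.
icon_px = 32 * scale

print(f"one token covers {cell}x{cell} = {cell * cell} pixels")
print(f"3840x2160 is rescaled to {w}x{h}, about {tokens} tokens")
print(f"a 32 px icon shrinks to {icon_px:.1f} px, inside a {cell} px token cell")
```

Under these assumptions, a small UI element on a high-resolution screen ends up covering only a fraction of a single token's cell, which is consistent with the grounding limitation described above.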