Update README.md
Correction on typo
README.md
CHANGED
@@ -244,7 +244,7 @@ RFT: ~40K examples
 - **Not suitable for high-resolution images**

 There are two factors in the base model [SmolVLM2-500M](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) which limit the capability of handling high resolution images:
-1. Aggressive pixel shuffle (r=4), compressing
+1. Aggressive pixel shuffle (r=4), compressing 64x64 pixels into one token.
 2. Fixed scaling to 2048px on the longest side.

 Together, they severely impact grounding on high-resolution screens. We evaluated Vocaela-500M on ScreenSpotPro. The average score is 15.1. Although still better than several 2B/3B/7B models, the relative superiority is much worse compared to that on ScreenSpot/ScreenSpotV2. The result verifies the limitation.
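To make the corrected line concrete, the arithmetic behind "64x64 pixels into one token" and its interaction with the 2048px scaling can be sketched as follows. This is only an illustration, not code from the model repository: the 16px vision patch size is an assumption (the README states only r=4 and the 64x64 result), and the function names and the example screen and icon sizes are hypothetical.

```python
# Illustrative arithmetic only; not taken from the Vocaela or SmolVLM2 code.
# ASSUMPTION: the vision encoder uses 16px patches, so a pixel shuffle with
# r=4 merges a 4x4 group of patches, i.e. 64x64 pixels, into one token.

def pixels_per_token_side(patch_size: int = 16, shuffle_ratio: int = 4) -> int:
    """Side length, in pixels, of the image area represented by one token."""
    return patch_size * shuffle_ratio


def scaled_size(width: int, height: int, longest_side: int = 2048) -> tuple[int, int, float]:
    """Resize so the longest side equals `longest_side`, keeping aspect ratio."""
    scale = longest_side / max(width, height)
    return round(width * scale), round(height * scale), scale


def element_span_in_tokens(element_px: float, width: int, height: int) -> float:
    """How many token widths a UI element of `element_px` pixels spans after scaling."""
    _, _, scale = scaled_size(width, height)
    return element_px * scale / pixels_per_token_side()


if __name__ == "__main__":
    # Hypothetical example: a 24px icon on a 3840x2160 screenshot.
    print(pixels_per_token_side())                  # side of one token in pixels
    print(scaled_size(3840, 2160))                  # size after fixed scaling
    print(element_span_in_tokens(24, 3840, 2160))   # fraction of one token
```

Under these assumptions, a small icon on a 4K screenshot covers only a fraction of a single token after scaling, which is consistent with the weaker ScreenSpotPro result reported in the README.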