jianwenzh committed on
Commit 89516d1 · verified · 1 Parent(s): 489089e

Update README.md

Correction on typo

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -244,7 +244,7 @@ RFT: ~40K examples
  - **Not suitable for high-resolution images**
 
  There are two factors in the base model [SmolVLM2-500M](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) which limit the capability of handling high resolution images:
- 1. Aggressive pixel shuffle (r=4), compressing 64 pixels into one token.
+ 1. Aggressive pixel shuffle (r=4), compressing 64x64 pixels into one token.
  2. Fixed scaling to 2048px on the longest side.
 
  Together, they severely impact grounding on high-resolution screens. We evaluated Vocaela-500M on ScreenSpotPro. The average score is 15.1. Although still better than several 2B/3B/7B models, the relative superiority is much worse compared to that on ScreenSpot/ScreenSpotV2. The result verifies the limitation.
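
The corrected figure (64x64 pixels per token rather than 64 pixels) can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the model's code: it assumes a vision-encoder patch size of 16 pixels (the README text does not state this), and the function names, the 3840x2160 screen, and the 24px icon are hypothetical. It shows how the pixel shuffle ratio and the 2048px scaling together shrink a small UI target relative to the area covered by one token.

```python
# Illustrative arithmetic only; patch_size=16 is an assumption, not stated in the README.

def pixels_per_token_side(patch_size: int = 16, shuffle_ratio: int = 4) -> int:
    """Side length (in pixels) of the square image area merged into one token."""
    return patch_size * shuffle_ratio


def scale_to_longest_side(width: int, height: int, longest_side: int = 2048):
    """Resize so the longest side equals `longest_side`, keeping aspect ratio."""
    scale = longest_side / max(width, height)
    return round(width * scale), round(height * scale), scale


side = pixels_per_token_side()                    # 16 * 4 = 64, i.e. 64x64 pixels per token
w, h, scale = scale_to_longest_side(3840, 2160)   # hypothetical 4K screenshot
grid = (w // side, h // side)                     # token cells across and down
icon_after_scaling = 24 * scale                   # hypothetical 24px icon on the original screen

print(f"{side}x{side} pixels per token")
print(f"scaled image: {w}x{h}, token grid: {grid[0]}x{grid[1]}")
print(f"24px icon becomes {icon_after_scaling:.1f}px, inside a {side}px token cell")
```

Under these assumptions a small icon occupies only a fraction of a single token cell after scaling, which is consistent with the weaker grounding the README reports on high-resolution screens.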