---
license: apache-2.0
language:
- en
tags:
- computer-use
- gui-agent
- vision-language-model
- screen-understanding
- vla
datasets:
- TESS-Computer/tess-agentnet
base_model: HuggingFaceTB/SmolVLM2-500M-Instruct
pipeline_tag: image-text-to-text
---

# TESS-500M

**TESS** is a Vision-Language-Action (VLA) model for computer use, inspired by robotic VLAs. Given a screenshot and a natural-language instruction, it predicts either a mouse action (click coordinates) or a keyboard action (typing or shortcuts).

## Model Description

- **Base Model**: SmolVLM2-500M-Instruct
- **Architecture**: SmolVLM + Router + Mouse/Keyboard heads
- **Parameters**: 508M total, 48M trainable
- **Training Data**: [tess-agentnet](https://huggingface.co/datasets/TESS-Computer/tess-agentnet) (~312K samples)

## Usage

```python
import torch
from PIL import Image

# Clone the TESS repo first:
#   git clone https://github.com/husseinlezzaik/TESS.git
#   cd TESS/model

from test_checkpoint import load_model, predict

# Load the model and processor from a checkpoint
model, processor = load_model("path/to/checkpoint.pt", device="cuda")

# Run inference on a screenshot
image = Image.open("screenshot.png")
result = predict(model, processor, image, "Click the search button")

print(result)
# Mouse action:    {'action_type': 'mouse', 'xy': array([0.45, 0.32]), 'click_type': 'LEFT_CLICK'}
# Keyboard action: {'action_type': 'keyboard', 'action': 'type', 'value': 'hello world'}
```

## Output Format

**Mouse actions:**

```python
{
    'action_type': 'mouse',
    'xy': [x, y],  # normalized coordinates in [0, 1]
    'click_type': 'LEFT_CLICK' | 'RIGHT_CLICK' | 'DOUBLE_CLICK' | ...
}
```

**Keyboard actions:**

```python
{
    'action_type': 'keyboard',
    'action': 'type' | 'press' | 'hotkey',
    'value': 'text to type' | '<ENTER>' | '<SUPER+C>'
}
```

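To act on a prediction, the action dict has to be translated into real input events. Below is a minimal, hypothetical dispatcher using `pyautogui` (not part of TESS): the coordinate scaling follows the normalized `[0, 1]` convention above, while the key-token mapping (e.g. `<ENTER>` to `enter`, `SUPER` to a platform modifier) is an assumption that may need adjusting per platform.

```python
import pyautogui

def execute_action(result):
    """Hypothetical dispatcher: turns a TESS action dict into pyautogui calls."""
    if result['action_type'] == 'mouse':
        # Scale normalized [0, 1] coordinates to the current screen resolution.
        screen_w, screen_h = pyautogui.size()
        x, y = int(result['xy'][0] * screen_w), int(result['xy'][1] * screen_h)
        click = result['click_type']
        if click == 'LEFT_CLICK':
            pyautogui.click(x, y)
        elif click == 'RIGHT_CLICK':
            pyautogui.rightClick(x, y)
        elif click == 'DOUBLE_CLICK':
            pyautogui.doubleClick(x, y)
    elif result['action_type'] == 'keyboard':
        if result['action'] == 'type':
            pyautogui.write(result['value'])
        elif result['action'] == 'press':
            # Assumes tokens like '<ENTER>' map to pyautogui key names ('enter').
            pyautogui.press(result['value'].strip('<>').lower())
        elif result['action'] == 'hotkey':
            # Assumes '<SUPER+C>' splits into key names; 'super' may need to be
            # remapped to 'win' or 'command' depending on the platform.
            keys = [k.lower() for k in result['value'].strip('<>').split('+')]
            pyautogui.hotkey(*keys)
```

Some form of sandboxing or confirmation step is advisable before letting model predictions drive real input devices.
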
## Architecture

```
Screenshot + Instruction → SmolVLM2 → Shared MLP → Router
                                                      │
                                  ┌───────────────────┴───────────────────┐
                                  │                                       │
                            Mouse Branch                          Keyboard Branch
                        (XY + Click heads)                    (VLM text generation)
```

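For intuition, here is a minimal PyTorch sketch of the router and mouse heads described above. The hidden size, the number of click classes, and the module names are all assumptions for illustration; the keyboard branch is not a separate head but ordinary text generation by the VLM backbone.

```python
import torch
import torch.nn as nn

class TessActionHeads(nn.Module):
    """Illustrative sketch only; dimensions and class counts are assumed."""

    def __init__(self, hidden_dim=960, num_click_types=4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU())
        self.router = nn.Linear(hidden_dim, 2)                 # mouse vs. keyboard
        self.xy_head = nn.Linear(hidden_dim, 2)                # (x, y)
        self.click_head = nn.Linear(hidden_dim, num_click_types)

    def forward(self, vlm_features):
        h = self.shared(vlm_features)
        route_logits = self.router(h)           # which branch handles this step
        xy = torch.sigmoid(self.xy_head(h))     # keep coordinates in [0, 1]
        click_logits = self.click_head(h)
        # The keyboard branch is handled by the VLM's own text generation.
        return route_logits, xy, click_logits
```
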
## Training

- **Epochs**: 3
- **Batch Size**: 48
- **Optimizer**: AdamW (LR 2e-4 for heads, 5e-4 for embeddings; see the sketch below)
- **Hardware**: NVIDIA H100 80GB
- **Training Time**: ~8 hours

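The split learning rates can be expressed as AdamW parameter groups. A minimal sketch, assuming the module names from the architecture sketch above (`router`, `xy_head`, `click_head`) and a `backbone` attribute exposing the VLM:

```python
from itertools import chain
from torch.optim import AdamW

# Hypothetical parameter groups matching the listed learning rates.
head_params = chain(
    model.router.parameters(),
    model.xy_head.parameters(),
    model.click_head.parameters(),
)
optimizer = AdamW([
    {"params": head_params, "lr": 2e-4},  # action heads
    {"params": model.backbone.get_input_embeddings().parameters(), "lr": 5e-4},  # embeddings
])
```
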
## Limitations

- Trained primarily on desktop/web screenshots
- English instructions only
- May struggle with unusual UI layouts not seen in training

## License

Apache 2.0

## Citation

```bibtex
@misc{tess2025,
  title={TESS: A Vision-Language-Action Model for Computer Use},
  author={Hussein Lezzaik},
  year={2025},
  url={https://github.com/husseinlezzaik/TESS}
}
```