---
license: apache-2.0
language:
- en
tags:
- computer-use
- gui-agent
- vision-language-model
- screen-understanding
- vla
datasets:
- TESS-Computer/tess-agentnet
base_model: HuggingFaceTB/SmolVLM2-500M-Instruct
pipeline_tag: image-text-to-text
---

# TESS-500M

**TESS** is a Vision-Language-Action (VLA) model for computer use, inspired by robotic VLAs. Given a screenshot and a natural-language instruction, it predicts either a mouse action (click coordinates) or a keyboard action (typing or shortcuts).

## Model Description

- **Base Model**: SmolVLM2-500M-Instruct
- **Architecture**: SmolVLM + Router + Mouse/Keyboard heads
- **Parameters**: 508M total, 48M trainable
- **Training Data**: [tess-agentnet](https://huggingface.co/datasets/TESS-Computer/tess-agentnet) (~312K samples)

## Usage

```python
import torch
from PIL import Image

# Clone the TESS repo first:
#   git clone https://github.com/husseinlezzaik/TESS.git
#   cd TESS/model

from test_checkpoint import load_model, predict

# Load the model and processor from a checkpoint
model, processor = load_model("path/to/checkpoint.pt", device="cuda")

# Run inference on a screenshot
image = Image.open("screenshot.png")
result = predict(model, processor, image, "Click the search button")

print(result)
# Mouse action:    {'action_type': 'mouse', 'xy': array([0.45, 0.32]), 'click_type': 'LEFT_CLICK'}
# Keyboard action: {'action_type': 'keyboard', 'action': 'type', 'value': 'hello world'}
```

## Output Format

**Mouse actions:**

```python
{
    'action_type': 'mouse',
    'xy': [x, y],  # normalized coordinates in [0, 1]
    'click_type': 'LEFT_CLICK' | 'RIGHT_CLICK' | 'DOUBLE_CLICK' | ...
}
```

**Keyboard actions:**

```python
{
    'action_type': 'keyboard',
    'action': 'type' | 'press' | 'hotkey',
    'value': 'text to type' | '<ENTER>' | '<SUPER+C>'
}
```

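To act on a prediction, the action dict has to be translated into real input events. Below is a minimal, hypothetical dispatcher using `pyautogui` (not part of TESS): the coordinate scaling follows the normalized `[0, 1]` convention above, while the key-token mapping (e.g. `<ENTER>` to `enter`, `SUPER` to a platform modifier) is an assumption that may need adjusting per platform.

```python
import pyautogui

def execute_action(result):
    """Hypothetical dispatcher: turns a TESS action dict into pyautogui calls."""
    if result['action_type'] == 'mouse':
        # Scale normalized [0, 1] coordinates to the current screen resolution.
        screen_w, screen_h = pyautogui.size()
        x, y = int(result['xy'][0] * screen_w), int(result['xy'][1] * screen_h)
        click = result['click_type']
        if click == 'LEFT_CLICK':
            pyautogui.click(x, y)
        elif click == 'RIGHT_CLICK':
            pyautogui.rightClick(x, y)
        elif click == 'DOUBLE_CLICK':
            pyautogui.doubleClick(x, y)
    elif result['action_type'] == 'keyboard':
        if result['action'] == 'type':
            pyautogui.write(result['value'])
        elif result['action'] == 'press':
            # Assumes tokens like '<ENTER>' map to pyautogui key names ('enter').
            pyautogui.press(result['value'].strip('<>').lower())
        elif result['action'] == 'hotkey':
            # Assumes '<SUPER+C>' splits into key names; 'super' may need to be
            # remapped to 'win' or 'command' depending on the platform.
            keys = [k.lower() for k in result['value'].strip('<>').split('+')]
            pyautogui.hotkey(*keys)
```

Some form of sandboxing or confirmation step is advisable before letting model predictions drive real input devices.
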
## Architecture

```
Screenshot + Instruction → SmolVLM2 → Shared MLP → Router
                                                      │
                                  ┌───────────────────┴───────────────────┐
                                  │                                       │
                            Mouse Branch                          Keyboard Branch
                        (XY + Click heads)                    (VLM text generation)
```

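For intuition, here is a minimal PyTorch sketch of the router and mouse heads described above. The hidden size, the number of click classes, and the module names are all assumptions for illustration; the keyboard branch is not a separate head but ordinary text generation by the VLM backbone.

```python
import torch
import torch.nn as nn

class TessActionHeads(nn.Module):
    """Illustrative sketch only; dimensions and class counts are assumed."""

    def __init__(self, hidden_dim=960, num_click_types=4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU())
        self.router = nn.Linear(hidden_dim, 2)                 # mouse vs. keyboard
        self.xy_head = nn.Linear(hidden_dim, 2)                # (x, y)
        self.click_head = nn.Linear(hidden_dim, num_click_types)

    def forward(self, vlm_features):
        h = self.shared(vlm_features)
        route_logits = self.router(h)           # which branch handles this step
        xy = torch.sigmoid(self.xy_head(h))     # keep coordinates in [0, 1]
        click_logits = self.click_head(h)
        # The keyboard branch is handled by the VLM's own text generation.
        return route_logits, xy, click_logits
```
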
## Training

- **Epochs**: 3
- **Batch Size**: 48
- **Optimizer**: AdamW (LR 2e-4 for heads, 5e-4 for embeddings; see the sketch below)
- **Hardware**: NVIDIA H100 80GB
- **Training Time**: ~8 hours

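The split learning rates can be expressed as AdamW parameter groups. A minimal sketch, assuming the module names from the architecture sketch above (`router`, `xy_head`, `click_head`) and a `backbone` attribute exposing the VLM:

```python
from itertools import chain
from torch.optim import AdamW

# Hypothetical parameter groups matching the listed learning rates.
head_params = chain(
    model.router.parameters(),
    model.xy_head.parameters(),
    model.click_head.parameters(),
)
optimizer = AdamW([
    {"params": head_params, "lr": 2e-4},  # action heads
    {"params": model.backbone.get_input_embeddings().parameters(), "lr": 5e-4},  # embeddings
])
```
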
## Limitations

- Trained primarily on desktop/web screenshots
- English instructions only
- May struggle with unusual UI layouts not seen in training

## License

Apache 2.0

## Citation

```bibtex
@misc{tess2025,
  title={TESS: A Vision-Language-Action Model for Computer Use},
  author={Hussein Lezzaik},
  year={2025},
  url={https://github.com/husseinlezzaik/TESS}
}
```