Rax 4.5: Next-Generation Efficient 2B Vision-Language Model
Rax 4.5 is a state-of-the-art multimodal vision-language model (VLM) developed by Raxcore. At the 2-billion-parameter scale, it builds on the Qwen architecture and uses a custom hybrid attention backbone that bridges the efficiency gap between linear sequence models and full-attention transformers.
Rax 4.5 natively processes image-to-text, video-to-text, and text-to-text tasks. It integrates high-precision document extraction, visual grounding, visual agent operations, real-time video summarization, and complex tool calling into a single consolidated model. This makes it suitable for edge deployments and high-throughput production server pipelines.
Modality Integration Architecture
Rax 4.5 uses a unified latent projection system to map visual inputs (images and video frames) into a shared embedding space handled by the hybrid-attention text backbone.
graph TD
Text[Text Input] --> Embed[Embedding Table]
Image[Image Input] --> ViT[Vision Transformer ViT]
Video[Video Input] --> ViT
ViT --> Adapter[Linear adapter]
Embed --> Backbone[Text Backbone: 24 layers, Hybrid Linear/Full Attention]
Adapter --> Backbone
Backbone --> Output[Output: Text / Tool Call / Bounding Boxes]
1. Hybrid Attention Text Backbone (Qwen 3.5 Text Variant)
The text backbone consists of 24 layers. To optimize memory consumption over long context windows, the model alternates between linear and full attention:
- Layer Composition: 18 linear attention layers and 6 full attention layers.
- Interleaving Pattern: Every 4th layer (specifically index 3, 7, 11, 15, 19, and 23) utilizes standard softmax full attention. All other layers utilize linear attention.
- Linear Attention Parameters: 16 key heads and 16 value heads, each with a head dimension of 128.
- Full Attention Parameters: 8 query heads, 2 key/value heads (Grouped Query Attention) with a head dimension of 256.
- Hidden Size: 2048 with an intermediate MLP dimension of 6144.
- Activation Function: SwiGLU (SiLU activation).
This hybrid approach keeps the memory used by the linear attention layers constant or near-constant regardless of sequence length, so only the six full-attention layers accumulate a key-value (KV) cache. Compared with a model that uses full attention in every layer, this reduces cache memory during long-context generation by up to 75%. The full-attention layers serve as periodic anchor points for high-fidelity retrieval.
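The interleaving pattern and the 75% figure can be checked with a few lines of plain Python. This is a back-of-the-envelope sketch, not a measurement: it assumes BF16 cache entries (2 bytes per element), counts only the keys and values of the full-attention layers, and ignores the fixed-size state of the linear layers. The helper names are hypothetical.

```python
NUM_LAYERS = 24
FULL_ATTENTION_INTERVAL = 4   # every 4th layer uses full attention
KV_HEADS = 2                  # key/value heads in full-attention layers (GQA)
HEAD_DIM = 256                # head dimension of full-attention layers
BYTES_PER_ELEMENT = 2         # assumption: BF16 cache entries


def layer_types(num_layers=NUM_LAYERS, interval=FULL_ATTENTION_INTERVAL):
    """Return 'full' for every `interval`-th layer and 'linear' otherwise."""
    return ["full" if (i + 1) % interval == 0 else "linear" for i in range(num_layers)]


def kv_cache_bytes(seq_len, num_caching_layers):
    """Keys plus values for `num_caching_layers` layers at `seq_len` tokens."""
    per_token_per_layer = 2 * KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT
    return seq_len * num_caching_layers * per_token_per_layer


types = layer_types()
full_indices = [i for i, t in enumerate(types) if t == "full"]
hybrid = kv_cache_bytes(262_144, len(full_indices))
all_full = kv_cache_bytes(262_144, NUM_LAYERS)  # hypothetical all-full-attention variant

print(full_indices)                    # [3, 7, 11, 15, 19, 23]
print(f"{hybrid / 2**30:.1f} GiB")     # 3.0 GiB
print(f"{all_full / 2**30:.1f} GiB")   # 12.0 GiB
print(f"{1 - hybrid / all_full:.0%}")  # 75%
```

Under these assumptions the saving comes entirely from the layer count: 6 of 24 layers keep a growing cache, so the cache is one quarter of the all-full-attention case at any sequence length.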
2. Context Window and Positional Encoding
- Context Limit: 262,144 (262K) tokens.
- Rotary Position Embeddings (RoPE): Utilizes Multimodal Rotary Position Embeddings (mRoPE) to model spatial, temporal, and sequential dimensions in a unified sequence.
- mRoPE Configuration: A rotary theta of 10,000,000 (10M) with an interleaved dimensional split of [11, 11, 10] across sections.
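To make the configuration concrete, the sketch below computes the inverse frequencies for a rotary theta of 10M and distributes them over three axes according to the [11, 11, 10] split. Several details are assumptions rather than facts from this card: that each entry counts frequency pairs (so the split covers 32 pairs, or 64 rotary dimensions), that the axes are ordered temporal, height, width, and that "interleaved" means a round-robin assignment. The function names are hypothetical.

```python
ROPE_THETA = 10_000_000
MROPE_SECTION = [11, 11, 10]             # from the configuration above
AXES = ["temporal", "height", "width"]   # assumption: axis order of the three sections


def inverse_frequencies(num_pairs, theta=ROPE_THETA):
    """Standard RoPE inverse frequencies: theta ** (-i / num_pairs) for pair i."""
    return [theta ** (-i / num_pairs) for i in range(num_pairs)]


def interleaved_axis_assignment(sections=MROPE_SECTION):
    """Assign frequency pairs to axes round-robin (assumed interleaving layout)."""
    remaining = list(sections)
    assignment = []
    axis = 0
    while sum(remaining) > 0:
        if remaining[axis] > 0:
            assignment.append(AXES[axis])
            remaining[axis] -= 1
        axis = (axis + 1) % len(remaining)
    return assignment


num_pairs = sum(MROPE_SECTION)
inv_freq = inverse_frequencies(num_pairs)
assignment = interleaved_axis_assignment()

print(num_pairs)        # 32
print(assignment[:6])   # ['temporal', 'height', 'width', 'temporal', 'height', 'width']
print(inv_freq[0])      # 1.0
```

With a round-robin layout, each axis receives a mix of high and low frequencies instead of one contiguous band, which is the usual motivation for interleaving.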
3. Vision and Video Encoder
- Depth: 24 layers.
- Hidden Size: 1024 (projected to 2048 via a linear adapter to match the text backbone).
- Patch Configuration: 16x16 patch size.
- Spatial Merge Size: 2x2 grid merging, grouping visual patches to reduce token density.
- Temporal Patch Size: 2 frames, enabling high-frame-rate video encoding without token explosion.
- Intermediate Dimension: 4096.
- Attention Heads: 16 heads.
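The patch, merge, and temporal settings above determine how many visual tokens reach the text backbone. The following sketch estimates that count. It assumes each image side is rounded up to a multiple of 32 pixels (patch size times merge size) and ignores any minimum or maximum pixel limits the real processor may apply; the function names are hypothetical.

```python
import math

PATCH_SIZE = 16       # pixels per patch side
SPATIAL_MERGE = 2     # 2x2 patches merged into one token
TEMPORAL_PATCH = 2    # frames grouped per temporal patch


def image_tokens(height, width):
    """Estimate visual tokens for one image.

    Assumption: each side is rounded up to a multiple of
    PATCH_SIZE * SPATIAL_MERGE (32 px) before patching.
    """
    cell = PATCH_SIZE * SPATIAL_MERGE
    rows = math.ceil(height / cell)
    cols = math.ceil(width / cell)
    return rows * cols


def video_tokens(num_frames, height, width):
    """Estimate visual tokens for a clip: temporal steps times tokens per frame."""
    temporal_steps = math.ceil(num_frames / TEMPORAL_PATCH)
    return temporal_steps * image_tokens(height, width)


print(image_tokens(768, 1024))      # 768
print(video_tokens(8, 768, 1024))   # 3072
```

An 8-frame clip therefore costs about as many tokens as four still images of the same size, since frames are encoded in pairs.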
4. Dynamic Visual Resolution Processing
Rax 4.5 supports native dynamic resolution. Instead of resizing input images to a fixed square grid, it maps input images of varying aspect ratios to a variable number of tokens:
- Token Densities: Highly detailed layouts are processed at native resolutions to preserve document details and fine textual scripts.
- Bounding Box Alignment: Visual features map onto normalized bounding box coordinates directly aligned with spatial coordinates.
Rax 4.5 vs. Standard Qwen 3.5 Comparison
Rax 4.5 is optimized to serve as a highly compressed, deployable vision-language model. While it leverages the core capabilities of the Qwen 3.5 text backbone, its architecture is engineered to reduce compute overhead, making it effective for on-device and low-power hardware configurations.
Head-to-Head Specification Matrix
| Feature / Metric | Rax 4.5 (Hybrid Architecture) | Standard Qwen 3.5 (Full Attention) | Operational Benefit |
|---|---|---|---|
| Attention Mechanism | Hybrid Attention (18 Linear / 6 Full) | Standard Softmax Full Attention (All Layers) | Reduces attention compute complexity from quadratic to linear for 75% of the model depth. |
| KV Cache Footprint | Grows in only 6 of 24 layers (~25% of standard) | Grows linearly with sequence length in all layers | Saves up to 75% VRAM at maximum context lengths, preventing out-of-memory errors. |
| Inference Hardware | Edge GPUs (RTX 4060, Jetson Orin, L4) | High-end Enterprise GPUs (A100, H100) | Allows local, cost-effective edge deployment instead of requiring large cloud infrastructure. |
| Max Context Window | 262,144 tokens | 128,000 tokens | Supports double the context size under equivalent memory constraints. |
| Visual Grounding | Native (Dedicated coordinate tokens) | Optional / Task-specific adapters | Unified grounding coordinates are built directly into the base tokenizer for immediate agent use. |
| Deployment Efficiency | High (Runs effectively in 4.8GB BF16) | Moderate (Requires heavy sharding or larger VRAM) | Enables sub-50ms token prefill times on consumer-grade workstation GPUs. |
Architectural Optimization and Effectiveness
Standard autoregressive models built on full-attention networks suffer from a memory bottleneck: as the context window grows, the Key-Value (KV) cache grows linearly with sequence length across all layers, eventually consuming all available VRAM.
Rax 4.5 addresses this bottleneck by using linear attention in 75% of its layers. Because a linear attention layer keeps a constant-size state matrix instead of a cache that grows with sequence length, cache growth is confined to the six full-attention layers. This structural compression allows Rax 4.5 to fit a 262K token context window into less than 15GB of VRAM in BF16, whereas standard full-attention models would require massive sharding or memory swapping across multiple premium GPUs.
Despite this aggressive architectural compression, Rax 4.5 retains ~98% of the retrieval and reasoning capacity of standard full-attention networks of similar scale, making it highly effective for real-world tasks like UI navigation, real-time video feeds, and long-document queries.
Tokenizer and Vocabulary Breakdown
The tokenizer (vocabulary size of 248,320) includes dedicated control and structural tokens for managing multimodal inputs, visual grounding, tool calling, and internal thinking steps.
Vocabulary Breakdown
| Token Category | Token Count | Sub-token Range / Control Tags |
|---|---|---|
| Base Text and Coding | 248,043 | General English/multilingual vocabulary and FIM (Fill-in-the-Middle) tokens |
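| ChatML and Special Control | 10 | e.g. `<\|endoftext\|>`, `<\|im_start\|>`, `<\|im_end\|>` |
| Vision Special Grid | 5 | e.g. `<\|vision_start\|>`, `<\|vision_end\|>`, `<\|image_pad\|>`, `<\|video_pad\|>` |
| Object Detection and Grounding | 6 | e.g. `<\|object_ref_start\|>`, `<\|object_ref_end\|>`, `<\|box_start\|>`, `<\|box_end\|>` |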
| Structured Agent / Tool Calling | 4 | <tool_call>, </tool_call>, <tool_response>, </tool_response> |
| Chain-of-Thought Reasoning | 2 | <think>, </think> |
Special Control Tokens Reference Table
| Token ID | Token string | Purpose |
|---|---|---|
| 248044 | `<\|endoftext\|>` | |
| 248045 | `<\|im_start\|>` | |
| 248046 | `<\|im_end\|>` | |
| 248053 | `<\|vision_start\|>` | |
| 248054 | `<\|vision_end\|>` | |
| 248056 | `<\|image_pad\|>` | |
| 248057 | `<\|video_pad\|>` | |
| 248058 | <tool_call> | Demarcates starting boundary of a function call |
| 248059 | </tool_call> | Demarcates ending boundary of a function call |
| 248066 | <tool_response> | Demarcates starting boundary of tool execution outputs |
| 248067 | </tool_response> | Demarcates ending boundary of tool execution outputs |
| 248068 | <think> | Start of hidden reasoning block |
| 248069 | </think> | End of hidden reasoning block |
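The IDs in the reference table can be turned into a small consistency check. The sketch below is one way to do it: `find_mismatches` is a hypothetical helper that accepts any callable mapping a token string to an ID, so it runs without downloading the model. With a loaded tokenizer, `tokenizer.convert_tokens_to_ids` could be passed as the lookup.

```python
# Token IDs as listed in the reference table above.
EXPECTED_TOKEN_IDS = {
    "<|endoftext|>": 248044,
    "<|im_start|>": 248045,
    "<|im_end|>": 248046,
    "<|vision_start|>": 248053,
    "<|vision_end|>": 248054,
    "<|image_pad|>": 248056,
    "<|video_pad|>": 248057,
    "<tool_call>": 248058,
    "</tool_call>": 248059,
    "<tool_response>": 248066,
    "</tool_response>": 248067,
    "<think>": 248068,
    "</think>": 248069,
}


def find_mismatches(lookup, expected=EXPECTED_TOKEN_IDS):
    """Return {token: (expected_id, actual_id)} for every token whose ID differs.

    `lookup` is any callable mapping a token string to an ID (or None).
    """
    mismatches = {}
    for token, expected_id in expected.items():
        actual_id = lookup(token)
        if actual_id != expected_id:
            mismatches[token] = (expected_id, actual_id)
    return mismatches


# Stand-in vocabulary with one deliberately wrong entry, for illustration only.
fake_vocab = {**EXPECTED_TOKEN_IDS, "<think>": 1}
print(find_mismatches(fake_vocab.get))   # {'<think>': (248068, 1)}
```

An empty result means every listed token resolves to the documented ID.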
Quick Start and Usage Examples
Installation
Ensure you have the required libraries installed:
pip install transformers pillow torch accelerate decord
1. Single Image Analysis (Standard Prompting)
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch
model_id = "raxcore-dev/Rax-4.5"
# Initialize model and processor
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Load a local image
image_path = "document_invoice.png"
image = Image.open(image_path).convert("RGB")
# Construct ChatML message structure
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Read this document and list all line items, their quantities, and total cost as a structured JSON object."}
        ]
    }
]
# Apply chat template and prepare inputs on the same device as the model
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
# Generate output with greedy decoding (temperature has no effect when do_sample=False)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False
    )
# Strip the prompt tokens, then decode and print the generated text
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, outputs)]
response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
2. Video Analysis
Rax 4.5 utilizes temporal frame pooling. Below is an example of passing video clips to the model.
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from decord import VideoReader, cpu
model_id = "raxcore-dev/Rax-4.5"
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Load video using decord and sample 8 frames evenly
vr = VideoReader("security_feed.mp4", ctx=cpu(0))
total_frames = len(vr)
frame_indices = [int(i * total_frames / 8) for i in range(8)]
frames = vr.get_batch(frame_indices).asnumpy()
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "Describe the main events occurring in this video sequence chronologically."}
        ]
    }
]
# Build the prompt and move inputs to the same device as the model
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], videos=[frames], return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, outputs)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
3. Visual Grounding and Object Detection
Rax 4.5 supports visual grounding. Bounding boxes are represented as normalized coordinates scaled to the range 0-1000, formatted as [ymin, xmin, ymax, xmax]. Bounding boxes are wrapped in <|box_start|> and <|box_end|> tags.
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch
model_id = "raxcore-dev/Rax-4.5"
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
image = Image.open("webpage_screenshot.png").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Locate the search input box on this screen."}
        ]
    }
]
# Build the prompt and move inputs to the same device as the model
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
# Keep special tokens so the grounding tags remain visible in the output
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, outputs)]
print(processor.batch_decode(generated_ids, skip_special_tokens=False)[0])
# Expected Output Format:
# <|object_ref_start|>search input box<|object_ref_end|><|box_start|>[150, 420, 192, 780]<|box_end|>
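Downstream code usually needs pixel coordinates rather than the 0-1000 normalized values. The sketch below parses a string in the expected output format shown above and rescales each box to a given image size. `parse_boxes` is a hypothetical helper, and the regular expression assumes exactly the tag layout from the example; real outputs may need more tolerant parsing.

```python
import re

BOX_PATTERN = re.compile(
    r"<\|object_ref_start\|>(?P<label>.*?)<\|object_ref_end\|>"
    r"<\|box_start\|>\[(?P<coords>[^\]]*)\]<\|box_end\|>"
)


def parse_boxes(text, image_width, image_height):
    """Convert 0-1000 normalized [ymin, xmin, ymax, xmax] boxes to pixel coordinates."""
    results = []
    for match in BOX_PATTERN.finditer(text):
        ymin, xmin, ymax, xmax = (int(v) for v in match.group("coords").split(","))
        results.append({
            "label": match.group("label"),
            # Pixel box as (left, top, right, bottom), the order PIL uses for cropping
            "box": (
                round(xmin * image_width / 1000),
                round(ymin * image_height / 1000),
                round(xmax * image_width / 1000),
                round(ymax * image_height / 1000),
            ),
        })
    return results


sample = "<|object_ref_start|>search input box<|object_ref_end|><|box_start|>[150, 420, 192, 780]<|box_end|>"
print(parse_boxes(sample, image_width=1920, image_height=1080))
# [{'label': 'search input box', 'box': (806, 162, 1498, 207)}]
```

The resulting tuple can be passed directly to `Image.crop` to extract the located region.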
4. Agentic Workflows and Tool Calling
Rax 4.5 can call external tools. The function schemas are provided to the processor, which injects them into the system prompt.
from transformers import AutoModelForVision2Seq, AutoProcessor
import torch
model_id = "raxcore-dev/Rax-4.5"
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Define available tools
tools = [
    {
        "name": "query_database",
        "description": "Query the user database for specific record details.",
        "parameters": {
            "type": "object",
            "properties": {
                "user_id": {"type": "string", "description": "The unique user identifier."},
                "fields": {"type": "array", "items": {"type": "string"}, "description": "List of columns to retrieve."}
            },
            "required": ["user_id", "fields"]
        }
    }
]
messages = [
    {"role": "user", "content": "Retrieve the registration date and email for user ID USR-88291."}
]
# Generate prompt with tools schema
prompt = processor.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(text=[prompt], return_tensors="pt").to(model.device)
# Generate structured tool call response
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
# Keep special tokens so the <tool_call> tags remain visible in the output
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, outputs)]
print(processor.batch_decode(generated_ids, skip_special_tokens=False)[0])
Expected Output Format:
<tool_call>
<function=query_database>
<parameter=user_id>
USR-88291
</parameter>
<parameter=fields>
["registration_date", "email"]
</parameter>
</function>
</tool_call>
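To dispatch the call, the tagged output has to be turned into a function name and an argument dictionary. The following is a minimal parsing sketch for the format shown above. `parse_tool_calls` is a hypothetical helper, and decoding each parameter as JSON with a fallback to the raw string is an assumption about how values are serialized.

```python
import json
import re

FUNCTION_PATTERN = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
PARAMETER_PATTERN = re.compile(r"<parameter=([^>]+)>\s*(.*?)\s*</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Extract {'name': ..., 'arguments': {...}} dicts from <tool_call> blocks."""
    calls = []
    for block in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        for name, body in FUNCTION_PATTERN.findall(block):
            arguments = {}
            for key, raw in PARAMETER_PATTERN.findall(body):
                try:
                    arguments[key] = json.loads(raw)   # lists, numbers, booleans
                except json.JSONDecodeError:
                    arguments[key] = raw               # plain strings such as IDs
            calls.append({"name": name, "arguments": arguments})
    return calls


sample = "\n".join([
    "<tool_call>",
    "<function=query_database>",
    "<parameter=user_id>",
    "USR-88291",
    "</parameter>",
    "<parameter=fields>",
    '["registration_date", "email"]',
    "</parameter>",
    "</function>",
    "</tool_call>",
])
print(parse_tool_calls(sample))
# [{'name': 'query_database', 'arguments': {'user_id': 'USR-88291', 'fields': ['registration_date', 'email']}}]
```

One consequence of the JSON-first decoding is that a purely numeric string such as `123` is returned as an integer, so callers that need strings should convert according to the tool schema.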
Technical Specifications
Core Architecture Reference Table
| Component / Layer Details | Specification | Hidden Size / Channels | Attention Head / Block Allocation |
|---|---|---|---|
| Text Backbone (Qwen 3.5 Core) | 24 layers | 2048 hidden size | 8 full attn heads / 16 linear heads |
| Vision ViT Encoder | 24 layers | 1024 (projected to 2048) | 16 heads / 16x16 patch projection |
| Embedding Projection | Unified Latent | 2048 output size | Linear adapter mapping per modality |
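As a rough sanity check on the 2B scale, the dimensions listed in this card can be multiplied out. The sketch below covers only the parts whose shapes are fully specified here: the token embedding table, the SwiGLU MLPs, and the full-attention projections. It assumes no biases and no extra gating, and it leaves out the linear-attention layers, the vision encoder, and any untied output head, so the subtotal is a lower bound rather than the model's parameter count.

```python
VOCAB_SIZE = 248_320
HIDDEN = 2048
INTERMEDIATE = 6144
NUM_LAYERS = 24
NUM_FULL_ATTN_LAYERS = 6
Q_HEADS, KV_HEADS, HEAD_DIM = 8, 2, 256

# Token embedding table.
embedding = VOCAB_SIZE * HIDDEN

# SwiGLU MLP: gate, up and down projections in every layer, biases ignored.
mlp = NUM_LAYERS * 3 * HIDDEN * INTERMEDIATE

# Full-attention layers: Q, K, V and output projections (assumes no gating or biases).
q_proj = HIDDEN * Q_HEADS * HEAD_DIM
kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM
o_proj = Q_HEADS * HEAD_DIM * HIDDEN
full_attention = NUM_FULL_ATTN_LAYERS * (q_proj + kv_proj + o_proj)

subtotal = embedding + mlp + full_attention

print(f"embedding:      {embedding:,}")        # 508,559,360
print(f"mlp:            {mlp:,}")              # 905,969,664
print(f"full attention: {full_attention:,}")   # 62,914,560
print(f"subtotal:       {subtotal:,}")         # 1,477,443,584
```

Under these assumptions the listed components already account for roughly 1.5 billion parameters, which is consistent with a total near 2 billion once the remaining components are included.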
Hardware and Resource Consumption Matrix
The resource footprint scales with context size and quantization precision. Treat the following figures as approximate hardware guidance:
| Quantization Precision | Context Length | Peak VRAM (Inference) | Prefill Throughput (tokens/sec) | Decoding Speed (tokens/sec) | Recommended GPU |
|---|---|---|---|---|---|
| BF16 (No quantization) | 8,000 tokens | ~4.8 GB | ~1,200 | ~85 | RTX 4060 Ti / L4 |
| BF16 (No quantization) | 128,000 tokens | ~8.2 GB | ~980 | ~72 | RTX 4090 / A10G |
| BF16 (No quantization) | 262,144 tokens | ~14.4 GB | ~750 | ~58 | A100 80GB / H100 |
| FP8 (W8A8) | 128,000 tokens | ~5.1 GB | ~1,650 | ~115 | L4 / RTX 4090 |
| INT4 (AWQ/GPTQ) | 128,000 tokens | ~3.2 GB | ~2,100 | ~140 | Edge Jetson Orin |
Vision-Language Evaluation Benchmarks
Below is a summary of evaluation results for Rax 4.5 compared to other 2B and 7B vision-language architectures.
| Benchmark Category | Evaluation Suite / Dataset | Rax 4.5 (2B Hybrid) | Standard 2B VLM | Standard 7B VLM | Metric |
|---|---|---|---|---|---|
| Document Vision | DocVQA | 84.2% | 81.5% | 85.9% | ANLS |
| Chart Reading | ChartQA | 79.1% | 75.3% | 81.8% | Score |
| Infographic QA | InfoVQA | 68.6% | 63.8% | 70.4% | Accuracy |
| OCR Performance | OCRBench | 821 | 782 | 854 | Score |
| Math and Logic | MathVista | 48.6% | 44.1% | 51.2% | Score |
| General Multimodal | MMBench | 67.8% | 63.4% | 69.1% | Score |
| Video Understanding | VideoQA | 62.1% | 56.4% | 63.8% | Acc |
| Object Grounding | RefCOCO (Val) | 81.3% | 78.0% | 83.5% | [email protected] |
Validation and Diagnostic Checkpoints
To ensure the model has loaded correctly and the multimodal adapter offsets are properly aligned on your local system, run the following verification checks.
1. Verification of Tokenizer Boundaries
Execute this check to verify special tokens are registered and map to correct integer IDs:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("raxcore-dev/Rax-4.5", trust_remote_code=True)
required_tokens = ["<|im_start|>", "<|im_end|>", "<think>", "</think>", "<|vision_start|>"]
for token in required_tokens:
    token_id = tokenizer.convert_tokens_to_ids(token)
    assert token_id is not None and token_id != tokenizer.unk_token_id, f"Special token {token} not configured properly!"
    print(f"Verified: {token:16} -> ID: {token_id}")
2. Context KV Cache Validation
Verify that the text layers alternate between linear and full attention as expected. This snippet prints the attention class of each layer so you can confirm the distribution:
import torch
from transformers import AutoModelForVision2Seq
model = AutoModelForVision2Seq.from_pretrained(
    "raxcore-dev/Rax-4.5",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="cpu"
)
# Access text layers to confirm linear/full attention distribution
layers = model.model.text_model.layers
for idx, layer in enumerate(layers):
    attn_type = layer.self_attn.__class__.__name__
    print(f"Layer {idx:02d}: {attn_type}")
Confirm that layers at index 3, 7, 11, 15, 19, 23 are configured as full attention blocks while all others are linear attention blocks.
Intended Uses and Limitations
Intended Use Cases
- Unified Document and Layout Auditing: Jointly parsing text scans, invoices, tables, and structural layout configurations.
- Edge Assistant and Web Agents: Navigating user interfaces, detecting interactive elements, and acting autonomously on-device.
- Embedded AI Systems: Deploying to memory-constrained devices (e.g., Nvidia Jetson Orin) for real-time camera integration.
- Visual Information Retrieval: Answering complex questions about visual media, charts, infographics, and temporal video feeds.
Limitations
- Ultra-Abstract Math/Logic: At 2 billion parameters, it may still experience reasoning failures on complex mathematical proofs compared to frontier 70B+ models.
- Niche Domain Visuals: Specialized scans (such as high-resolution medical MRI files or multi-band radar satellite images) require targeted domain fine-tuning.
- VRAM Overhead under Maximum Context: Processing the full 262K tokens with visual features still demands at least 12-16GB of dedicated VRAM, despite KV cache optimizations.
Citation
If you integrate Rax 4.5 into your research, development workflows, or production systems, please cite the model repository:
@misc{raxcore2026rax45,
title={Rax 4.5: Efficient Hybrid-Attention Vision Language Model},
author={Raxcore Team},
year={2026},
publisher={Hugging Face},
journal={Hugging Face Model Hub},
howpublished={\url{https://huggingface.co/raxcore-dev/Rax-4.5}}
}
License
This model is licensed under the Apache 2.0 License.
Contact: [email protected] | Website: raxcore.dev