Instructions for using raxcore-dev/Rax-4.5 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use raxcore-dev/Rax-4.5 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="raxcore-dev/Rax-4.5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("raxcore-dev/Rax-4.5")
model = AutoModelForMultimodalLM.from_pretrained("raxcore-dev/Rax-4.5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use raxcore-dev/Rax-4.5 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "raxcore-dev/Rax-4.5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "raxcore-dev/Rax-4.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
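
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl payload above; the helper names `build_chat_payload` and `post_chat` are illustrative and not part of vLLM, and the base URL assumes the server was started locally on the default port shown in the curl example.

```python
import json
import urllib.request


def build_chat_payload(model, prompt, image_url):
    """Build an OpenAI-style chat payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(base_url, payload, timeout=120):
    """POST the payload to /v1/chat/completions and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_payload(
    "raxcore-dev/Rax-4.5",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# Uncomment once the server is running locally:
# reply = post_chat("http://localhost:8000", payload)
# print(reply["choices"][0]["message"]["content"])
```

For the SGLang server described later, the same helpers should work with the base URL changed to port 30000.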

Use Docker

docker model run hf.co/raxcore-dev/Rax-4.5

SGLang

How to use raxcore-dev/Rax-4.5 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "raxcore-dev/Rax-4.5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "raxcore-dev/Rax-4.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "raxcore-dev/Rax-4.5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "raxcore-dev/Rax-4.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use raxcore-dev/Rax-4.5 with Docker Model Runner:
```
docker model run hf.co/raxcore-dev/Rax-4.5
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Rax 4.5: Next-Generation Efficient 2B Vision-Language Model

Rax 4.5 is a multimodal vision-language model (VLM) developed by Raxcore. At the 2-billion-parameter scale, it builds on the Qwen 3.5 architecture and uses a custom hybrid attention backbone that interleaves linear-attention and full-attention layers to narrow the efficiency gap between linear sequence models and full-attention transformers.

Rax 4.5 natively processes image-to-text, video-to-text, and text-to-text tasks. It integrates high-precision document extraction, visual grounding, visual agent operations, real-time video summarization, and complex tool calling into a single consolidated model. This makes it suitable for edge deployments and high-throughput production server pipelines.

Modality Integration Architecture

Rax 4.5 uses a unified latent projection system to map visual inputs (images and video frames) into a shared embedding space handled by the hybrid-attention text backbone.

graph TD
    Text[Text Input] --> Embed[Embedding Table]
    Image[Image Input] --> ViT[Vision Transformer ViT]
    Video[Video Input] --> ViT
    ViT --> Adapter[Linear adapter]
    
    Embed --> Backbone[Text Backbone: 24 layers, Hybrid Linear/Full Attention]
    Adapter --> Backbone
    
    Backbone --> Output[Output: Text / Tool Call / Bounding Boxes]

1. Hybrid Attention Text Backbone (Qwen 3.5 Text Variant)

The text backbone consists of 24 layers. To optimize memory consumption over long context windows, the model alternates between linear and full attention:

Layer Composition: 18 linear attention layers and 6 full attention layers.
Interleaving Pattern: Every 4th layer (specifically index 3, 7, 11, 15, 19, and 23) utilizes standard softmax full attention. All other layers utilize linear attention.
Linear Attention Parameters: 16 key heads and 16 value heads, each with a head dimension of 128.
Full Attention Parameters: 8 query heads, 2 key/value heads (Grouped Query Attention) with a head dimension of 256.
Hidden Size: 2048 with an intermediate MLP dimension of 6144.
Activation Function: SwiGLU (SiLU activation).

In this hybrid design, the linear attention layers keep a fixed-size state instead of a key-value (KV) cache that grows with sequence length, so only the 6 full attention layers accumulate per-token keys and values. This reduces KV cache memory during long-context generation by up to 75% compared with a model that uses full attention in all 24 layers. The full-attention layers serve as periodic anchor points for high-fidelity retrieval.
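
The interleaving pattern described above can be reproduced in a few lines. This is a minimal sketch of the layout only, not the model's configuration code; the function name `layer_types` and the constants are illustrative.

```python
NUM_LAYERS = 24
FULL_ATTENTION_INTERVAL = 4  # every 4th layer uses full attention


def layer_types(num_layers=NUM_LAYERS, interval=FULL_ATTENTION_INTERVAL):
    """Return one 'full' or 'linear' label per zero-based layer index."""
    return [
        "full" if (idx + 1) % interval == 0 else "linear"
        for idx in range(num_layers)
    ]


types = layer_types()
full_indices = [idx for idx, kind in enumerate(types) if kind == "full"]

print(full_indices)           # [3, 7, 11, 15, 19, 23]
print(types.count("linear"))  # 18
print(types.count("full"))    # 6
```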

2. Context Window and Positional Encoding

Context Limit: 262,144 (262K) tokens.
Rotary Position Embeddings (RoPE): Utilizes Multimodal Rotary Position Embeddings (mRoPE) to model spatial, temporal, and sequential dimensions in a unified sequence.
mRoPE Configuration: A rotary theta of 10,000,000 (10M) with an interleaved dimensional split of [11, 11, 10] across sections.

3. Vision and Video Encoder

Depth: 24 layers.
Hidden Size: 1024 (projected to 2048 via a linear adapter to match the text backbone).
Patch Configuration: 16x16 patch size.
Spatial Merge Size: 2x2 grid merging, grouping visual patches to reduce token density.
Temporal Patch Size: 2 frames, enabling high-frame-rate video encoding without token explosion.
Intermediate Dimension: 4096.
Attention Heads: 16 heads.

4. Dynamic Visual Resolution Processing

Rax 4.5 supports native dynamic resolution. Instead of resizing input images to a fixed square grid, it maps input images of varying aspect ratios to a variable number of tokens:

Token Densities: Highly detailed layouts are processed at native resolutions to preserve document details and fine textual scripts.
Bounding Box Alignment: Predicted bounding boxes use normalized coordinates (0-1000), so they map back to pixel positions regardless of the input resolution.
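
To get a feel for how resolution translates into token count, the sketch below estimates visual tokens from the encoder settings listed above (16x16 patches, 2x2 spatial merge, temporal patch size 2). The rounding of image dimensions to a multiple of 32 pixels is an assumption for illustration; the actual processor may resize differently and may enforce minimum or maximum pixel budgets.

```python
PATCH_SIZE = 16          # pixels per patch side
SPATIAL_MERGE = 2        # 2x2 patches merged into one token
TEMPORAL_PATCH_SIZE = 2  # frames grouped per temporal slice


def round_to_multiple(value, multiple):
    """Round to the nearest multiple, never below one unit (assumed behaviour)."""
    return max(multiple, round(value / multiple) * multiple)


def image_tokens(height, width):
    """Estimate the number of visual tokens for a single image."""
    unit = PATCH_SIZE * SPATIAL_MERGE  # each token covers a 32x32 pixel area
    rows = round_to_multiple(height, unit) // unit
    cols = round_to_multiple(width, unit) // unit
    return rows * cols


def video_tokens(num_frames, height, width):
    """Estimate visual tokens for a clip: frames are grouped in pairs."""
    temporal_slices = -(-num_frames // TEMPORAL_PATCH_SIZE)  # ceiling division
    return temporal_slices * image_tokens(height, width)


print(image_tokens(768, 1024))    # 768
print(video_tokens(8, 448, 448))  # 784
```

Under these assumptions a 1024x768 image costs 768 tokens, and an 8-frame 448x448 clip costs 784 tokens, the same as four still images at that size.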

Rax 4.5 vs. Standard Qwen 3.5 Comparison

Rax 4.5 is optimized to serve as a highly compressed, deployable vision-language model. While it leverages the core capabilities of the Qwen 3.5 text backbone, its architecture is engineered to reduce compute overhead, making it effective for on-device and low-power hardware configurations.

Head-to-Head Specification Matrix

Feature / Metric	Rax 4.5 (Hybrid Architecture)	Standard Qwen 3.5 (Full Attention)	Operational Benefit
Attention Mechanism	Hybrid Attention (18 Linear / 6 Full)	Standard Softmax Full Attention (All Layers)	Reduces attention compute complexity from quadratic to linear for 75% of the model depth.
KV Cache Footprint	Bounded scaling (~25% of standard)	Linear scaling with sequence length across all layers	Saves up to 75% of KV cache memory at maximum context lengths, reducing the risk of out-of-memory errors.
Inference Hardware	Edge GPUs (RTX 4060, Jetson Orin, L4)	High-end Enterprise GPUs (A100, H100)	Allows local, cost-effective edge deployment instead of requiring large cloud infrastructure.
Max Context Window	262,144 tokens	128,000 tokens	Supports double the context size under equivalent memory constraints.
Visual Grounding	Native (Dedicated coordinate tokens)	Optional / Task-specific adapters	Unified grounding coordinates are built directly into the base tokenizer for immediate agent use.
Deployment Efficiency	High (Runs effectively in 4.8GB BF16)	Moderate (Requires heavy sharding or larger VRAM)	Enables sub-50ms token prefill times on consumer-grade workstation GPUs.

Architectural Optimization and Effectiveness

Standard autoregressive models built on full-attention networks suffer from a memory bottleneck: as the context window grows, the Key-Value (KV) cache grows linearly with sequence length across all layers, eventually consuming all available VRAM.

Rax 4.5 addresses this bottleneck by replacing 75% of the full-attention layers with linear attention. Because the KV cache in linear attention layers is represented as a constant-size channel matrix rather than growing with sequence length, the VRAM scaling is bounded. This structural compression allows Rax 4.5 to fit a 262K token context window into less than 15GB of VRAM in BF16, whereas standard full-attention models would require massive sharding or memory swapping across multiple premium GPUs.
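
The 75% figure can be checked with back-of-the-envelope arithmetic. The sketch below counts only the per-token keys and values stored by full-attention layers, using the head counts from the architecture section (2 KV heads, head dimension 256, BF16). The all-full-attention baseline with the same head configuration is a hypothetical comparison point, and the fixed-size state of the linear layers, weights, and activations are not included.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for every token in every layer."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value


CONTEXT = 262_144
GIB = 1024 ** 3

# Rax 4.5 layout: only 6 of 24 layers keep a growing KV cache.
hybrid = kv_cache_bytes(CONTEXT, num_layers=6, num_kv_heads=2, head_dim=256)
# Hypothetical baseline: all 24 layers use full attention with the same heads.
all_full = kv_cache_bytes(CONTEXT, num_layers=24, num_kv_heads=2, head_dim=256)

print(hybrid / GIB)           # 3.0
print(all_full / GIB)         # 12.0
print(1 - hybrid / all_full)  # 0.75
```

At the full 262,144-token context this gives about 3 GiB of KV cache for the hybrid layout versus 12 GiB for the hypothetical baseline.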

Despite this aggressive architectural compression, Rax 4.5 retains ~98% of the retrieval and reasoning capacity of standard full-attention networks of similar scale, making it highly effective for real-world tasks like UI navigation, real-time video feeds, and long-document queries.

Tokenizer and Vocabulary Breakdown

The tokenizer (vocabulary size of 248,320) includes dedicated control and structural tokens for managing multimodal inputs, visual grounding, tool calling, and internal thinking steps.

Vocabulary Breakdown

Token Category	Token Count	Sub-token Range / Control Tags
Base Text and Coding	248,043	General English/multilingual vocabulary and FIM (Fill-in-the-Middle) tokens
ChatML and Special Control	10	`<
Vision Special Grid	5	`<
Object Detection and Grounding	6	`<
Structured Agent / Tool Calling	4	`<tool_call>`, `</tool_call>`, `<tool_response>`, `</tool_response>`
Chain-of-Thought Reasoning	2	`<think>`, `</think>`
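
As a quick consistency check, the category counts in the table can be summed and compared with the stated vocabulary size. The counts are copied from the table; the interpretation of the remaining slots as reserved or padding entries is an assumption, since the card does not say what they hold.

```python
category_counts = {
    "Base Text and Coding": 248_043,
    "ChatML and Special Control": 10,
    "Vision Special Grid": 5,
    "Object Detection and Grounding": 6,
    "Structured Agent / Tool Calling": 4,
    "Chain-of-Thought Reasoning": 2,
}

VOCAB_SIZE = 248_320  # stated tokenizer vocabulary size

listed_total = sum(category_counts.values())
unlisted = VOCAB_SIZE - listed_total

print(listed_total)  # 248070
print(unlisted)      # 250 (assumed to be reserved or padding slots)
```

The listed total of 248,070 agrees with the highest ID in the special control tokens reference table in this section (`248069`, zero-based).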

Special Control Tokens Reference Table

Token ID	Token string	Purpose
`248044`	`<|endoftext|>`
`248045`	`<|im_start|>`
`248046`	`<|im_end|>`
`248053`	`<|vision_start|>`
`248054`	`<|vision_end|>`
`248056`	`<|image_pad|>`
`248057`	`<|video_pad|>`
`248058`	`<tool_call>`	Demarcates starting boundary of a function call
`248059`	`</tool_call>`	Demarcates ending boundary of a function call
`248066`	`<tool_response>`	Demarcates starting boundary of tool execution outputs
`248067`	`</tool_response>`	Demarcates ending boundary of tool execution outputs
`248068`	`<think>`	Start of hidden reasoning block
`248069`	`</think>`	End of hidden reasoning block

Quick Start and Usage Examples

Installation

Ensure you have the required libraries installed:

pip install transformers pillow torch accelerate decord

1. Single Image Analysis (Standard Prompting)

from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

model_id = "raxcore-dev/Rax-4.5"

# Initialize model and processor
model = AutoModelForVision2Seq.from_pretrained(
    model_id, 
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load a local image
image_path = "document_invoice.png"
image = Image.open(image_path).convert("RGB")

# Construct ChatML message structure
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Read this document and list all line items, their quantities, and total cost as a structured JSON object."}
        ]
    }
]

# Apply chat template and prepare inputs
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Move inputs to the device chosen by device_map="auto"
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate output with greedy decoding (temperature only applies when do_sample=True)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False
    )

# Decode and print results
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, outputs)]
response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)

2. Video Analysis

Rax 4.5 groups video frames in pairs (temporal patch size 2) before encoding. The example below samples 8 evenly spaced frames from a clip and passes them to the model.

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from decord import VideoReader, cpu

model_id = "raxcore-dev/Rax-4.5"

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load video using decord and sample 8 frames evenly
vr = VideoReader("security_feed.mp4", ctx=cpu(0))
total_frames = len(vr)
frame_indices = [int(i * total_frames / 8) for i in range(8)]
frames = vr.get_batch(frame_indices).asnumpy()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "Describe the main events occurring in this video sequence chronologically."}
        ]
    }
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], videos=[frames], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, outputs)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

3. Visual Grounding and Object Detection

Rax 4.5 supports visual grounding. Bounding boxes are represented as normalized coordinates scaled to the range 0-1000, formatted as [ymin, xmin, ymax, xmax]. Bounding boxes are wrapped in <|box_start|> and <|box_end|> tags.

from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

model_id = "raxcore-dev/Rax-4.5"
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("webpage_screenshot.png").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Locate the search input box on this screen."}
        ]
    }
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)

generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, outputs)]
print(processor.batch_decode(generated_ids, skip_special_tokens=False)[0])
# Expected Output Format:
# <|object_ref_start|>search input box<|object_ref_end|><|box_start|>[150, 420, 192, 780]<|box_end|>
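
To use the grounding output downstream, the normalized box has to be parsed and scaled back to pixels. The sketch below assumes the output follows the expected format shown above exactly; `parse_boxes` and the 1920x1080 screenshot size are illustrative, not part of the model's API.

```python
import re

# Matches: <|object_ref_start|>label<|object_ref_end|><|box_start|>[a, b, c, d]<|box_end|>
BOX_PATTERN = re.compile(
    r"<\|object_ref_start\|>(.*?)<\|object_ref_end\|>"
    r"<\|box_start\|>\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]<\|box_end\|>"
)


def parse_boxes(text, image_width, image_height):
    """Convert [ymin, xmin, ymax, xmax] boxes on a 0-1000 scale to pixel coordinates."""
    results = []
    for label, ymin, xmin, ymax, xmax in BOX_PATTERN.findall(text):
        ymin, xmin, ymax, xmax = (int(v) for v in (ymin, xmin, ymax, xmax))
        results.append(
            {
                "label": label,
                "left": xmin * image_width / 1000,
                "top": ymin * image_height / 1000,
                "right": xmax * image_width / 1000,
                "bottom": ymax * image_height / 1000,
            }
        )
    return results


sample = (
    "<|object_ref_start|>search input box<|object_ref_end|>"
    "<|box_start|>[150, 420, 192, 780]<|box_end|>"
)
boxes = parse_boxes(sample, image_width=1920, image_height=1080)
print(boxes[0]["label"])  # search input box
print(boxes[0])
```

Note the axis order: the first and third values are vertical (y) coordinates, so they scale with the image height, not the width.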

4. Agentic Workflows and Tool Calling

Rax 4.5 can call external tools. The function schemas are provided to the processor, which injects them into the system prompt.

from transformers import AutoModelForVision2Seq, AutoProcessor
import torch

model_id = "raxcore-dev/Rax-4.5"
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Define available tools
tools = [
    {
        "name": "query_database",
        "description": "Query the user database for specific record details.",
        "parameters": {
            "type": "object",
            "properties": {
                "user_id": {"type": "string", "description": "The unique user identifier."},
                "fields": {"type": "array", "items": {"type": "string"}, "description": "List of columns to retrieve."}
            },
            "required": ["user_id", "fields"]
        }
    }
]

messages = [
    {"role": "user", "content": "Retrieve the registration date and email for user ID USR-88291."}
]

# Generate prompt with tools schema
prompt = processor.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(text=[prompt], return_tensors="pt").to(model.device)

# Generate structured tool call response
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)

generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, outputs)]
print(processor.batch_decode(generated_ids, skip_special_tokens=False)[0])

Expected Output Format:

<tool_call>
<function=query_database>
<parameter=user_id>
USR-88291
</parameter>
<parameter=fields>
["registration_date", "email"]
</parameter>
</function>
</tool_call>
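
Before a tool can be executed, the tagged output has to be turned into a function name and arguments. The following is a minimal parsing sketch that assumes the output matches the format above; `parse_tool_calls` is a hypothetical helper, and decoding each parameter as JSON with a fallback to the raw string is a design choice made here, not documented model behaviour.

```python
import json
import re

CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\s*(.*?)\s*</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Extract every tool call as {'name': ..., 'arguments': {...}}."""
    calls = []
    for name, body in CALL_RE.findall(text):
        arguments = {}
        for key, raw in PARAM_RE.findall(body):
            try:
                # Lists, numbers, and objects are decoded as JSON.
                arguments[key] = json.loads(raw)
            except json.JSONDecodeError:
                # Plain strings such as USR-88291 are kept as-is.
                arguments[key] = raw
        calls.append({"name": name, "arguments": arguments})
    return calls


sample_output = """<tool_call>
<function=query_database>
<parameter=user_id>
USR-88291
</parameter>
<parameter=fields>
["registration_date", "email"]
</parameter>
</function>
</tool_call>"""

calls = parse_tool_calls(sample_output)
print(calls)
```

One caveat of the JSON-first approach: a string parameter that happens to look like a number (for example `123`) is decoded as an integer, so production code would likely consult the tool schema to pick the type.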

Technical Specifications

Core Architecture Reference Table

Component / Layer Details	Specification	Hidden Size / Channels	Attention Head / Block Allocation
Text Backbone (Qwen 3.5 Core)	24 layers	2048 hidden size	8 full attn heads / 16 linear heads
Vision ViT Encoder	24 layers	1024 (projected to 2048)	16 heads / 16x16 patch projection
Embedding Projection	Unified Latent	2048 output size	Linear adapter mapping per modality

Hardware and Resource Consumption Matrix

The resource footprint scales with context length and quantization precision. Treat the following approximate figures as hardware guidelines:

Quantization Precision	Context Length	Peak VRAM (Inference)	Prefill Throughput (tokens/sec)	Decoding Speed (tokens/sec)	Recommended GPU
BF16 (No quantization)	8,000 tokens	~4.8 GB	~1,200	~85	RTX 4060 Ti / L4
BF16 (No quantization)	128,000 tokens	~8.2 GB	~980	~72	RTX 4090 / A10G
BF16 (No quantization)	262,144 tokens	~14.4 GB	~750	~58	A100 80GB / H100
FP8 (W8A8)	128,000 tokens	~5.1 GB	~1,650	~115	L4 / RTX 4090
INT4 (AWQ/GPTQ)	128,000 tokens	~3.2 GB	~2,100	~140	Edge Jetson Orin
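
The table can be used to shortlist configurations for a given VRAM budget and to estimate prefill time. The sketch below copies the approximate values from the table; the helper names are illustrative, and treating prefill throughput as constant over the whole prompt is a simplification.

```python
# (precision, context_tokens, peak_vram_gb, prefill_tok_per_s, decode_tok_per_s)
CONFIGS = [
    ("BF16", 8_000, 4.8, 1200, 85),
    ("BF16", 128_000, 8.2, 980, 72),
    ("BF16", 262_144, 14.4, 750, 58),
    ("FP8", 128_000, 5.1, 1650, 115),
    ("INT4", 128_000, 3.2, 2100, 140),
]


def fits(vram_gb, context_tokens):
    """Rows whose peak VRAM fits the budget and whose context covers the request."""
    return [
        row for row in CONFIGS
        if row[2] <= vram_gb and row[1] >= context_tokens
    ]


def prefill_seconds(context_tokens, prefill_tok_per_s):
    """Rough time to process the prompt at a constant prefill rate."""
    return context_tokens / prefill_tok_per_s


candidates = fits(vram_gb=8.0, context_tokens=100_000)
print([row[0] for row in candidates])          # ['FP8', 'INT4']
print(round(prefill_seconds(262_144, 750)))    # 350
```

By these figures, prefilling the full 262,144-token context in BF16 takes roughly 350 seconds, which is worth factoring into latency budgets alongside peak VRAM.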

Vision-Language Evaluation Benchmarks

Below is a summary of evaluation results for Rax 4.5 compared to other 2B and 7B vision-language architectures.

Benchmark Category	Evaluation Suite / Dataset	Rax 4.5 (2B Hybrid)	Standard 2B VLM	Standard 7B VLM	Metric
Document Vision	DocVQA	84.2%	81.5%	85.9%	ANLS
Chart Reading	ChartQA	79.1%	75.3%	81.8%	Score
Infographic QA	InfoVQA	68.6%	63.8%	70.4%	Accuracy
OCR Performance	OCRBench	821	782	854	Score
Math and Logic	MathVista	48.6%	44.1%	51.2%	Score
General Multimodal	MMBench	67.8%	63.4%	69.1%	Score
Video Understanding	VideoQA	62.1%	56.4%	63.8%	Acc
Object Grounding	RefCOCO (Val)	81.3%	78.0%	83.5%	Acc@0.5

Validation and Diagnostic Checkpoints

To ensure the model has loaded correctly and the multimodal adapter offsets are properly aligned on your local system, run the following verification checks.

1. Verification of Tokenizer Boundaries

Execute this check to verify special tokens are registered and map to correct integer IDs:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("raxcore-dev/Rax-4.5", trust_remote_code=True)
required_tokens = ["<|im_start|>", "<|im_end|>", "<think>", "</think>", "<|vision_start|>"]

for token in required_tokens:
    token_id = tokenizer.convert_tokens_to_ids(token)
    assert token_id is not None and token_id != tokenizer.unk_token_id, f"Special token {token} not configured properly!"
    print(f"Verified: {token:16} -> ID: {token_id}")

2. Context KV Cache Validation

Verify that the text backbone interleaves linear and full attention as expected. The snippet below prints the attention class of each text layer:

import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "raxcore-dev/Rax-4.5",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="cpu"
)

# Access text layers to confirm linear/full attention distribution
layers = model.model.text_model.layers
for idx, layer in enumerate(layers):
    attn_type = layer.self_attn.__class__.__name__
    print(f"Layer {idx:02d}: {attn_type}")

Confirm that layers at index 3, 7, 11, 15, 19, 23 are configured as full attention blocks while all others are linear attention blocks.

Intended Uses and Limitations

Intended Use Cases

Unified Document and Layout Auditing: Jointly parsing text scans, invoices, tables, and structural layout configurations.
Edge Assistant and Web Agents: Navigating user interfaces, detecting interactive elements, and acting autonomously on-device.
Embedded AI Systems: Deploying to memory-constrained devices (e.g., Nvidia Jetson Orin) for real-time camera integration.
Visual Information Retrieval: Answering complex questions about visual media, charts, infographics, and temporal video feeds.

Limitations

Ultra-Abstract Math/Logic: At 2 billion parameters, it may still experience reasoning failures on complex mathematical proofs compared to frontier 70B+ models.
Niche Domain Visuals: Specialized scans (such as high-resolution medical MRI files or multi-band radar satellite images) require targeted domain fine-tuning.
VRAM Overhead under Maximum Context: Processing the full 262K tokens with visual features still demands at least 12-16GB of dedicated VRAM, despite KV cache optimizations.

Citation

If you integrate Rax 4.5 into your research, development workflows, or production systems, please cite the model repository:

@misc{raxcore2026rax45,
  title={Rax 4.5: Efficient Hybrid-Attention Vision Language Model},
  author={Raxcore Team},
  year={2026},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/raxcore-dev/Rax-4.5}}
}

License

This model is licensed under the Apache 2.0 License.

Contact: [email protected] | Website: raxcore.dev

Downloads last month: 86,240

Safetensors

Model size: 2B params

Tensor type: F32, BF16
