Rax 4.5: Next-Generation Efficient 2B Vision-Language Model

Rax 4.5 is a state-of-the-art multimodal vision-language model (VLM) developed by Raxcore. At the 2-billion-parameter scale, Rax 4.5 builds on the Qwen 3.5 architecture and uses a custom hybrid attention backbone that interleaves linear-attention and full-attention layers, narrowing the efficiency gap between linear sequence models and full-attention transformers.

Rax 4.5 natively processes image-to-text, video-to-text, and text-to-text tasks. It integrates high-precision document extraction, visual grounding, visual agent operations, real-time video summarization, and complex tool calling into a single consolidated model. This makes it suitable for edge deployments and high-throughput production server pipelines.


Modality Integration Architecture

Rax 4.5 uses a unified latent projection system to map visual inputs (images and video frames) into a shared embedding space handled by the hybrid-attention text backbone.

graph TD
    Text[Text Input] --> Embed[Embedding Table]
    Image[Image Input] --> ViT[Vision Transformer ViT]
    Video[Video Input] --> ViT
    ViT --> Adapter[Linear adapter]
    
    Embed --> Backbone[Text Backbone: 24 layers, Hybrid Linear/Full Attention]
    Adapter --> Backbone
    
    Backbone --> Output[Output: Text / Tool Call / Bounding Boxes]

1. Hybrid Attention Text Backbone (Qwen 3.5 Text Variant)

The text backbone consists of 24 layers. To optimize memory consumption over long context windows, the model alternates between linear and full attention:

  • Layer Composition: 18 linear attention layers and 6 full attention layers.
  • Interleaving Pattern: Every 4th layer (zero-based indices 3, 7, 11, 15, 19, and 23) uses standard softmax full attention. All other layers use linear attention.
  • Linear Attention Parameters: 16 key heads and 16 value heads, each with a head dimension of 128.
  • Full Attention Parameters: 8 query heads, 2 key/value heads (Grouped Query Attention) with a head dimension of 256.
  • Hidden Size: 2048 with an intermediate MLP dimension of 6144.
  • Activation Function: SwiGLU (SiLU activation).

This hybrid approach keeps the key-value (KV) cache of the linear attention layers constant or near-constant, reducing memory overhead during long-context generation by up to 75% compared to standard attention models. The full-attention layers serve as periodic anchor points for high-fidelity retrieval.
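The interleaving pattern described above can be sketched in a few lines. This is an illustrative reconstruction from the numbers in this section, not the model's actual configuration code; the function name `layer_types` and its parameters are hypothetical.

```python
def layer_types(num_layers: int = 24, full_every: int = 4) -> list[str]:
    """Label each zero-based layer index as 'full' or 'linear'.

    Every `full_every`-th layer (indices 3, 7, ...) is full attention;
    all remaining layers are linear attention.
    """
    return ["full" if (i + 1) % full_every == 0 else "linear" for i in range(num_layers)]


schedule = layer_types()
full_indices = [i for i, kind in enumerate(schedule) if kind == "full"]

print(full_indices)                                       # [3, 7, 11, 15, 19, 23]
print(schedule.count("linear"), schedule.count("full"))   # 18 6
```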

2. Context Window and Positional Encoding

  • Context Limit: 262,144 (262K) tokens.
  • Rotary Position Embeddings (RoPE): Utilizes Multimodal Rotary Position Embeddings (mRoPE) to model spatial, temporal, and sequential dimensions in a unified sequence.
  • mRoPE Configuration: A rotary theta of 10,000,000 (10M) with an interleaved dimensional split of [11, 11, 10] across sections.
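The split listed above adds up to 11 + 11 + 10 = 32 rotary frequency pairs. The sketch below computes standard RoPE inverse frequencies for that many pairs with theta = 10,000,000 and shows one possible round-robin interleaving of three axes. The axis order (temporal, height, width), the mapping of 32 pairs to 64 rotated channels, and the exact interleaving order are illustrative assumptions that this card does not confirm; the helper names are hypothetical.

```python
import numpy as np

ROPE_THETA = 10_000_000.0
MROPE_SECTIONS = (11, 11, 10)  # assumed order: temporal, height, width


def rotary_inv_freq(sections=MROPE_SECTIONS, theta=ROPE_THETA) -> np.ndarray:
    """Standard RoPE inverse frequencies, one per rotated channel pair."""
    n_pairs = sum(sections)        # 32 frequency pairs
    rotary_dim = 2 * n_pairs       # 64 rotated channels (assumption)
    return theta ** (-np.arange(0, rotary_dim, 2) / rotary_dim)


def interleaved_axes(sections=MROPE_SECTIONS) -> list[str]:
    """Assign frequency pairs to axes round-robin (t, h, w, t, h, w, ...)
    until each axis has used up its section budget."""
    names = ("t", "h", "w")
    remaining = list(sections)
    axes = []
    while sum(remaining) > 0:
        for k, name in enumerate(names):
            if remaining[k] > 0:
                axes.append(name)
                remaining[k] -= 1
    return axes


inv_freq = rotary_inv_freq()
axes = interleaved_axes()
print(len(inv_freq), axes[:6])
```

Interleaving spreads high and low frequencies across all three axes instead of giving one axis only the slowest frequencies, which is the usual motivation for this kind of layout.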

3. Vision and Video Encoder

  • Depth: 24 layers.
  • Hidden Size: 1024 (projected to 2048 via a linear adapter to match the text backbone).
  • Patch Configuration: 16x16 patch size.
  • Spatial Merge Size: 2x2 grid merging, grouping visual patches to reduce token density.
  • Temporal Patch Size: 2 frames, enabling high-frame-rate video encoding without token explosion.
  • Intermediate Dimension: 4096.
  • Attention Heads: 16 heads.

4. Dynamic Visual Resolution Processing

Rax 4.5 supports native dynamic resolution. Instead of resizing input images to a fixed square grid, it maps input images of varying aspect ratios to a variable number of tokens:

  • Token Densities: Highly detailed layouts are processed at native resolutions to preserve document details and fine textual scripts.
  • Bounding Box Alignment: Visual features map onto normalized bounding box coordinates directly aligned with spatial coordinates.
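To make the relationship between resolution and token count concrete, here is a minimal sketch based on the encoder settings listed above (16x16 patches, 2x2 spatial merge, temporal patch size of 2). It assumes the image has already been resized so that both sides are multiples of 32 pixels, and it ignores any wrapper tokens the processor may add; the function names are hypothetical and the processor's real resizing rules may differ.

```python
def image_tokens(height: int, width: int, patch: int = 16, merge: int = 2) -> int:
    """Visual tokens for one image after patching and 2x2 spatial merging."""
    unit = patch * merge  # 32 pixels per merged token along each side
    if height % unit or width % unit:
        raise ValueError(f"height and width must be multiples of {unit}")
    return (height // unit) * (width // unit)


def video_tokens(num_frames: int, height: int, width: int, temporal_patch: int = 2) -> int:
    """Visual tokens for a clip, with frames grouped in pairs along time."""
    if num_frames % temporal_patch:
        raise ValueError(f"num_frames must be a multiple of {temporal_patch}")
    return (num_frames // temporal_patch) * image_tokens(height, width)


print(image_tokens(768, 1024))     # 768  (24 x 32 merged patches)
print(video_tokens(8, 448, 448))   # 784  (4 frame pairs x 14 x 14)
```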

Rax 4.5 vs. Standard Qwen 3.5 Comparison

Rax 4.5 is optimized to serve as a highly compressed, deployable vision-language model. While it leverages the core capabilities of the Qwen 3.5 text backbone, its architecture is engineered to reduce compute overhead, making it effective for on-device and low-power hardware configurations.

Head-to-Head Specification Matrix

| Feature / Metric | Rax 4.5 (Hybrid Architecture) | Standard Qwen 3.5 (Full Attention) | Operational Benefit |
| --- | --- | --- | --- |
| Attention Mechanism | Hybrid Attention (18 Linear / 6 Full) | Standard Softmax Full Attention (All Layers) | Reduces attention compute complexity from quadratic to linear for 75% of the model depth. |
| KV Cache Footprint | Static/Bounded scaling (~25% of standard) | Grows linearly with sequence length across all layers | Saves up to 75% of KV cache VRAM at maximum context lengths, reducing the risk of out-of-memory errors. |
| Inference Hardware | Edge GPUs (RTX 4060, Jetson Orin, L4) | High-end Enterprise GPUs (A100, H100) | Allows local, cost-effective edge deployment instead of requiring large cloud infrastructure. |
| Max Context Window | 262,144 tokens | 128,000 tokens | Supports double the context size under equivalent memory constraints. |
| Visual Grounding | Native (Dedicated coordinate tokens) | Optional / Task-specific adapters | Unified grounding coordinates are built directly into the base tokenizer for immediate agent use. |
| Deployment Efficiency | High (Runs effectively in 4.8GB BF16) | Moderate (Requires heavy sharding or larger VRAM) | Enables sub-50ms token prefill times on consumer-grade workstation GPUs. |

Architectural Optimization and Effectiveness

Standard autoregressive models built on full-attention networks suffer from a memory bottleneck: as the context window grows, the Key-Value (KV) cache grows linearly with sequence length across all layers, eventually consuming all available VRAM.

Rax 4.5 addresses this bottleneck by using linear attention in 75% of its layers. A linear attention layer keeps a constant-size state matrix instead of a cache that grows with sequence length, so only the six full-attention layers contribute a growing KV cache and VRAM scaling stays bounded. This structural compression allows Rax 4.5 to fit a 262K token context window into less than 15GB of VRAM in BF16, whereas standard full-attention models would require massive sharding or memory swapping across multiple premium GPUs.
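The 75% figure can be checked with back-of-the-envelope arithmetic using the full-attention parameters listed earlier (2 key/value heads, head dimension 256, BF16). The sketch below counts only the growing KV cache of full-attention layers; it ignores the constant-size linear-attention state, model weights, activations, and visual features. The "dense" case is a hypothetical 24-layer model with the same grouped-query configuration in every layer, and the function name is illustrative.

```python
GIB = 1024 ** 3


def kv_cache_bytes(seq_len: int, num_layers: int, kv_heads: int = 2,
                   head_dim: int = 256, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for `num_layers` full-attention layers."""
    per_token = 2 * num_layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * seq_len


context = 262_144
hybrid = kv_cache_bytes(context, num_layers=6)    # only the full-attention layers
dense = kv_cache_bytes(context, num_layers=24)    # hypothetical all-full-attention model

print(hybrid / GIB, dense / GIB)   # 3.0 12.0
print(1 - hybrid / dense)          # 0.75
```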

Despite this aggressive architectural compression, Rax 4.5 retains ~98% of the retrieval and reasoning capacity of standard full-attention networks of similar scale, making it highly effective for real-world tasks like UI navigation, real-time video feeds, and long-document queries.


Tokenizer and Vocabulary Breakdown

The tokenizer (vocabulary size of 248,320) includes dedicated control and structural tokens for managing multimodal inputs, visual grounding, tool calling, and internal thinking steps.

Vocabulary Breakdown

| Token Category | Token Count | Sub-token Range / Control Tags |
| --- | --- | --- |
| Base Text and Coding | 248,043 | General English/multilingual vocabulary and FIM (Fill-in-the-Middle) tokens |
| ChatML and Special Control | 10 | `<\|…` |
| Vision Special Grid | 5 | `<\|…` |
| Object Detection and Grounding | 6 | `<\|…` |
| Structured Agent / Tool Calling | 4 | `<tool_call>`, `</tool_call>`, `<tool_response>`, `</tool_response>` |
| Chain-of-Thought Reasoning | 2 | `<think>`, `</think>` |

Special Control Tokens Reference Table

| Token ID | Token string | Purpose |
| --- | --- | --- |
| 248044 | `<\|endoftext\|>` | |
| 248045 | `<\|im_start\|>` | |
| 248046 | `<\|im_end\|>` | |
| 248053 | `<\|vision_start\|>` | |
| 248054 | `<\|vision_end\|>` | |
| 248056 | `<\|image_pad\|>` | |
| 248057 | `<\|video_pad\|>` | |
| 248058 | `<tool_call>` | Demarcates starting boundary of a function call |
| 248059 | `</tool_call>` | Demarcates ending boundary of a function call |
| 248066 | `<tool_response>` | Demarcates starting boundary of tool execution outputs |
| 248067 | `</tool_response>` | Demarcates ending boundary of tool execution outputs |
| 248068 | `<think>` | Start of hidden reasoning block |
| 248069 | `</think>` | End of hidden reasoning block |
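Since the reasoning block is delimited by `<think>` and `</think>`, one way to separate it from the final answer is to post-process the decoded text. The helper below is a minimal sketch, assuming the output was decoded so that the tags are still present (for example with `skip_special_tokens=False`); `split_reasoning` is a hypothetical name, not part of the model's API.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(decoded: str) -> tuple[str, str]:
    """Return (reasoning, answer) from decoded text that may contain <think> blocks."""
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(decoded))
    answer = THINK_RE.sub("", decoded).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2 + 2 is 4</think>\nThe answer is 4.")
print(reasoning)   # 2 + 2 is 4
print(answer)      # The answer is 4.
```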

Quick Start and Usage Examples

Installation

Ensure you have the required libraries installed:

pip install transformers pillow torch accelerate decord

1. Single Image Analysis (Standard Prompting)

from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

model_id = "raxcore-dev/Rax-4.5"

# Initialize model and processor
model = AutoModelForVision2Seq.from_pretrained(
    model_id, 
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load a local image
image_path = "document_invoice.png"
image = Image.open(image_path).convert("RGB")

# Construct ChatML message structure
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Read this document and list all line items, their quantities, and total cost as a structured JSON object."}
        ]
    }
]

# Apply chat template and prepare inputs
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate output
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False  # greedy decoding; temperature has no effect when sampling is off
    )

# Decode and print results
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, outputs)]
response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)

2. Video Analysis

Rax 4.5 utilizes temporal frame pooling. Below is an example of passing video clips to the model.

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from decord import VideoReader, cpu

model_id = "raxcore-dev/Rax-4.5"

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load video using decord and sample 8 frames evenly
vr = VideoReader("security_feed.mp4", ctx=cpu(0))
total_frames = len(vr)
frame_indices = [int(i * total_frames / 8) for i in range(8)]
frames = vr.get_batch(frame_indices).asnumpy()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "Describe the main events occurring in this video sequence chronologically."}
        ]
    }
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], videos=[frames], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, outputs)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
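The sampling expression in the example above always starts at frame 0 and never selects the final frame. As a hedged alternative, the sketch below spreads the indices across the whole clip and rounds the requested count up to a multiple of the temporal patch size (2). Whether the processor actually requires an even frame count is an assumption, and `sample_frame_indices` is a hypothetical helper.

```python
import numpy as np


def sample_frame_indices(total_frames: int, num_frames: int = 8,
                         temporal_patch: int = 2) -> list[int]:
    """Evenly spaced frame indices covering the first and last frame."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    # Round the requested count up to a multiple of the temporal patch size.
    num_frames = max(temporal_patch, num_frames + (-num_frames) % temporal_patch)
    positions = np.linspace(0, total_frames - 1, num=num_frames)
    return np.rint(positions).astype(int).tolist()


print(sample_frame_indices(100, 8))   # [0, 14, 28, 42, 57, 71, 85, 99]
```

The returned list can be passed to `vr.get_batch(...)` in place of `frame_indices`. For clips shorter than the requested count, some indices repeat.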

3. Visual Grounding and Object Detection

Rax 4.5 supports visual grounding. Bounding boxes are represented as normalized coordinates scaled to the range 0-1000, formatted as [ymin, xmin, ymax, xmax]. Bounding boxes are wrapped in <|box_start|> and <|box_end|> tags.

from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

model_id = "raxcore-dev/Rax-4.5"
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("webpage_screenshot.png").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Locate the search input box on this screen."}
        ]
    }
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)

generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, outputs)]
print(processor.batch_decode(generated_ids, skip_special_tokens=False)[0])
# Expected Output Format:
# <|object_ref_start|>search input box<|object_ref_end|><|box_start|>[150, 420, 192, 780]<|box_end|>
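To use a predicted box for cropping or clicking, the 0-1000 normalized coordinates have to be scaled to pixels. The following sketch parses the tag format shown in the expected output and converts each `[ymin, xmin, ymax, xmax]` box into a `(left, top, right, bottom)` tuple, the order that `PIL.Image.crop` accepts. The regular expression assumes the exact layout shown above, and `boxes_to_pixels` is a hypothetical helper.

```python
import re

BOX_RE = re.compile(
    r"<\|box_start\|>\s*\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\s*<\|box_end\|>"
)


def boxes_to_pixels(text: str, image_width: int, image_height: int) -> list[tuple[int, int, int, int]]:
    """Convert 0-1000 normalized [ymin, xmin, ymax, xmax] boxes to pixel (left, top, right, bottom)."""
    boxes = []
    for ymin, xmin, ymax, xmax in BOX_RE.findall(text):
        left = round(int(xmin) * image_width / 1000)
        top = round(int(ymin) * image_height / 1000)
        right = round(int(xmax) * image_width / 1000)
        bottom = round(int(ymax) * image_height / 1000)
        boxes.append((left, top, right, bottom))
    return boxes


sample = "<|object_ref_start|>search input box<|object_ref_end|><|box_start|>[150, 420, 192, 780]<|box_end|>"
print(boxes_to_pixels(sample, image_width=1920, image_height=1080))
# [(806, 162, 1498, 207)]
```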

4. Agentic Workflows and Tool Calling

Rax 4.5 can call external tools. The function schemas are provided to the processor, which injects them into the system prompt.

from transformers import AutoModelForVision2Seq, AutoProcessor
import torch

model_id = "raxcore-dev/Rax-4.5"
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Define available tools
tools = [
    {
        "name": "query_database",
        "description": "Query the user database for specific record details.",
        "parameters": {
            "type": "object",
            "properties": {
                "user_id": {"type": "string", "description": "The unique user identifier."},
                "fields": {"type": "array", "items": {"type": "string"}, "description": "List of columns to retrieve."}
            },
            "required": ["user_id", "fields"]
        }
    }
]

messages = [
    {"role": "user", "content": "Retrieve the registration date and email for user ID USR-88291."}
]

# Generate prompt with tools schema
prompt = processor.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(text=[prompt], return_tensors="pt").to(model.device)

# Generate structured tool call response
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)

generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, outputs)]
print(processor.batch_decode(generated_ids, skip_special_tokens=False)[0])

Expected Output Format:

<tool_call>
<function=query_database>
<parameter=user_id>
USR-88291
</parameter>
<parameter=fields>
["registration_date", "email"]
</parameter>
</function>
</tool_call>
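Before a tool can be executed, the tagged output has to be turned into a function name and an argument dictionary. The parser below is a minimal sketch for the format shown above. It tries to decode each parameter value as JSON and falls back to the raw string, which is an assumption about how values are encoded: a bare value such as `123` would become an integer. The name `parse_tool_calls` is hypothetical.

```python
import json
import re

CALL_RE = re.compile(r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\s*(.*?)\s*</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract every tool call in `text` as {'name': ..., 'arguments': {...}}."""
    calls = []
    for name, body in CALL_RE.findall(text):
        arguments = {}
        for key, raw in PARAM_RE.findall(body):
            try:
                arguments[key] = json.loads(raw)   # lists, numbers, quoted strings
            except json.JSONDecodeError:
                arguments[key] = raw               # plain text such as USR-88291
        calls.append({"name": name, "arguments": arguments})
    return calls


example = """<tool_call>
<function=query_database>
<parameter=user_id>
USR-88291
</parameter>
<parameter=fields>
["registration_date", "email"]
</parameter>
</function>
</tool_call>"""

print(parse_tool_calls(example))
# [{'name': 'query_database', 'arguments': {'user_id': 'USR-88291', 'fields': ['registration_date', 'email']}}]
```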

Technical Specifications

Core Architecture Reference Table

| Component / Layer | Specification | Hidden Size / Channels | Attention Heads / Block Allocation |
| --- | --- | --- | --- |
| Text Backbone (Qwen 3.5 Core) | 24 layers | 2048 hidden size | 8 full attn heads / 16 linear heads |
| Vision ViT Encoder | 24 layers | 1024 (projected to 2048) | 16 heads / 16x16 patch projection |
| Embedding Projection | Unified Latent | 2048 output size | Linear adapter mapping per modality |

Hardware and Resource Consumption Matrix

The resource footprint scales with context length and quantization precision. Treat the following approximate figures as hardware sizing guidance:

| Quantization Precision | Context Length | Peak VRAM (Inference) | Prefill Throughput (tokens/sec) | Decoding Speed (tokens/sec) | Recommended GPU |
| --- | --- | --- | --- | --- | --- |
| BF16 (No quantization) | 8,000 tokens | ~4.8 GB | ~1,200 | ~85 | RTX 4060 Ti / L4 |
| BF16 (No quantization) | 128,000 tokens | ~8.2 GB | ~980 | ~72 | RTX 4090 / A10G |
| BF16 (No quantization) | 262,144 tokens | ~14.4 GB | ~750 | ~58 | A100 80GB / H100 |
| FP8 (W8A8) | 128,000 tokens | ~5.1 GB | ~1,650 | ~115 | L4 / RTX 4090 |
| INT4 (AWQ/GPTQ) | 128,000 tokens | ~3.2 GB | ~2,100 | ~140 | Edge Jetson Orin |
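As a rough planning aid, the figures in the table above can be encoded as data and filtered against a VRAM budget. This sketch copies the peak VRAM values directly from the table; the `headroom` factor, which reserves 10% of VRAM for other processes, and the function name `fitting_profiles` are hypothetical choices, not recommendations from the model authors.

```python
# (precision, context_tokens, peak_vram_gb) copied from the table above
PROFILES = [
    ("BF16", 8_000, 4.8),
    ("BF16", 128_000, 8.2),
    ("BF16", 262_144, 14.4),
    ("FP8", 128_000, 5.1),
    ("INT4", 128_000, 3.2),
]


def fitting_profiles(vram_gb: float, min_context: int, headroom: float = 0.9):
    """Return table rows that meet the context need within `headroom` of available VRAM."""
    budget = vram_gb * headroom
    return [
        (precision, context, vram)
        for precision, context, vram in PROFILES
        if context >= min_context and vram <= budget
    ]


print(fitting_profiles(vram_gb=8.0, min_context=100_000))
# [('FP8', 128000, 5.1), ('INT4', 128000, 3.2)]
```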

Vision-Language Evaluation Benchmarks

Below is a summary of evaluation results for Rax 4.5 compared to other 2B and 7B vision-language architectures.

| Benchmark Category | Evaluation Suite / Dataset | Rax 4.5 (2B Hybrid) | Standard 2B VLM | Standard 7B VLM | Metric |
| --- | --- | --- | --- | --- | --- |
| Document Vision | DocVQA | 84.2% | 81.5% | 85.9% | ANLS |
| Chart Reading | ChartQA | 79.1% | 75.3% | 81.8% | Score |
| Infographic QA | InfoVQA | 68.6% | 63.8% | 70.4% | Accuracy |
| OCR Performance | OCRBench | 821 | 782 | 854 | Score |
| Math and Logic | MathVista | 48.6% | 44.1% | 51.2% | Score |
| General Multimodal | MMBench | 67.8% | 63.4% | 69.1% | Score |
| Video Understanding | VideoQA | 62.1% | 56.4% | 63.8% | Acc |
| Object Grounding | RefCOCO (Val) | 81.3% | 78.0% | 83.5% | Acc@0.5 |

Validation and Diagnostic Checkpoints

To ensure the model has loaded correctly and the multimodal adapter offsets are properly aligned on your local system, run the following verification checks.

1. Verification of Tokenizer Boundaries

Run this check to verify that the special tokens are registered in the tokenizer, then compare the printed integer IDs against the Special Control Tokens Reference Table:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("raxcore-dev/Rax-4.5", trust_remote_code=True)
required_tokens = ["<|im_start|>", "<|im_end|>", "<think>", "</think>", "<|vision_start|>"]

for token in required_tokens:
    token_id = tokenizer.convert_tokens_to_ids(token)
    assert token_id is not None and token_id != tokenizer.unk_token_id, f"Special token {token} not configured properly!"
    print(f"Verified: {token:16} -> ID: {token_id}")

2. Context KV Cache Validation

Verify that the hybrid-attention text layers are correctly switching. Run this snippet to confirm linear/full attention distribution:

import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "raxcore-dev/Rax-4.5",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="cpu"
)

# Access text layers to confirm linear/full attention distribution
layers = model.model.text_model.layers
for idx, layer in enumerate(layers):
    attn_type = layer.self_attn.__class__.__name__
    print(f"Layer {idx:02d}: {attn_type}")

Confirm that the layers at indices 3, 7, 11, 15, 19, and 23 are reported as full attention blocks and that all other layers are linear attention blocks.


Intended Uses and Limitations

Intended Use Cases

  • Unified Document and Layout Auditing: Jointly parsing text scans, invoices, tables, and structural layout configurations.
  • Edge Assistant and Web Agents: Navigating user interfaces, detecting interactive elements, and acting autonomously on-device.
  • Embedded AI Systems: Deploying to memory-constrained devices (e.g., Nvidia Jetson Orin) for real-time camera integration.
  • Visual Information Retrieval: Answering complex questions about visual media, charts, infographics, and temporal video feeds.

Limitations

  • Ultra-Abstract Math/Logic: At 2 billion parameters, it may still experience reasoning failures on complex mathematical proofs compared to frontier 70B+ models.
  • Niche Domain Visuals: Specialized scans (such as high-resolution medical MRI files or multi-band radar satellite images) require targeted domain fine-tuning.
  • VRAM Overhead under Maximum Context: Processing the full 262K tokens with visual features still demands at least 12-16GB of dedicated VRAM, despite KV cache optimizations.

Citation

If you integrate Rax 4.5 into your research, development workflows, or production systems, please cite the model repository:

@misc{raxcore2026rax45,
  title={Rax 4.5: Efficient Hybrid-Attention Vision Language Model},
  author={Raxcore Team},
  year={2026},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/raxcore-dev/Rax-4.5}}
}

License

This model is licensed under the Apache 2.0 License.

Website: raxcore.dev
