Tri-21B

Introduction

A 4-bit quantized version of Tri-21B.

We introduce Tri-21B, our flagship large language model that redefines the efficiency frontier in LLM training. By achieving state-of-the-art performance with only 2.3T training tokens, we demonstrate that exceptional capabilities don't require excessive computational resources.

Figure: Average Performance vs. Approximate Training FLOPs

Key Highlights

  • Unprecedented Training Efficiency: Trained on just 2.3T tokens—significantly less than comparable models—while achieving 70.3% average accuracy across MMLU/KMMLU/Global MMLU benchmarks
  • Pushing the Pareto Frontier: With only 2.95E+23 FLOPs, Tri-21B outperforms models requiring 2-10x more compute, setting a new standard for efficient scaling
  • Enhanced Reasoning: Modified training dataset mixture specifically optimized for reasoning capabilities
  • Advanced Post-Training: Significantly improved RL training pipeline focusing on mathematical reasoning and everyday usage
  • Multi-lingual: Specially optimized for Korean, English, and Japanese.

Our Tri-21B represents a paradigm shift in efficient model development. When comparing performance to training FLOPs, our model dramatically pushes the Pareto frontier—achieving performance comparable to or exceeding models like Qwen2.5-32B (74.6% at 3.46E+24 FLOPs) and Gemma 3 IT 27B (67.6% at 2.27E+24 FLOPs) while using approximately 8-12x fewer computational resources.

Model Specifications

Tri-21B

  • Type: Causal Language Model
  • Training Stage: Pre-training & Post-training
  • Architecture: Transformer Decoder with RoPE, SwiGLU, RMSNorm, and GQA
  • Number of Parameters: 20.73B
  • Number of Layers: 32
  • Number of Attention Heads: 32 (Query) / 8 (Key, Value)
  • Context Length: 8,192
  • Number of Tokens Seen: 2.3T
  • Vocab Size: 124,416
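
The reported training-compute figure can be sanity-checked from these numbers with the common 6·N·D rule of thumb (roughly 6 FLOPs per parameter per training token). A minimal back-of-the-envelope sketch; the config field names shown are the usual Llama-style Hugging Face ones and are an assumption, not taken from this repository:

from transformers import AutoConfig

# Inspect the published architecture (field names assume a Llama-style config
# and may differ for this model).
config = AutoConfig.from_pretrained("trillionlabs/Tri-21B")
print(config.num_hidden_layers)    # expected: 32
print(config.num_key_value_heads)  # expected: 8 (GQA)
print(config.vocab_size)           # expected: 124416

# Back-of-the-envelope training compute: ~6 FLOPs per parameter per token.
n_params = 20.73e9  # parameters
n_tokens = 2.3e12   # training tokens seen
print(f"{6 * n_params * n_tokens:.2e}")  # ~2.9e23, in line with the reported 2.95E+23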

Training Efficiency Analysis

Our approach to training efficiency sets new benchmarks in the field. The following comparison demonstrates how Tri-21B achieves superior performance per FLOP compared to other state-of-the-art models of similar scale:

| Model | FLOPs | Avg. Accuracy¹ | Efficiency Ratio² |
| --- | --- | --- | --- |
| Tri-21B | 2.95E+23 | 70.3% | 1.00x (baseline) |
| Gemma2-9b | 4.42E+23 | 61.5% | 0.48x |
| Qwen2.5-7B | 8.22E+23 | 63.4% | 0.29x |
| Exaone-3.5-32B | 1.25E+24 | 58.5% | 0.19x |
| Gemma 3 IT 27B | 2.27E+24 | 67.6% | 0.11x |
| Qwen2.5-32B | 3.46E+24 | 74.6% | 0.10x |
| Qwen3-32B | 5.77E+24 | 73.5% | 0.06x |

¹ Average of MMLU / KMMLU / Global MMLU (ja)
² Performance per FLOP relative to Tri-21B
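
The efficiency ratio is performance per FLOP, normalized so that Tri-21B is 1.00x. A minimal sketch of that calculation using the figures from the table above; it illustrates the definition in footnote 2, and the published column may round or normalize slightly differently:

# Efficiency ratio = (accuracy / FLOPs), normalized to Tri-21B.
models = {
    "Tri-21B":        (2.95e23, 70.3),
    "Gemma2-9b":      (4.42e23, 61.5),
    "Qwen2.5-7B":     (8.22e23, 63.4),
    "Exaone-3.5-32B": (1.25e24, 58.5),
    "Gemma 3 IT 27B": (2.27e24, 67.6),
    "Qwen2.5-32B":    (3.46e24, 74.6),
    "Qwen3-32B":      (5.77e24, 73.5),
}

baseline = models["Tri-21B"][1] / models["Tri-21B"][0]
for name, (flops, accuracy) in models.items():
    print(f"{name:<16} {(accuracy / flops) / baseline:.2f}x")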

This efficiency breakthrough enables organizations to deploy state-of-the-art language models without the traditional computational barriers, democratizing access to advanced AI capabilities.

Quickstart

Here is a code snippet with apply_chat_template that demonstrates how to load the tokenizer and model and generate text.

Tri-21B Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "trillionlabs/Tri-21B"

# Load the model in bfloat16 and shard it across available devices
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Explain the concept of quantum computing in simple terms."
messages = [
    {"role": "user", "content": prompt}
]
# Apply the chat template and append the assistant generation prompt
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so only newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
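
The snippet above uses the model's default generation settings. Sampling can be enabled through the standard generate arguments; the values below are illustrative only, not tuning recommendations from the model authors:

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.7,  # illustrative value
    top_p=0.9,        # illustrative value
)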

vLLM, SGLang Deployment

Tri-21B is also available with vLLM and SGLang!

# vLLM
vllm serve trillionlabs/Tri-21B --dtype bfloat16 --max-model-len 8192

# vLLM with custom options
vllm serve trillionlabs/Tri-21B \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.95 \
    --port 8000

# SGLang
python3 -m sglang.launch_server --model-path trillionlabs/Tri-21B --dtype bfloat16

# SGLang with custom options
python3 -m sglang.launch_server \
    --model-path trillionlabs/Tri-21B \
    --dtype bfloat16 \
    --context-length 8192 \
    --port 30000 \
    --host 0.0.0.0
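
Both servers expose an OpenAI-compatible API, so a running endpoint can be queried with the standard OpenAI client. A minimal sketch, assuming the vLLM server above on port 8000 (use port 30000 for the SGLang defaults shown):

# Query the OpenAI-compatible endpoint exposed by vLLM (or SGLang).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="trillionlabs/Tri-21B",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    max_tokens=512,
)
print(response.choices[0].message.content)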

Evaluation

We evaluated Tri-21B across a comprehensive suite of benchmarks assessing general reasoning, knowledge recall, coding abilities, mathematical reasoning, and instruction-following capabilities. We compare our model against state-of-the-art models of similar scale, Gemma-3-IT-27B and Qwen3-32B, to demonstrate its competitive performance.

Full Evaluation Settings

| Benchmark | Language | Evaluation Setting | Metric |
| --- | --- | --- | --- |
| **General Reasoning and Factuality** | | | |
| HellaSwag | English | 0-shot | accuracy |
| ARC:C | English | 0-shot | accuracy |
| HAERAE | Korean | 3-shot | accuracy |
| CLIcK | Korean | 0-shot | accuracy |
| KoBEST | Korean | 5-shot | accuracy |
| **Knowledge and Reasoning** | | | |
| KMMLU | Korean | 5-shot (0-shot, CoT) | accuracy (exact-match) |
| MMLU | English | 5-shot (0-shot, CoT) | accuracy (exact-match) |
| MMLU-Pro | English | 0-shot, CoT | exact-match |
| Global-MMLU-Lite-ja | Japanese | 5-shot | accuracy |
| **Coding** | | | |
| HumanEval | English | 0-shot | pass@1 |
| MBPPPlus | English | 0-shot | pass@1 |
| **Mathematical Reasoning** | | | |
| GSM8k | English | 0-shot, CoT | exact-match |
| MATH | English | 0-shot, CoT | exact-match |
| GPQA | English | 4-shot | accuracy |
| GPQA Diamond | English | 0-shot, CoT | accuracy |
| HRM8k | Korean | 0-shot, CoT | exact-match |
| **Instruction Following and Chat** | | | |
| IFEval | English | 0-shot | strict-average |
| koIFEval | Korean | 0-shot | strict-average |
| MT-Bench | English | LLM-as-a-judge (gpt-4o) | LLM score |
| KO-MT-Bench | Korean | LLM-as-a-judge (gpt-4o) | LLM score |
| systemIFEval | English | 0-shot | strict-average |

  • *Note that koIFEval, systemIFEval, and KoRuler are our in-house evaluation benchmarks adapted for Korean to better assess model capabilities in Korean language tasks.
  • **Note that MT-Bench, KO-MT-Bench, and LogicKor use a 10-point scale.
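
Most of the knowledge and math benchmarks above are scored by exact match on the model's extracted final answer. A minimal sketch of that scoring step, assuming completions end with an "Answer: ..." line; the actual harness and its extraction rules are not published here:

import re

def extract_final_answer(completion: str) -> str:
    """Return the text after the last 'Answer:' marker, or the whole completion."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else completion.strip()

def exact_match(completion: str, gold: str) -> bool:
    """Case- and whitespace-insensitive exact match on the extracted answer."""
    return extract_final_answer(completion).lower() == gold.strip().lower()

def accuracy(completions, golds):
    """Fraction of examples whose extracted answer matches the reference."""
    return sum(exact_match(c, g) for c, g in zip(completions, golds)) / len(golds)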

Benchmark Results

Models compared:

  • Tri-21B: Our flagship 21B parameter model
  • Qwen3-32B: Qwen's 32B parameter model
  • Gemma3-IT-27B: Google's Gemma 3 instruction-tuned 27B model

General Reasoning and Factuality

| Benchmark | Tri-21B | Qwen3-32B | Gemma3-IT-27B |
| --- | --- | --- | --- |
| HAERAE | 86.16 | 71.67 | 78.09 |
| KoBEST | 85.92 | 83.39 | 87.66 |
| CLIcK | 72.32 | 66.89 | 67.54 |
| KMMLU | 61.89 (69.90) | 61.73 (67.55) | 55.03 (60.61) |
| MMLU | 77.62 (85.02) | 81.86 (84.46) | 77.42 (84.09) |
| MMLU-Pro | 64.74 | 70.53 | 64.26 |
| Global-MMLU-Lite-ja | 70.25 | 77.00 | 72.00 |

Values in parentheses are the 0-shot, CoT scores.

Coding

| Benchmark | Tri-21B | Qwen3-32B | Gemma3-IT-27B |
| --- | --- | --- | --- |
| HumanEval | 75.61 | 74.39 | 87.80 |
| MBPPPlus | 73.02 | 74.40 | 84.92 |

Mathematical Reasoning

| Benchmark | Tri-21B | Qwen3-32B | Gemma3-IT-27B |
| --- | --- | --- | --- |
| GSM8k | 87.95 | 86.66 | 90.52 |
| MATH | 77.60 | 81.40 | 85.00 |
| GPQA | 39.73 | 41.07 | 37.95 |
| GPQA-Diamond | 44.95 | 54.04 | 44.44 |
| HRM8k | 56.70 | 66.24 | 63.90 |

Instruction Following and Chat

| Benchmark | Tri-21B | Qwen3-32B | Gemma3-IT-27B |
| --- | --- | --- | --- |
| IFEval | 80.75 | 86.08 | 80.78 |
| koIFEval | 66.51 | 62.93 | 69.24 |
| MT-Bench | 8.21 | 8.52 | 8.53 |
| KO-MT-Bench | 7.79 | 8.47 | 8.46 |
| systemIFEval | 77.40 | 77.92 | 77.94 |

Base Model Evaluation

The following table shows the performance of Tri-21B base model (before instruction tuning) on key benchmarks:

| Benchmark | Tri-21B Base |
| --- | --- |
| MMLU | 76.99 |
| KMMLU | 62.37 |
| KoBEST | 85.07 |
| BBH | 77.19 |
| GSM8K | 70.36 |
| MBPPPlus | 75.40 |

Limitations

  • Language Support: The models are optimized for English, Korean, and Japanese. Usage with other languages may result in degraded performance.
  • Knowledge Cutoff: The model's information is limited to data available up to February 2025.

License

This model repository is licensed under the Trillion License.

Contact

For inquiries, please contact: info@trillionlabs.co
