# Colonel Blotto: Advanced RL + LLM System for NeurIPS 2025
This repository contains trained models for the Colonel Blotto game, targeting the NeurIPS 2025 MindGames workshop. The system combines cutting-edge reinforcement learning with large language model fine-tuning.
## Model Overview
This is an advanced system that achieves strong performance on Colonel Blotto through:
- Graph Neural Networks for game state representation
- FiLM layers for fast opponent adaptation
- Meta-learning for strategy portfolios
- LLM fine-tuning (SFT + DPO) for strategic reasoning
- Distillation from LLMs back to efficient RL policies
### Game Configuration
- Fields: 3
- Units per round: 20
- Rounds per game: 5
- Training episodes: N/A
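With 3 fields and 20 units, a single-round action is any split of the 20 units across the fields, giving C(22, 2) = 231 discrete actions. The sketch below shows one way such an action table can be enumerated; the actual Phase A generation code is not part of this card, so treat the ordering as illustrative.

```python
def enumerate_allocations(fields=3, units=20):
    """List every way to split `units` across `fields` battlefields."""
    actions = []

    def recurse(prefix, remaining, fields_left):
        if fields_left == 1:
            actions.append(prefix + [remaining])
            return
        for x in range(remaining + 1):
            recurse(prefix + [x], remaining - x, fields_left - 1)

    recurse([], units, fields)
    return actions

actions = enumerate_allocations()
print(len(actions))  # 231 discrete actions for F=3, U=20
```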
## Performance Results
### Against Scripted Opponents
Overall Win Rate: 0.00%
### LLM Elo Ratings
| Model | Elo Rating |
|---|---|
## Architecture
### Policy Network
The core policy network uses a sophisticated architecture:
**Graph Encoder**: Multi-layer Graph Attention Networks (GAT)
- Heterogeneous nodes: field nodes, round nodes, summary node
- Multi-head attention with 6 heads
- 3 layers of message passing
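The heterogeneous graph schema itself lives in the training code. As a rough sketch only, assuming `torch_geometric` and a flattened node set (field, round, and summary nodes sharing one feature space), a 3-layer, 6-head GAT stack looks like this; the feature sizes are illustrative, not the trained values.

```python
import torch
from torch_geometric.nn import GATConv

class GraphEncoder(torch.nn.Module):
    """Illustrative 3-layer GAT encoder over field/round/summary nodes."""
    def __init__(self, in_dim=8, hidden=64, heads=6, layers=3):
        super().__init__()
        self.convs = torch.nn.ModuleList()
        dim = in_dim
        for _ in range(layers):
            # concat=False averages the 6 heads, keeping a constant hidden size
            self.convs.append(GATConv(dim, hidden, heads=heads, concat=False))
            dim = hidden

    def forward(self, x, edge_index):
        for conv in self.convs:
            x = torch.relu(conv(x, edge_index))
        return x  # one embedding per node; read out the summary node downstream

# Toy graph: field nodes 0-2 feed a round node 3, which feeds a summary node 4
x = torch.randn(5, 8)
edge_index = torch.tensor([[0, 1, 2, 3], [3, 3, 3, 4]])
node_embeddings = GraphEncoder()(x, edge_index)
```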
**Opponent Encoder**: MLP-based encoder for opponent modeling
- Processes opponent history
- Learns behavioral patterns
**FiLM Layers**: Feature-wise Linear Modulation
- Fast adaptation to opponent behavior
- Conditioned on opponent encoding
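A minimal sketch of this opponent-conditioning path (the history length of 30 = 10 rounds x 3 fields and the 64-dim hidden size are illustrative assumptions):

```python
import torch
import torch.nn as nn

class OpponentEncoder(nn.Module):
    """Illustrative MLP over a flattened opponent allocation history."""
    def __init__(self, hist_dim=30, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hist_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, history):
        return self.net(history)

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift the game features
    with gamma/beta predicted from the opponent embedding."""
    def __init__(self, feat_dim=64, cond_dim=64):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, feats, cond):
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return (1 + gamma) * feats + beta

opponent_emb = OpponentEncoder()(torch.randn(1, 30))
modulated = FiLM()(torch.randn(1, 64), opponent_emb)
```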
**Portfolio Head**: Multi-strategy selection
- 6 specialist strategy heads
- Soft attention-based mixing
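A sketch of the soft-attention mixture over the 6 specialist heads (dimensions again illustrative; `n_actions=231` matches the F=3, U=20 action space):

```python
import torch
import torch.nn as nn

class PortfolioHead(nn.Module):
    """Illustrative mixture of 6 specialist action heads via soft attention."""
    def __init__(self, feat_dim=64, n_actions=231, n_strat=6):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(feat_dim, n_actions) for _ in range(n_strat)])
        self.attn = nn.Linear(feat_dim, n_strat)

    def forward(self, feats):
        weights = torch.softmax(self.attn(feats), dim=-1)            # (B, n_strat)
        logits = torch.stack([h(feats) for h in self.heads], dim=1)  # (B, n_strat, n_actions)
        return (weights.unsqueeze(-1) * logits).sum(dim=1)           # mixed action logits

mixed_logits = PortfolioHead()(torch.randn(2, 64))
```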
### Training Pipeline
The models were trained through a comprehensive 7-phase pipeline:
- Phase A: Environment setup and action space generation
- Phase B: PPO training against diverse scripted opponents
- Phase C: Preference dataset generation (LLM vs LLM rollouts)
- Phase D: Supervised Fine-Tuning (SFT) of base LLM
- Phase E: Direct Preference Optimization (DPO)
- Phase F: Knowledge distillation from LLM to policy (sketched below)
- Phase G: PPO refinement after distillation
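The Phase F distillation script is not included with this card, but its core objective can be sketched as a KL term that pulls the policy's action distribution toward an action distribution derived from the LLM; the teacher targets and temperature below are placeholders.

```python
import torch
import torch.nn.functional as F

def distill_loss(policy_logits, llm_action_probs, temperature=1.0):
    """KL divergence from the LLM-derived (teacher) action distribution
    to the RL policy's (student) action distribution."""
    log_student = F.log_softmax(policy_logits / temperature, dim=-1)
    return F.kl_div(log_student, llm_action_probs, reduction="batchmean")

# Toy batch: teacher probabilities over the 231 actions for 4 game states
teacher_probs = torch.softmax(torch.randn(4, 231), dim=-1)
student_logits = torch.randn(4, 231, requires_grad=True)
distill_loss(student_logits, teacher_probs).backward()
```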
## Repository Contents
### Policy Models
- `policy_models/policy_final.pt`: PyTorch checkpoint
- `policy_models/policy_after_distill.pt`: PyTorch checkpoint
- `policy_models/policy_after_ppo.pt`: PyTorch checkpoint
### Fine-tuned LLM Models
- `sft_model/`: SFT model (HuggingFace Transformers compatible)
- `dpo_model/`: DPO model (HuggingFace Transformers compatible)
### Configuration & Results
- `master_config.json`: Complete training configuration
- `battleground_eval.json`: Comprehensive evaluation results
- `eval_scripted_after_ppo.json`: Post-PPO evaluation
## Usage
### Loading Policy Model
```python
import json
import torch
from your_policy_module import PolicyNet  # placeholder for the policy implementation

# Load configuration
with open("master_config.json", "r") as f:
    config = json.load(f)

# Initialize policy
policy = PolicyNet(
    F=config["F"],
    n_actions=231,  # C(22, 2) allocations for F=3, U=20
    hidden=config["hidden"],
    gnn_layers=config["gnn_layers"],
    gnn_heads=config["gnn_heads"],
    n_strat=config["n_strat"],
)

# Load trained weights
policy.load_state_dict(torch.load("policy_models/policy_final.pt"))
policy.eval()
```
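The policy produces logits over the 231 action indices, so a sampled index still has to be mapped back to a per-field allocation. The decoder below assumes the lexicographic ordering from the enumeration sketch earlier in this card; if the Phase A action table uses a different ordering, load and index that table instead.

```python
from math import comb

def count_allocations(units, fields):
    """Stars-and-bars count of ways to split `units` over `fields`."""
    return comb(units + fields - 1, fields - 1)

def index_to_allocation(idx, fields=3, units=20):
    """Map a flat action index to a per-field allocation (lexicographic order)."""
    alloc, remaining = [], units
    for f in range(fields - 1):
        for x in range(remaining + 1):
            block = count_allocations(remaining - x, fields - f - 1)
            if idx < block:
                alloc.append(x)
                remaining -= x
                break
            idx -= block
    alloc.append(remaining)
    return alloc

print(index_to_allocation(0))    # [0, 0, 20]
print(index_to_allocation(230))  # [20, 0, 0]
```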
### Loading Fine-tuned LLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load SFT or DPO model
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference (this prompt is a placeholder; match the prompt format
# used during SFT/DPO training)
prompt = "You command 20 units across 3 fields. State your allocation."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
```
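The generated text still has to be turned into a legal move. The exact prompt/response format used during SFT/DPO is not documented here, so the parser below simply assumes the reply contains three non-negative integers summing to 20.

```python
import re

def parse_allocation(text, fields=3, units=20):
    """Pull the first `fields` integers from the reply and validate them."""
    numbers = [int(n) for n in re.findall(r"\d+", text)][:fields]
    if len(numbers) == fields and min(numbers) >= 0 and sum(numbers) == units:
        return numbers
    return None  # malformed reply: fall back to a default split or resample

# Decode only the newly generated tokens, not the prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
reply = tokenizer.decode(new_tokens, skip_special_tokens=True)
allocation = parse_allocation(reply)
```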
## Research Context
This work targets the NeurIPS 2025 MindGames Workshop with a focus on:
- Strategic game AI beyond traditional game-theoretic approaches
- Hybrid systems combining neural RL and LLM reasoning
- Fast adaptation to diverse opponents through meta-learning
- Efficient deployment via distillation
### Key Innovations
- Heterogeneous Graph Representation: Novel graph structure for Blotto game states
- Ground-truth Counterfactual Learning: Exploiting game determinism (see the sketch after this list)
- Multi-scale Representation: Field-level, round-level, and game-level embeddings
- LLM-to-RL Distillation: Transferring strategic reasoning to efficient policies
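On the counterfactual point: a Blotto round is deterministic given both sides' allocations, so every alternative action can be scored exactly against the opponent's observed allocation rather than estimated. A minimal sketch, assuming standard per-field majority scoring (the reward shaping used in training may differ):

```python
def round_score(mine, theirs):
    """+1 per field won, -1 per field lost, 0 for tied fields."""
    return sum((m > t) - (m < t) for m, t in zip(mine, theirs))

def counterfactual_scores(actions, observed_opponent):
    """Exact score of every alternative allocation against the observed one."""
    return [round_score(a, observed_opponent) for a in actions]

# `enumerate_allocations` is defined in the action-space sketch above
scores = counterfactual_scores(enumerate_allocations(), [7, 7, 6])
best = max(range(len(scores)), key=scores.__getitem__)
```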
## Citation
If you use this work, please cite:
```bibtex
@misc{colonelblotto2025neurips,
  title={Advanced Reinforcement Learning System for Colonel Blotto Games},
  author={{NeurIPS 2025 MindGames Submission}},
  year={2025},
  publisher={HuggingFace Hub},
  howpublished={\url{https://huggingface.co/{repo_id}}},
}
```
## License
MIT License - See LICENSE file for details
## Acknowledgments
- Built for NeurIPS 2025 MindGames Workshop
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU
Generated: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
Uploaded from: Notebook Environment