Colonel Blotto: Graph-Based RL with LLM-Guided Preference Distillation


This repository contains trained Colonel Blotto agents developed for the NeurIPS 2025 MindGames Workshop.
The system integrates a compact graph-based reinforcement learning policy with LLM-guided preference learning and distillation, enabling improved strategic adaptation without increasing policy capacity.


Overview

The approach combines:

  • Graph Attention Networks for structured game-state encoding
  • Proximal Policy Optimization (PPO) as the core learning algorithm
  • FiLM-based opponent adaptation for fast response to opponent behavior
  • Rollout-grounded preference learning using two large language models
  • Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) for teacher alignment
  • Knowledge distillation from the aligned teacher into an efficient policy

The goal is not to replace RL with language models, but to inject strategic priors learned by LLMs back into a lightweight, fast policy suitable for competitive play.


Game Configuration

  • Game: Colonel Blotto
  • Battlefields: 3
  • Units per round: 20
  • Rounds per game: 5
  • Action space size: 231 valid allocations (enumerated in the sketch after this list)
  • Evaluation protocol: Fixed scripted and adaptive opponent pool
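
For reference, 231 is the number of non-negative allocations of 20 units across 3 battlefields (stars and bars: C(22, 2) = 231). A quick sketch to enumerate them:

```python
UNITS = 20

# Every allocation (a, b, c) with a + b + c = 20 and a, b, c >= 0.
actions = [
    (a, b, UNITS - a - b)
    for a in range(UNITS + 1)
    for b in range(UNITS - a + 1)
]

assert len(actions) == 231  # C(20 + 3 - 1, 3 - 1) = C(22, 2)
```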

Policy Architecture

Graph-Based State Encoder

  • Heterogeneous graph with 25–40 nodes
  • Node types include:
    • Battlefield nodes
    • Recent round summary nodes
    • Global state node
  • Node feature dimension: 32
  • Encoder:
    • 3 Graph Attention layers
    • 6 attention heads
    • Hidden size 192
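
A minimal sketch of an encoder with these dimensions, using torch_geometric's GATConv on a homogeneous graph for brevity. The released policy operates on the heterogeneous graph described above; the layer layout and pooling choice here are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool

class GraphStateEncoder(nn.Module):
    """3 GAT layers over 32-dim node features; 6 heads x 32 channels = 192 hidden."""

    def __init__(self, in_dim: int = 32, hidden: int = 192, heads: int = 6):
        super().__init__()
        per_head = hidden // heads  # 32 channels per head, concatenated back to 192
        self.layers = nn.ModuleList([
            GATConv(in_dim, per_head, heads=heads),
            GATConv(hidden, per_head, heads=heads),
            GATConv(hidden, per_head, heads=heads),
        ])

    def forward(self, x, edge_index, batch):
        for conv in self.layers:
            x = torch.relu(conv(x, edge_index))
        # Pool the 25-40 node embeddings of each game state into one 192-dim vector.
        return global_mean_pool(x, batch)
```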

Opponent Modeling and Adaptation

  • Opponent history encoded via a lightweight MLP
  • FiLM adaptation layers modulate policy activations based on opponent embedding
  • Enables rapid adjustment to non-stationary strategies
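
A minimal sketch of the FiLM modulation, assuming the opponent history has been flattened into a fixed-size vector. The 192-dim feature size matches the encoder; the other dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FiLMAdapter(nn.Module):
    """Scale and shift policy features using an opponent embedding (FiLM)."""

    def __init__(self, opp_dim: int = 64, feat_dim: int = 192):
        super().__init__()
        # Lightweight MLP over recent opponent allocations -> opponent embedding.
        self.opp_encoder = nn.Sequential(
            nn.Linear(opp_dim, 128), nn.ReLU(), nn.Linear(128, 64)
        )
        # One (gamma, beta) pair per feature channel.
        self.film = nn.Linear(64, 2 * feat_dim)

    def forward(self, state_feat: torch.Tensor, opp_history: torch.Tensor) -> torch.Tensor:
        z = self.opp_encoder(opp_history)
        gamma, beta = self.film(z).chunk(2, dim=-1)
        return (1.0 + gamma) * state_feat + beta  # modulated features, same shape
```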

Action Head

  • Portfolio-based action head with 6 latent strategies
  • Strategies mixed via learned attention
  • Total policy parameters: ~6.8M
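
One common way to realize such a portfolio head, shown as a sketch (the exact mixing mechanism in the released policy may differ): each latent strategy emits its own logits over the 231 allocations, and a learned attention over strategies mixes the resulting distributions.

```python
import torch
import torch.nn as nn

class PortfolioActionHead(nn.Module):
    """Mix K latent strategy heads into one distribution over 231 allocations."""

    def __init__(self, feat_dim: int = 192, n_strategies: int = 6, n_actions: int = 231):
        super().__init__()
        self.strategy_logits = nn.Linear(feat_dim, n_strategies * n_actions)
        self.mixer = nn.Linear(feat_dim, n_strategies)  # attention weights over strategies
        self.n_strategies, self.n_actions = n_strategies, n_actions

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        B = feat.shape[0]
        per_strategy = self.strategy_logits(feat).view(B, self.n_strategies, self.n_actions)
        weights = torch.softmax(self.mixer(feat), dim=-1)    # [B, 6]
        probs = torch.softmax(per_strategy, dim=-1)          # [B, 6, 231]
        mixed = torch.einsum("bk,bka->ba", weights, probs)   # [B, 231]
        return torch.log(mixed + 1e-8)  # log-probabilities for sampling / PPO
```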

Training Pipeline

Training follows a multi-stage curriculum:

  1. Graph PPO Pretraining

    • PPO with clip ratio 0.2
    • Discount factor γ = 0.99
    • GAE λ = 0.95
    • Trained against a diverse scripted opponent pool
  2. Preference Generation via Rollouts (see the code sketch after this pipeline)

    • ~800 intermediate states sampled
    • Candidate actions proposed by:
      • Llama 3.1 Instruct
      • Qwen 2.5 Instruct
    • Each proposal evaluated with 4 stochastic rollouts
    • Higher-return actions labeled preferred
    • ~2,300 preference pairs generated
  3. Teacher Alignment

    • Supervised Fine Tuning on chosen actions
    • Direct Preference Optimization using frozen reference model
  4. Policy Distillation

    • Aligned teacher generates state-to-action labels
    • Graph policy trained via cross-entropy imitation
  5. Final PPO Refinement

    • PPO resumes using environment rewards
    • Stabilizes behavior after distillation
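
A condensed sketch of the preference-generation step (stage 2). All helpers here are hypothetical stand-ins: `sampled_states` for the ~800 sampled states, `propose_allocation` for the LLM proposal call, and `rollout_return` for one stochastic rollout in the game simulator; the exact pairing rule is an assumption.

```python
N_ROLLOUTS = 4  # stochastic rollouts per candidate action

def score(state, action):
    """Average return of an action over N_ROLLOUTS stochastic continuations."""
    return sum(rollout_return(state, action) for _ in range(N_ROLLOUTS)) / N_ROLLOUTS

preference_pairs = []
for state in sampled_states:  # ~800 intermediate game states
    # Each LLM proposes one or more candidate allocations for this state.
    candidates = (propose_allocation("llama-3.1-instruct", state)
                  + propose_allocation("qwen-2.5-instruct", state))
    scored = sorted(((score(state, a), a) for a in candidates),
                    key=lambda t: t[0], reverse=True)
    # Pair the best candidate against each strictly worse one.
    best_score, best_action = scored[0]
    for other_score, other_action in scored[1:]:
        if best_score > other_score:
            preference_pairs.append(
                {"prompt": state, "chosen": best_action, "rejected": other_action}
            )
```

The resulting pairs feed SFT on the chosen actions (stage 3) and DPO against a frozen reference model.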

Evaluation Results

Evaluation uses 1,000 games against a mixture of scripted and adaptive opponents.

| Agent              | Win Rate    | Risk Metric          | Value |
|--------------------|-------------|----------------------|-------|
| PPO only           | 58.4% ± 2.1 | Allocation collapse  | 14.2% |
| PPO + Distillation | 67.9% ± 1.8 | Allocation collapse  | 8.8%  |
| Full curriculum    | 78.4%       | Exploitability proxy | 0.48  |

  • Allocation collapse: fraction of rounds placing more than 60% of units on a single battlefield
  • Distillation yields a +9.5-point win-rate gain over the PPO-only baseline
  • The full curriculum yields a +20-point gain with reduced over-specialization

These improvements arise from risk calibration and opponent-aware adaptation, not brute-force exploitation.
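
For reference, the allocation-collapse metric can be computed as below (the 60% threshold matches the definition above; the per-round allocation format is an assumption):

```python
def allocation_collapse_rate(rounds, threshold=0.6):
    """Fraction of rounds placing more than `threshold` of all units on one battlefield."""
    collapsed = sum(1 for alloc in rounds if max(alloc) > threshold * sum(alloc))
    return collapsed / len(rounds)

# Example: the middle round puts 15/20 units (75%) on a single field.
print(allocation_collapse_rate([(7, 7, 6), (15, 3, 2), (8, 6, 6)]))  # ~0.33
```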


Repository Contents

Policy Checkpoints

  • policy_models/policy_after_ppo.pt
  • policy_models/policy_after_distill.pt
  • policy_models/policy_final.pt

LLM Teacher Models

  • sft_model/ – supervised fine-tuned model
  • dpo_model/ – preference-aligned model

Configuration and Logs

  • master_config.json – training configuration
  • battleground_eval.json – evaluation summaries

Usage

Load Policy

```python
import torch
from policy import GraphPolicy

# Constructor arguments are omitted here; see master_config.json for the trained settings.
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_final.pt", map_location="cpu"))
policy.eval()
```
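
Acting with the loaded policy then looks roughly like the sketch below; `build_graph` and the forward signature are assumptions, since they depend on how GraphPolicy is constructed.

```python
# Hypothetical inference sketch: encode the current game state as a graph,
# query the policy, and sample one of the 231 allocations.
with torch.no_grad():
    graph = build_graph(game_state, opponent_history)  # hypothetical helper
    logits = policy(graph)                             # [1, 231] action logits
    action = torch.distributions.Categorical(logits=logits).sample().item()
```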


Load Fine-tuned LLM

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT teacher (swap in "./dpo_model" for the preference-aligned version)
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Example prompt (illustrative); the training prompts encode the Blotto game state
prompt = "Allocate 20 units across 3 battlefields. Respond with three numbers."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

🎓 Research Context

This work targets the NeurIPS 2025 MindGames Workshop, with a focus on demonstrating that:

  • Language models function effectively as strategic prior generators when grounded by rollouts
  • Graph-based representations enable cross-strategy generalization under compact policies
  • Distillation transfers high-level reasoning into fast, deployable agents

Key Innovations

  1. Heterogeneous Graph Representation: Novel graph structure for Blotto game states
  2. Ground-truth Counterfactual Learning: Exploiting game determinism
  3. Multi-scale Representation: Field-level, round-level, and game-level embeddings
  4. LLM-to-RL Distillation: Transferring strategic reasoning to efficient policies

📄 License

MIT License - See LICENSE file for details

πŸ™ Acknowledgments

  • Built for NeurIPS 2025 MindGames Workshop
  • Uses PyTorch, HuggingFace Transformers, and PEFT
  • Training infrastructure: NVIDIA H200 GPU

