Colonel Blotto: Graph-Based RL with LLM-Guided Preference Distillation


This repository contains trained Colonel Blotto agents developed for the NeurIPS 2025 MindGames Workshop.
The system integrates a compact graph-based reinforcement learning policy with LLM-guided preference learning and distillation, enabling improved strategic adaptation without increasing policy capacity.


Overview

The approach combines:

  • Graph Attention Networks for structured game-state encoding
  • Proximal Policy Optimization (PPO) as the core learning algorithm
  • FiLM-based opponent adaptation for fast response to opponent behavior
  • Rollout-grounded preference learning using two large language models
  • Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) for teacher alignment
  • Knowledge distillation from the aligned teacher into an efficient policy

The goal is not to replace RL with language models, but to inject strategic priors learned by LLMs back into a lightweight, fast policy suitable for competitive play.


Game Configuration

  • Game: Colonel Blotto
  • Battlefields: 3
  • Units per round: 20
  • Rounds per game: 5
  • Action space size: 231 valid allocations (enumerated in the sketch after this list)
  • Evaluation protocol: Fixed scripted and adaptive opponent pool
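
For reference, 231 is the number of non-negative allocations of 20 units across 3 battlefields (stars and bars: C(22, 2) = 231). A quick sketch to enumerate them:

```python
UNITS = 20

# Every allocation (a, b, c) with a + b + c = 20 and a, b, c >= 0.
actions = [
    (a, b, UNITS - a - b)
    for a in range(UNITS + 1)
    for b in range(UNITS - a + 1)
]

assert len(actions) == 231  # C(20 + 3 - 1, 3 - 1) = C(22, 2)
```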

Policy Architecture

Graph-Based State Encoder

  • Heterogeneous graph with 25–40 nodes
  • Node types include:
    • Battlefield nodes
    • Recent round summary nodes
    • Global state node
  • Node feature dimension: 32
  • Encoder:
    • 3 Graph Attention layers
    • 6 attention heads
    • Hidden size 192
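
A minimal sketch of an encoder with these dimensions, using torch_geometric's GATConv on a homogeneous graph for brevity. The released policy operates on the heterogeneous graph described above; the layer layout and pooling choice here are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool

class GraphStateEncoder(nn.Module):
    """3 GAT layers over 32-dim node features; 6 heads x 32 channels = 192 hidden."""

    def __init__(self, in_dim: int = 32, hidden: int = 192, heads: int = 6):
        super().__init__()
        per_head = hidden // heads  # 32 channels per head, concatenated back to 192
        self.layers = nn.ModuleList([
            GATConv(in_dim, per_head, heads=heads),
            GATConv(hidden, per_head, heads=heads),
            GATConv(hidden, per_head, heads=heads),
        ])

    def forward(self, x, edge_index, batch):
        for conv in self.layers:
            x = torch.relu(conv(x, edge_index))
        # Pool the 25-40 node embeddings of each game state into one 192-dim vector.
        return global_mean_pool(x, batch)
```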

Opponent Modeling and Adaptation

  • Opponent history encoded via a lightweight MLP
  • FiLM adaptation layers modulate policy activations based on opponent embedding
  • Enables rapid adjustment to non-stationary strategies
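
A minimal sketch of the FiLM modulation, assuming the opponent history has been flattened into a fixed-size vector. The 192-dim feature size matches the encoder; the other dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FiLMAdapter(nn.Module):
    """Scale and shift policy features using an opponent embedding (FiLM)."""

    def __init__(self, opp_dim: int = 64, feat_dim: int = 192):
        super().__init__()
        # Lightweight MLP over recent opponent allocations -> opponent embedding.
        self.opp_encoder = nn.Sequential(
            nn.Linear(opp_dim, 128), nn.ReLU(), nn.Linear(128, 64)
        )
        # One (gamma, beta) pair per feature channel.
        self.film = nn.Linear(64, 2 * feat_dim)

    def forward(self, state_feat: torch.Tensor, opp_history: torch.Tensor) -> torch.Tensor:
        z = self.opp_encoder(opp_history)
        gamma, beta = self.film(z).chunk(2, dim=-1)
        return (1.0 + gamma) * state_feat + beta  # modulated features, same shape
```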

Action Head

  • Portfolio-based action head with 6 latent strategies
  • Strategies mixed via learned attention
  • Total policy parameters: ~6.8M
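
One common way to realize such a portfolio head, shown as a sketch (the exact mixing mechanism in the released policy may differ): each latent strategy emits its own logits over the 231 allocations, and a learned attention over strategies mixes the resulting distributions.

```python
import torch
import torch.nn as nn

class PortfolioActionHead(nn.Module):
    """Mix K latent strategy heads into one distribution over 231 allocations."""

    def __init__(self, feat_dim: int = 192, n_strategies: int = 6, n_actions: int = 231):
        super().__init__()
        self.strategy_logits = nn.Linear(feat_dim, n_strategies * n_actions)
        self.mixer = nn.Linear(feat_dim, n_strategies)  # attention weights over strategies
        self.n_strategies, self.n_actions = n_strategies, n_actions

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        B = feat.shape[0]
        per_strategy = self.strategy_logits(feat).view(B, self.n_strategies, self.n_actions)
        weights = torch.softmax(self.mixer(feat), dim=-1)    # [B, 6]
        probs = torch.softmax(per_strategy, dim=-1)          # [B, 6, 231]
        mixed = torch.einsum("bk,bka->ba", weights, probs)   # [B, 231]
        return torch.log(mixed + 1e-8)  # log-probabilities for sampling / PPO
```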

Training Pipeline

Training follows a multi-stage curriculum:

  1. Graph PPO Pretraining

    • PPO with clip ratio 0.2
    • Discount factor γ = 0.99
    • GAE λ = 0.95
    • Trained against a diverse scripted opponent pool
  2. Preference Generation via Rollouts (see the code sketch after this pipeline)

    • ~800 intermediate states sampled
    • Candidate actions proposed by:
      • Llama 3.1 Instruct
      • Qwen 2.5 Instruct
    • Each proposal evaluated with 4 stochastic rollouts
    • Higher-return actions labeled preferred
    • ~2,300 preference pairs generated
  3. Teacher Alignment

    • Supervised Fine Tuning on chosen actions
    • Direct Preference Optimization using frozen reference model
  4. Policy Distillation

    • Aligned teacher generates state-to-action labels
    • Graph policy trained via cross-entropy imitation
  5. Final PPO Refinement

    • PPO resumes using environment rewards
    • Stabilizes behavior after distillation
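
A condensed sketch of the preference-generation step (stage 2). All helpers here are hypothetical stand-ins: `sampled_states` for the ~800 sampled states, `propose_allocation` for the LLM proposal call, and `rollout_return` for one stochastic rollout in the game simulator; the exact pairing rule is an assumption.

```python
N_ROLLOUTS = 4  # stochastic rollouts per candidate action

def score(state, action):
    """Average return of an action over N_ROLLOUTS stochastic continuations."""
    return sum(rollout_return(state, action) for _ in range(N_ROLLOUTS)) / N_ROLLOUTS

preference_pairs = []
for state in sampled_states:  # ~800 intermediate game states
    # Each LLM proposes one or more candidate allocations for this state.
    candidates = (propose_allocation("llama-3.1-instruct", state)
                  + propose_allocation("qwen-2.5-instruct", state))
    scored = sorted(((score(state, a), a) for a in candidates),
                    key=lambda t: t[0], reverse=True)
    # Pair the best candidate against each strictly worse one.
    best_score, best_action = scored[0]
    for other_score, other_action in scored[1:]:
        if best_score > other_score:
            preference_pairs.append(
                {"prompt": state, "chosen": best_action, "rejected": other_action}
            )
```

The resulting pairs feed SFT on the chosen actions (stage 3) and DPO against a frozen reference model.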

Evaluation Results

Evaluation uses 1,000 games against a mixture of scripted and adaptive opponents.

| Agent              | Win Rate    | Risk Metric          | Value |
|--------------------|-------------|----------------------|-------|
| PPO only           | 58.4% ± 2.1 | Allocation collapse  | 14.2% |
| PPO + Distillation | 67.9% ± 1.8 | Allocation collapse  | 8.8%  |
| Full curriculum    | 78.4%       | Exploitability proxy | 0.48  |

  • Allocation collapse: fraction of rounds placing more than 60% of units on a single battlefield
  • Distillation yields a +9.5-point win-rate gain over the PPO-only baseline
  • The full curriculum yields a +20-point gain with reduced over-specialization

These improvements arise from risk calibration and opponent-aware adaptation, not brute-force exploitation.
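
For reference, the allocation-collapse metric can be computed as below (the 60% threshold matches the definition above; the per-round allocation format is an assumption):

```python
def allocation_collapse_rate(rounds, threshold=0.6):
    """Fraction of rounds placing more than `threshold` of all units on one battlefield."""
    collapsed = sum(1 for alloc in rounds if max(alloc) > threshold * sum(alloc))
    return collapsed / len(rounds)

# Example: the middle round puts 15/20 units (75%) on a single field.
print(allocation_collapse_rate([(7, 7, 6), (15, 3, 2), (8, 6, 6)]))  # ~0.33
```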


Repository Contents

Policy Checkpoints

  • policy_models/policy_after_ppo.pt
  • policy_models/policy_after_distill.pt
  • policy_models/policy_final.pt

LLM Teacher Models

  • sft_model/ – supervised fine-tuned model
  • dpo_model/ – preference-aligned model

Configuration and Logs

  • master_config.json – training configuration
  • battleground_eval.json – evaluation summaries

Usage

Load Policy

```python
import torch
from policy import GraphPolicy

# Constructor arguments are omitted here; see master_config.json for the trained settings.
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_final.pt", map_location="cpu"))
policy.eval()
```
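
Acting with the loaded policy then looks roughly like the sketch below; `build_graph` and the forward signature are assumptions, since they depend on how GraphPolicy is constructed.

```python
# Hypothetical inference sketch: encode the current game state as a graph,
# query the policy, and sample one of the 231 allocations.
with torch.no_grad():
    graph = build_graph(game_state, opponent_history)  # hypothetical helper
    logits = policy(graph)                             # [1, 231] action logits
    action = torch.distributions.Categorical(logits=logits).sample().item()
```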


Load Fine-tuned LLM

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT teacher (swap in "./dpo_model" for the preference-aligned version)
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Example prompt (illustrative); the training prompts encode the Blotto game state
prompt = "Allocate 20 units across 3 battlefields. Respond with three numbers."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

🎓 Research Context

This work targets the NeurIPS 2025 MindGames Workshop, with a focus on demonstrating that:

  • Language models function effectively as strategic prior generators when grounded by rollouts
  • Graph-based representations enable cross-strategy generalization under compact policies
  • Distillation transfers high-level reasoning into fast, deployable agents

Key Innovations

  1. Heterogeneous Graph Representation: Novel graph structure for Blotto game states
  2. Ground-truth Counterfactual Learning: Exploiting game determinism
  3. Multi-scale Representation: Field-level, round-level, and game-level embeddings
  4. LLM-to-RL Distillation: Transferring strategic reasoning to efficient policies

📄 License

MIT License - See LICENSE file for details

πŸ™ Acknowledgments

  • Built for NeurIPS 2025 MindGames Workshop
  • Uses PyTorch, HuggingFace Transformers, and PEFT
  • Training infrastructure: NVIDIA H200 GPU

