Codenames: Graph-Based RL with LLM-Guided Preference Distillation
This repository contains trained Codenames agents developed for the NeurIPS 2025 MindGames Workshop.
The system combines a structured graph-based reinforcement learning policy with LLM-guided preference learning and distillation, targeting improved risk calibration and decision robustness.
Overview
The approach integrates:
- Graph Neural Networks for structured board and history representation
- Proximal Policy Optimization (PPO) for policy learning
- Role-conditioned decoding for spymaster and operative behaviors
- Rollout-grounded preference learning using large language models
- Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) for teacher alignment
- Knowledge distillation from the aligned teacher back into a compact policy
The objective is to improve strategic consistency and reduce catastrophic failures such as assassin selections, while maintaining efficient inference suitable for interactive play.
Game Configuration
- Game: Codenames
- Board size: 25 words
- Roles: Spymaster and Operative
- Evaluation games: 600 full episodes
- Opponents: Scripted baseline agents
Policy Architecture
Graph-Based State Encoder
- Heterogeneous graph with 30–40 nodes
- Node types include:
- Word nodes with semantic and state features
- Historical clue nodes
- Global summary node
- Node feature dimension: 35
- Encoder (a minimal sketch follows this list):
- 3 Graph Attention layers
- 6 attention heads
- Hidden size 192
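A minimal sketch of such an encoder, assuming a PyTorch Geometric setup; the class name `GraphEncoder` and the merging of node types into a single node set are illustrative simplifications, not the repository's actual API.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class GraphEncoder(nn.Module):
    """Stack of GAT layers over the board graph (node types merged for brevity)."""
    def __init__(self, node_dim=35, hidden=192, heads=6, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = node_dim
        for _ in range(num_layers):
            # Each layer splits the hidden size across attention heads and concatenates them back
            self.layers.append(GATConv(in_dim, hidden // heads, heads=heads))
            in_dim = hidden

    def forward(self, x, edge_index):
        # x: [num_nodes, node_dim] node features; edge_index: [2, num_edges]
        for conv in self.layers:
            x = torch.relu(conv(x, edge_index))
        return x  # per-node embeddings, [num_nodes, hidden]
```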
Role Conditioning
- Shared policy trunk
- Role-conditioned action decoding (sketched after this list):
- Clue generation and constraint handling for spymaster
- Guess selection and stopping decisions for operative
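A hedged sketch of role-conditioned decoding on top of the shared trunk; the head shapes, clue vocabulary size, and mean-pooling step are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RoleConditionedPolicy(nn.Module):
    def __init__(self, hidden=192, clue_vocab=1000, max_count=9, board_size=25):
        super().__init__()
        self.role_embed = nn.Embedding(2, hidden)  # 0 = spymaster, 1 = operative
        # Spymaster head scores candidate clue words and counts; operative head scores guesses or stopping
        self.spymaster_head = nn.Linear(hidden, clue_vocab + max_count)
        self.operative_head = nn.Linear(hidden, board_size + 1)

    def forward(self, node_embeddings, role_id: int):
        # Pool per-node embeddings into a board summary, then condition on the current role
        summary = node_embeddings.mean(dim=0) + self.role_embed(torch.tensor(role_id))
        head = self.spymaster_head if role_id == 0 else self.operative_head
        return head(summary)  # action logits for the active role
```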
Model Size
- Total parameters: ~6.8M
- Enables fast inference under competitive constraints
Training Pipeline
Training follows a multi-stage curriculum:
Graph PPO Pretraining
- PPO with clip ratio 0.2 (clipped objective sketched after this list)
- Discount factor Ξ³ = 0.99
- GAE Ξ» = 0.95
- Trained against scripted Codenames agents
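For reference, a minimal sketch of the clipped PPO objective and GAE with the hyperparameters listed above; tensor shapes and function names are illustrative.

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip=0.2):
    # Probability ratio between the updated and behavior policies
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * advantages
    # Pessimistic (min) objective, negated for gradient descent
    return -torch.min(unclipped, clipped).mean()

def gae(rewards, values, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation over one finished episode (lists of floats)
    advantages, running = [], 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages.append(running)
    return list(reversed(advantages))
```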
Preference Generation via Rollouts
- ~800 intermediate states sampled
- Candidate actions proposed by:
- Llama 3.1 Instruct
- Qwen 2.5 Instruct
- Each proposal evaluated using multiple stochastic rollouts
- Higher-return actions labeled as preferred (see the sketch after this list)
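A sketch of the rollout-grounded labeling step under stated assumptions: `simulate(state, action)` is a hypothetical helper that plays the candidate action and finishes the game with stochastic policies, returning the episode return.

```python
import statistics

def label_preference(state, action_a, action_b, simulate, n_rollouts=8):
    """Compare two LLM-proposed actions by mean rollout return; return (chosen, rejected)."""
    return_a = statistics.mean(simulate(state, action_a) for _ in range(n_rollouts))
    return_b = statistics.mean(simulate(state, action_b) for _ in range(n_rollouts))
    return (action_a, action_b) if return_a >= return_b else (action_b, action_a)
```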
Teacher Alignment
- Supervised fine-tuning on chosen actions
- Direct Preference Optimization (DPO) against a frozen reference model (loss sketched below)
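A minimal sketch of the DPO objective used for teacher alignment, assuming per-sequence log-probabilities from the trainable policy and the frozen reference model; the value of `beta` is illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # How much more the policy prefers each response than the frozen reference does
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Push the margin between chosen and rejected log-ratios to be positive
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```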
Policy Distillation
- Aligned teacher generates action labels from (state, role) inputs
- Graph policy trained via cross-entropy imitation (see the sketch below)
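A hedged sketch of one distillation update: the graph policy imitates teacher action labels with cross-entropy. The batch layout and the policy call signature are assumptions for illustration.

```python
import torch.nn.functional as F

def distill_step(policy, optimizer, batch):
    # batch holds graph inputs, role ids, and the teacher-chosen action index per state
    logits = policy(batch["graph"], batch["role_id"])        # [batch, num_actions] action logits
    loss = F.cross_entropy(logits, batch["teacher_action"])  # imitate the aligned teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```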
PPO Refinement
- PPO resumes using environment rewards
- Stabilizes policy after distillation
Evaluation Results
Evaluation uses 600 full games against scripted opponents.
| Agent | Win Rate | Assassin Rate |
|---|---|---|
| Graph PPO | 44.8% | 12.6% |
| PPO + Distillation | 52.9% | 6.9% |
- Distillation yields an 8.1-point absolute win-rate improvement
- Assassin-triggered losses are reduced by 45%
- Improvements arise primarily from better risk calibration, not increased guessing aggressiveness
Repository Contents
Policy Checkpoints
- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`
Teacher Models
- `sft_model/`: supervised fine-tuned teacher
- `dpo_model/`: preference-aligned teacher
Configuration and Logs
- `master_config.json`
- `evaluation_results.json`
Usage
Load Policy
```python
import torch
from policy import GraphPolicy

# Constructor arguments are elided here; see master_config.json in the repository
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_after_distill.pt"))
policy.eval()
```
Load Fine-tuned LLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT teacher; use "./dpo_model" for the preference-aligned teacher
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Inference: `prompt` should describe the current board state and role (placeholder shown)
prompt = "..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Research Context
This work targets the NeurIPS 2025 MindGames Workshop and is built around three claims:
- Language models provide useful strategic priors when grounded by rollouts
- Graph-based representations enable structured reasoning in semantic games
- Distillation transfers high-level reasoning into efficient, deployable agents
Key Innovations
- Heterogeneous Graph Representation: word, clue-history, and global-summary nodes encode the Codenames board and game history
- Rollout-Grounded Preference Learning: LLM-proposed actions are scored by stochastic rollout returns rather than model judgments alone
- Role-Conditioned Decoding: a shared trunk with separate spymaster and operative action heads
- LLM-to-RL Distillation: transferring strategic reasoning into an efficient, deployable graph policy
License
MIT License - See LICENSE file for details
Acknowledgments
- Built for NeurIPS 2025 MindGames Workshop
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU