Codenames: Graph-Based RL with LLM-Guided Preference Distillation
This repository contains trained Codenames agents developed for the NeurIPS 2025 MindGames Workshop.
The system combines a structured graph-based reinforcement learning policy with LLM-guided preference learning and distillation, targeting improved risk calibration and decision robustness.
Overview
The approach integrates:
- Graph Neural Networks for structured board and history representation
- Proximal Policy Optimization (PPO) for policy learning
- Role-conditioned decoding for spymaster and operative behaviors
- Rollout-grounded preference learning using large language models
- Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) for teacher alignment
- Knowledge distillation from the aligned teacher back into a compact policy
The objective is to improve strategic consistency and reduce catastrophic failures such as assassin selections, while maintaining efficient inference suitable for interactive play.
Game Configuration
- Game: Codenames
- Board size: 25 words
- Roles: Spymaster and Operative
- Evaluation games: 600 full episodes
- Opponents: Scripted baseline agents
Policy Architecture
Graph-Based State Encoder
- Heterogeneous graph with 30–40 nodes
- Node types include:
- Word nodes with semantic and state features
- Historical clue nodes
- Global summary node
- Node feature dimension: 35
- Encoder (a minimal sketch follows this list):
- 3 Graph Attention layers
- 6 attention heads
- Hidden size 192
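A minimal sketch of such an encoder, assuming a PyTorch Geometric setup; the class name `GraphEncoder` and the merging of node types into a single node set are illustrative simplifications, not the repository's actual API.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class GraphEncoder(nn.Module):
    """Stack of GAT layers over the board graph (node types merged for brevity)."""
    def __init__(self, node_dim=35, hidden=192, heads=6, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = node_dim
        for _ in range(num_layers):
            # Each layer splits the hidden size across attention heads and concatenates them back
            self.layers.append(GATConv(in_dim, hidden // heads, heads=heads))
            in_dim = hidden

    def forward(self, x, edge_index):
        # x: [num_nodes, node_dim] node features; edge_index: [2, num_edges]
        for conv in self.layers:
            x = torch.relu(conv(x, edge_index))
        return x  # per-node embeddings, [num_nodes, hidden]
```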
Role Conditioning
- Shared policy trunk
- Role-conditioned action decoding (sketched after this list):
- Clue generation and constraint handling for spymaster
- Guess selection and stopping decisions for operative
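A hedged sketch of role-conditioned decoding on top of the shared trunk; the head shapes, clue vocabulary size, and mean-pooling step are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RoleConditionedPolicy(nn.Module):
    def __init__(self, hidden=192, clue_vocab=1000, max_count=9, board_size=25):
        super().__init__()
        self.role_embed = nn.Embedding(2, hidden)  # 0 = spymaster, 1 = operative
        # Spymaster head scores candidate clue words and counts; operative head scores guesses or stopping
        self.spymaster_head = nn.Linear(hidden, clue_vocab + max_count)
        self.operative_head = nn.Linear(hidden, board_size + 1)

    def forward(self, node_embeddings, role_id: int):
        # Pool per-node embeddings into a board summary, then condition on the current role
        summary = node_embeddings.mean(dim=0) + self.role_embed(torch.tensor(role_id))
        head = self.spymaster_head if role_id == 0 else self.operative_head
        return head(summary)  # action logits for the active role
```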
Model Size
- Total parameters: ~6.8M
- Enables fast inference under competitive constraints
Training Pipeline
Training follows a multi-stage curriculum:
Graph PPO Pretraining
- PPO with clip ratio 0.2 (clipped objective sketched after this list)
- Discount factor Ξ³ = 0.99
- GAE Ξ» = 0.95
- Trained against scripted Codenames agents
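For reference, a minimal sketch of the clipped PPO objective and GAE with the hyperparameters listed above; tensor shapes and function names are illustrative.

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip=0.2):
    # Probability ratio between the updated and behavior policies
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * advantages
    # Pessimistic (min) objective, negated for gradient descent
    return -torch.min(unclipped, clipped).mean()

def gae(rewards, values, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation over one finished episode (lists of floats)
    advantages, running = [], 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages.append(running)
    return list(reversed(advantages))
```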
Preference Generation via Rollouts
- ~800 intermediate states sampled
- Candidate actions proposed by:
- Llama 3.1 Instruct
- Qwen 2.5 Instruct
- Each proposal evaluated using multiple stochastic rollouts
- Higher-return actions labeled as preferred (see the sketch after this list)
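A sketch of the rollout-grounded labeling step under stated assumptions: `simulate(state, action)` is a hypothetical helper that plays the candidate action and finishes the game with stochastic policies, returning the episode return.

```python
import statistics

def label_preference(state, action_a, action_b, simulate, n_rollouts=8):
    """Compare two LLM-proposed actions by mean rollout return; return (chosen, rejected)."""
    return_a = statistics.mean(simulate(state, action_a) for _ in range(n_rollouts))
    return_b = statistics.mean(simulate(state, action_b) for _ in range(n_rollouts))
    return (action_a, action_b) if return_a >= return_b else (action_b, action_a)
```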
Teacher Alignment
- Supervised fine-tuning on chosen actions
- Direct Preference Optimization (DPO) against a frozen reference model (loss sketched below)
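A minimal sketch of the DPO objective used for teacher alignment, assuming per-sequence log-probabilities from the trainable policy and the frozen reference model; the value of `beta` is illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # How much more the policy prefers each response than the frozen reference does
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Push the margin between chosen and rejected log-ratios to be positive
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```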
Policy Distillation
- Aligned teacher generates action labels from (state, role) inputs
- Graph policy trained via cross-entropy imitation (see the sketch below)
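A hedged sketch of one distillation update: the graph policy imitates teacher action labels with cross-entropy. The batch layout and the policy call signature are assumptions for illustration.

```python
import torch.nn.functional as F

def distill_step(policy, optimizer, batch):
    # batch holds graph inputs, role ids, and the teacher-chosen action index per state
    logits = policy(batch["graph"], batch["role_id"])        # [batch, num_actions] action logits
    loss = F.cross_entropy(logits, batch["teacher_action"])  # imitate the aligned teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```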
PPO Refinement
- PPO resumes using environment rewards
- Stabilizes policy after distillation
Evaluation Results
Evaluation uses 600 full games against scripted opponents.
| Agent | Win Rate | Assassin Rate |
|---|---|---|
| Graph PPO | 44.8% | 12.6% |
| PPO + Distillation | 52.9% | 6.9% |
- Distillation yields an 8.1-point absolute win-rate improvement
- Assassin-triggered losses are reduced by 45%
- Improvements arise primarily from better risk calibration, not increased guessing aggressiveness
Repository Contents
Policy Checkpoints
- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`
Teacher Models
- `sft_model/`: supervised fine-tuned teacher
- `dpo_model/`: preference-aligned teacher
Configuration and Logs
- `master_config.json`
- `evaluation_results.json`
Usage
Load Policy
```python
import torch
from policy import GraphPolicy

# Constructor arguments are elided here; see master_config.json in the repository
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_after_distill.pt"))
policy.eval()
```
Load Fine-tuned LLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT teacher; use "./dpo_model" for the preference-aligned teacher
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Inference: `prompt` should describe the current board state and role (placeholder shown)
prompt = "..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Research Context
This work targets the NeurIPS 2025 MindGames Workshop and is built around three claims:
- Language models provide useful strategic priors when grounded by rollouts
- Graph-based representations enable structured reasoning in semantic games
- Distillation transfers high-level reasoning into efficient, deployable agents
Key Innovations
- Heterogeneous Graph Representation: word, clue-history, and global-summary nodes encode the Codenames board and game history
- Rollout-Grounded Preference Learning: LLM-proposed actions are scored by stochastic rollout returns rather than model judgments alone
- Role-Conditioned Decoding: a shared trunk with separate spymaster and operative action heads
- LLM-to-RL Distillation: transferring strategic reasoning into an efficient, deployable graph policy
License
MIT License - See LICENSE file for details
Acknowledgments
- Built for NeurIPS 2025 MindGames Workshop
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU