Codenames: Graph-Based RL with LLM-Guided Preference Distillation


This repository contains trained Codenames agents developed for the NeurIPS 2025 MindGames Workshop.
The system combines a structured graph-based reinforcement learning policy with LLM-guided preference learning and distillation, targeting improved risk calibration and decision robustness.


Overview

The approach integrates:

  • Graph Neural Networks for structured board and history representation
  • Proximal Policy Optimization (PPO) for policy learning
  • Role-conditioned decoding for spymaster and operative behaviors
  • Rollout-grounded preference learning using large language models
  • Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) for teacher alignment
  • Knowledge distillation from the aligned teacher back into a compact policy

The objective is to improve strategic consistency and reduce catastrophic failures such as assassin selections, while maintaining efficient inference suitable for interactive play.


Game Configuration

  • Game: Codenames
  • Board size: 25 words
  • Roles: Spymaster and Operative
  • Evaluation games: 600 full episodes
  • Opponents: Scripted baseline agents

Policy Architecture

Graph-Based State Encoder

  • Heterogeneous graph with 30–40 nodes
  • Node types include:
    • Word nodes with semantic and state features
    • Historical clue nodes
    • Global summary node
  • Node feature dimension: 35
  • Encoder (a sketch follows this list):
    • 3 Graph Attention layers
    • 6 attention heads
    • Hidden size 192
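
A minimal sketch of an encoder with these dimensions, assuming PyTorch Geometric (`torch_geometric`); the repo's actual heterogeneous node types, edge schema, and feature extraction are not reproduced here:

```python
import torch.nn as nn
from torch_geometric.nn import GATConv


class GraphEncoderSketch(nn.Module):
    """Illustrative encoder matching the sizes above: 35-dim node features,
    3 Graph Attention layers, 6 heads, hidden size 192. Not the repo's
    actual GraphPolicy implementation."""

    def __init__(self, in_dim=35, hidden=192, heads=6, num_layers=3):
        super().__init__()
        dims = [in_dim] + [hidden] * num_layers
        # 6 heads x 32 channels, concatenated, gives the 192-dim hidden size
        self.convs = nn.ModuleList(
            [GATConv(dims[i], hidden // heads, heads=heads, concat=True)
             for i in range(num_layers)]
        )
        self.act = nn.ELU()

    def forward(self, x, edge_index):
        # x: [num_nodes, 35] node features; edge_index: [2, num_edges]
        for conv in self.convs:
            x = self.act(conv(x, edge_index))
        return x  # per-node embeddings, [num_nodes, 192]
```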

Role Conditioning

  • Shared policy trunk
  • Role-conditioned action decoding (see the sketch after this list):
    • Clue generation and constraint handling for spymaster
    • Guess selection and stopping decisions for operative
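
One way to realize role conditioning is separate decoder heads over the shared trunk's summary embedding. The sketch below is hypothetical: the clue vocabulary size and clue-count range are assumptions, not values taken from this repo:

```python
import torch.nn as nn


class RoleConditionedHeads(nn.Module):
    """Hypothetical decoder heads on top of the shared policy trunk:
    separate outputs for spymaster and operative decisions."""

    def __init__(self, d_model=192, clue_vocab=1000, board_size=25):
        super().__init__()
        self.clue_word = nn.Linear(d_model, clue_vocab)   # spymaster: clue word logits
        self.clue_count = nn.Linear(d_model, 9)           # spymaster: clue count 1..9 (assumed range)
        self.guess = nn.Linear(d_model, board_size + 1)   # operative: 25 board words + stop action

    def forward(self, summary, role):
        if role == "spymaster":
            return self.clue_word(summary), self.clue_count(summary)
        return self.guess(summary)  # last logit = stop guessing
```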

Model Size

  • Total parameters: ~6.8M
  • Enables fast inference under competitive constraints
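
As a quick sanity check on the reported size, the parameter count can be read off a loaded checkpoint (a sketch, assuming `policy` is instantiated as in the Usage section below):

```python
# Assuming `policy` has been loaded as in the Usage section:
n_params = sum(p.numel() for p in policy.parameters())
print(f"{n_params / 1e6:.2f}M parameters")  # expected ~6.8M
```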

Training Pipeline

Training follows a multi-stage curriculum:

  1. Graph PPO Pretraining

    • PPO with clip ratio 0.2
    • Discount factor γ = 0.99
    • GAE λ = 0.95 (see the advantage sketch after this list)
    • Trained against scripted Codenames agents
  2. Preference Generation via Rollouts

    • ~800 intermediate states sampled
    • Candidate actions proposed by:
      • Llama 3.1 Instruct
      • Qwen 2.5 Instruct
    • Each proposal evaluated using multiple stochastic rollouts
    • Higher-return actions labeled preferred
  3. Teacher Alignment

    • Supervised fine-tuning on chosen actions
    • Direct Preference Optimization against a frozen reference model (see the loss sketch after this list)
  4. Policy Distillation

    • The aligned teacher labels each sampled (state, role) pair with an action
    • Graph policy trained via cross-entropy imitation
  5. PPO Refinement

    • PPO resumes using environment rewards
    • Stabilizes policy after distillation
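
For reference, the stage-1 advantage computation implied by the listed hyperparameters (γ = 0.99, λ = 0.95); a minimal sketch, with the function name and buffer layout assumed rather than taken from this repo:

```python
def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation with the γ and λ above.
    `values` carries one bootstrap entry beyond the last step,
    i.e. len(values) == len(rewards) + 1."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0  # cut the trace at episode ends
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        running = delta + gamma * lam * mask * running
        adv[t] = running
    return adv
```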

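Similarly, a sketch of the stage-3 DPO objective against the frozen reference model; the argument layout and `beta` value are assumptions, not from this repo:

```python
import torch.nn.functional as F


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO objective (sketch). Arguments are summed log-probabilities of the
    chosen/rejected actions under the trained policy (pi_*) and the frozen
    reference model (ref_*)."""
    margins = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margins).mean()
```
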
Evaluation Results

Evaluation uses 600 full games against scripted opponents.

| Agent | Win Rate | Assassin Rate |
| --- | --- | --- |
| Graph PPO | 44.8% | 12.6% |
| PPO + Distillation | 52.9% | 6.9% |

  • Distillation yields an 8.1-point absolute win-rate improvement (44.8% → 52.9%)
  • Assassin-triggered losses fall by 45% in relative terms (12.6% → 6.9%)
  • Gains come primarily from better risk calibration rather than more aggressive guessing

Repository Contents

Policy Checkpoints

  • policy_models/policy_after_ppo.pt
  • policy_models/policy_after_distill.pt

Teacher Models

  • sft_model/ – supervised fine-tuned teacher
  • dpo_model/ – preference-aligned teacher

Configuration and Logs

  • master_config.json
  • evaluation_results.json
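
Both files are plain JSON and can be inspected directly; a minimal sketch (the key schema is not documented here, so check the output before relying on specific fields):

```python
import json

with open("master_config.json") as f:
    config = json.load(f)
print(sorted(config))  # inspect available keys before wiring them into GraphPolicy
```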

Usage

Load Policy

```python
import torch
from policy import GraphPolicy

# Constructor arguments are elided here; see master_config.json for the
# architecture settings (GAT layers, heads, hidden size).
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_after_distill.pt"))
policy.eval()
```

Load Fine-Tuned LLM

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT teacher (swap in "./dpo_model" for the preference-aligned teacher)
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Run inference; the prompt below is illustrative, not the repo's actual template
prompt = "You are the spymaster. Board: ..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

🎓 Research Context

This work targets the NeurIPS 2025 MindGames Workshop and rests on three observations:

  • Language models provide useful strategic priors when grounded by rollouts
  • Graph-based representations enable structured reasoning in semantic games
  • Distillation transfers high-level reasoning into efficient, deployable agents

Key Innovations

  1. Heterogeneous Graph Representation: A graph structure for Codenames states, combining word, clue, and global summary nodes
  2. Rollout-Grounded Preference Learning: Labeling LLM action proposals by their simulated returns
  3. Multi-scale Representation: Word-level, clue-level, and game-level embeddings
  4. LLM-to-RL Distillation: Transferring strategic reasoning to efficient policies

📄 License

MIT License - See LICENSE file for details

πŸ™ Acknowledgments

  • Built for NeurIPS 2025 MindGames Workshop
  • Uses PyTorch, HuggingFace Transformers, and PEFT
  • Training infrastructure: NVIDIA H200 GPU
