Analysis: Why Accuracy Drops and How to Fix
Issue 1: Accuracy Drops at the End
Root Causes Found:
Evaluation uses NEW tasks each time (lines 171-175 in compare_strategies.py)
- `general_accuracy = student.evaluate([generator.generate_task(...) for ...])` creates new tasks every iteration → variance and inconsistency
- Should use a FIXED eval set instead (see the sketch below)
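A minimal sketch of that fix, assuming `generate_task` accepts a topic and an RNG and that `student.evaluate` takes a task list (both signatures are guesses based on the snippet above):

```python
import random

def build_fixed_eval_set(generator, topics, n_per_topic=20, seed=42):
    """Build the evaluation set ONCE, with a fixed seed, so every
    iteration and every strategy is scored on identical tasks."""
    rng = random.Random(seed)
    return [generator.generate_task(topic=topic, rng=rng)
            for topic in topics
            for _ in range(n_per_topic)]

# Before the training loop:
#     eval_tasks = build_fixed_eval_set(generator, topics)
# Inside the loop, replace the per-iteration regeneration with:
#     general_accuracy = student.evaluate(eval_tasks)
```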
Forgetting rate too aggressive for 500 iterations
- Forgetting rate: 0.05
- After 500 iterations (500 time units): retention = exp(-0.05 * 500) ≈ 0.0
- All skills forgotten by the end!
- Retention drops to near-zero after ~50-100 time units
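Quick check of the arithmetic (plain exponential decay with the rate quoted above):

```python
import math

forgetting_rate = 0.05
for t in (10, 50, 100, 500):
    print(f"t = {t:>3}: retention = {math.exp(-forgetting_rate * t):.4f}")
# t =  10: retention = 0.6065
# t =  50: retention = 0.0821
# t = 100: retention = 0.0067
# t = 500: retention = 0.0000
```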
Evaluation timing confusion
- Currently: Learn → Evaluate → Advance time
- Should be clearer about when evaluation happens relative to forgetting
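One way to make the ordering explicit; `select_task`, `train_on`, and `advance_time` are hypothetical names standing in for whatever the real loop calls:

```python
for iteration in range(num_iterations):
    task = teacher.select_task()             # teacher picks the next task
    student.train_on(task)                   # 1. Learn
    accuracy = student.evaluate(eval_tasks)  # 2. Evaluate on the fixed set,
                                             #    BEFORE forgetting applies
    student.advance_time(1.0)                # 3. Only then advance the clock
```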
Issue 2: Accuracy Calculation Method
Current Method:
- Uses `student.evaluate(eval_tasks)`, which:
  - Calls `answer()` for each task (stochastic, uses randomness)
  - Accounts for forgetting via `_get_effective_skill()`
  - Returns the fraction of correct answers
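Roughly what that amounts to (a guess at the internals; only the method names `evaluate`, `answer`, and `_get_effective_skill` come from the codebase):

```python
def evaluate(self, tasks):
    """Sampled accuracy: each answer() draws from a success probability
    derived from _get_effective_skill(), so two calls on the SAME tasks
    can return different numbers."""
    correct = sum(1 for task in tasks if self.answer(task))
    return correct / len(tasks)
```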
Problems:
- Stochastic variance: Random sampling introduces noise
- Eval tasks regenerated: Different tasks each time = inconsistent
- Small eval set: Only 10-15 tasks = high variance
Better Methods:
- Use FIXED eval set generated once at start
- Use expected accuracy instead of sampled accuracy (less variance; see the sketch after this list)
  - Expected acc = mean(prob_correct) over all tasks
- Larger eval set (50-100 tasks) for stability
- Separate eval timing: Evaluate BEFORE time advance
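A sketch of the expected-accuracy variant from item 2, assuming a `prob_correct()` helper can be derived from `_get_effective_skill()` (hypothetical; not in the codebase):

```python
def expected_accuracy(student, tasks):
    """Deterministic alternative to sampled evaluation: average the
    per-task success probability instead of sampling answers.
    Same mean as evaluate(), zero sampling variance."""
    return sum(student.prob_correct(task) for task in tasks) / len(tasks)
```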
Issue 3: Mock vs Real Components
Current Mock Components:
Mock Student:
- ✅ Captures learning and forgetting
- ✅ Per-topic skill tracking
- ✅ Realistic Ebbinghaus curve
- ⚠️ Simplified learning model (linear skill increase)
- ⚠️ Stochastic, but not as complex as real PPO
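A condensed sketch of the forgetting mechanic described above; everything except the name `_get_effective_skill` is a guess at the mock's internals:

```python
import math

class MockStudent:
    def __init__(self, forgetting_rate=0.05):
        self.skill = {}            # topic -> raw skill in [0, 1]
        self.last_practiced = {}   # topic -> time of last practice
        self.time = 0.0
        self.forgetting_rate = forgetting_rate

    def _get_effective_skill(self, topic):
        """Ebbinghaus-style decay: raw skill scaled by exp(-rate * dt),
        where dt is the time since the topic was last practiced."""
        dt = self.time - self.last_practiced.get(topic, self.time)
        return self.skill.get(topic, 0.0) * math.exp(-self.forgetting_rate * dt)
```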
Mock Task Generator:
- ✅ Simple template-based tasks
- ✅ Multiple topics and difficulties
- ⚠️ Fixed templates (not procedural)
- ⚠️ Limited diversity
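For contrast, the template-based approach boils down to something like this (illustrative only; not the project's actual templates):

```python
import random

TEMPLATES = {
    "addition": "What is {a} + {b}?",
    "comparison": "Which is larger: {a} or {b}?",
}

def generate_task(topic, difficulty=1, rng=random):
    """Fixed templates with random number fills: the numbers vary,
    but the task STRUCTURE never does -- hence the limited diversity."""
    hi = 10 * difficulty
    a, b = rng.randint(1, hi), rng.randint(1, hi)
    return {"topic": topic, "prompt": TEMPLATES[topic].format(a=a, b=b)}
```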
Real Components (in MentorFlow):
- Student: Full PPO agent with neural network
- Task Generator: Procedural generation with 15 task families
Will Real Components Be Better?
YES, likely:
- Real PPO student can learn more complex patterns
- Procedural task generator provides more diverse tasks
- Better generalization to unseen tasks
- More realistic learning curves
BUT:
- Real components are slower to train
- Harder to debug and verify
- The teacher's algorithm (UCB) should still work either way (see the sketch below)
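The reason the teacher should transfer: UCB only consumes per-topic reward statistics, not anything about the student's internals. A minimal UCB1-style selection rule (the standard algorithm, not the project's exact code):

```python
import math

def ucb_select(mean_reward, pull_count, total_pulls, c=1.4):
    """Pick the topic with the best mean reward + exploration bonus."""
    def score(topic):
        n = pull_count[topic]
        if n == 0:
            return float("inf")  # try every topic at least once
        return mean_reward[topic] + c * math.sqrt(math.log(total_pulls) / n)
    return max(mean_reward, key=score)
```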
Recommended Fixes
- Fix evaluation to use FIXED eval sets
- Reduce the forgetting rate, or reset time periodically (a derivation sketch follows this list)
- Use expected accuracy for more stable measurements
- Add evaluation BEFORE time advance option
- Document evaluation methodology clearly
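For fix 2, the rate can be derived rather than hand-tuned by solving retention = exp(-rate * horizon) for the rate; the target and horizon below are illustrative numbers, not project settings:

```python
import math

def forgetting_rate_for(target_retention, horizon):
    """Rate at which retention decays to target_retention after
    `horizon` unpracticed time units."""
    return -math.log(target_retention) / horizon

# Keep ~50% retention across the full 500-iteration run:
print(forgetting_rate_for(0.5, 500))  # ~0.00139, vs. the current 0.05
```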