---
license: mit
datasets:
- PrimeIntellect/Reverse-Text-RL
language:
- en
base_model:
- Qwen/Qwen3-0.6B
- PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT
---

# Reverse Text Model Qwen3-0.6B

A simple model that was RL fine-tuned for 20 steps/epochs after SFT to reverse text, using [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl/) (RL training) and [reverse-text](https://github.com/PrimeIntellect-ai/prime-environments/tree/main/environments/reverse_text) (RL environment). See the improvement in results below.

## Comparison with SFT (base) model

The reward (correctness score) distribution has improved for the RLFT model across all rollouts.

![](comparison.png)

At an instance level, comparing the best scores across rollouts, we see a mean improvement of 3.73%, with a maximum improvement of ~30% and a maximum reduction of ~3%.

![](instance-level.png)

## Example Prompt & Reward

**Task:** `reverse-text`

**Prompt:**

- **System:** “Reverse the text character-by-character. Put your answer in `` tags.”
- **User:** “The community in Bruck was merged into it”

**Expected Completion:**

```text
.ti otni degrem saw kcuBr ni ytinummoc ehT
```

**Expected Reward:** 0.963855421686747

Note: The reward is based on the longest common subsequence.
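
As a rough illustration, the sketch below computes an LCS-based reward. The function names are hypothetical and the exact normalization used by the reverse_text environment is an assumption, but the Dice-style formula `2 * LCS / (len(completion) + len(target))` does reproduce the example reward above.

```python
def lcs_length(a: str, b: str) -> int:
    # Classic O(len(a) * len(b)) dynamic-programming longest-common-subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def reverse_text_reward(completion: str, target: str) -> float:
    # Assumed Dice-style normalization: 2 * LCS / (|completion| + |target|).
    # This matches the example reward, but may differ from the environment's exact scorer.
    if not completion and not target:
        return 1.0
    return 2 * lcs_length(completion, target) / (len(completion) + len(target))


if __name__ == "__main__":
    prompt = "The community in Bruck was merged into it"
    target = prompt[::-1]  # "ti otni degrem saw kcurB ni ytinummoc ehT"
    completion = ".ti otni degrem saw kcuBr ni ytinummoc ehT"
    print(reverse_text_reward(completion, target))  # ~0.963855421686747
```

With this normalization, the extra leading period and the transposed “Br”/“rB” in the completion cost a few points of reward, which is why the example scores ~0.96 rather than 1.0.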